
Publications


2025
The impact of mother's mental health, infant characteristics and war trauma on the acoustic features of infant-directed singing

Raija-Leena Punamäki, Safwat Y. Diab, Konstantinos Drosos, and Samir R. Qouta, “The impact of mother’s mental health, infant characteristics and war trauma on the acoustic features of infant‐directed singing,” Infant Mental Health Journal, 2025

Infant-directed singing (IDSi) is a natural means of dyadic communication that contributes to children's mental health by enhancing emotion expression, close relationships, exploration and learning. It is therefore important to learn about the factors that shape IDSi. This study modeled how mother-related (mental health), infant-related (emotional responses and health status) and environment-related (war trauma) factors influence acoustic IDSi features, such as pitch (F0) variability, amplitude and vibration, and the shapes and movements of the F0 contour. The participants were 236 mothers and infants from Gaza, the Occupied Palestinian Territories. The mothers reported their own mental health problems, their infants' emotionality and regulation skills, and their infants' illnesses and disorders (complemented by pediatric checkups), as well as traumatic war events, which were also photo-documented. The results showed that the mothers' mental health problems and infants' poor health status were associated with IDSi characterized by narrow and lifeless amplitude and vibration, and poor health was also associated with limited and rigid shapes and movements of the F0 contours. Traumatic war events were associated with flat and narrow F0 variability and with monotonous and invariable resonance and rhythm of the IDSi formants. The infants' emotional responses did not impact IDSi. The potential of protomusical singing to help war-affected dyads is discussed.
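
The acoustic IDSi features named above (F0 variability, amplitude and vibration) can be approximated from a recording with standard audio tooling. Below is a minimal, hypothetical sketch using librosa's pYIN pitch tracker and frame-wise RMS energy; the synthetic test signal and the F0 search range are illustrative assumptions, not the study's actual analysis pipeline.

```python
# Hypothetical sketch: estimating pitch (F0) variability and amplitude
# statistics, the kind of acoustic IDSi features the study models, with
# librosa. A synthetic glissando stands in for a real recording, which one
# would instead read with librosa.load("<recording>.wav").
import numpy as np
import librosa

sr = 16000
y = librosa.chirp(fmin=220, fmax=440, sr=sr, duration=3.0)   # rising "sung" tone

# F0 track via the pYIN algorithm; unvoiced frames are returned as NaN.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
)
f0 = f0[~np.isnan(f0)]

# Summary statistics in the spirit of "F0 variability".
print(f"F0 mean {np.mean(f0):.1f} Hz, std {np.std(f0):.1f} Hz, range {np.ptp(f0):.1f} Hz")

# Frame-wise RMS energy as a rough proxy for vocal amplitude.
rms = librosa.feature.rms(y=y)[0]
print(f"RMS mean {rms.mean():.4f}, RMS std {rms.std():.4f}")
```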

2024
Adversarial Representation Learning for Robust Privacy Preservation in Audio

S. Gharib, M. Tran, D. Luong, K. Drossos and T. Virtanen, "Adversarial Representation Learning for Robust Privacy Preservation in Audio," in IEEE Open Journal of Signal Processing, vol. 5, pp. 294-302, 2024

Sound event detection systems are widely used in various applications such as surveillance and environmental monitoring where data is automatically collected, processed, and sent to a cloud for sound recognition. However, this process may inadvertently reveal sensitive information about users or their surroundings, hence raising privacy concerns. In this study, we propose a novel adversarial training method for learning representations of audio recordings that effectively prevents the detection of speech activity from the latent features of the recordings. The proposed method trains a model to generate invariant latent representations of speech-containing audio recordings that cannot be distinguished from non-speech recordings by a speech classifier. The novelty of our work is in the optimization algorithm, where the speech classifier's weights are regularly replaced with the weights of classifiers trained in a supervised manner. This increases the discrimination power of the speech classifier constantly during the adversarial training, motivating the model to generate latent representations in which speech is not distinguishable, even using new speech classifiers trained outside the adversarial training loop. The proposed method is evaluated against a baseline approach with no privacy measures and a prior adversarial training method, demonstrating a significant reduction in privacy violations compared to the baseline approach. Additionally, we show that the prior adversarial method is practically ineffective for this purpose.
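
As a rough illustration of the training scheme described above, the PyTorch sketch below alternates between (1) updating a speech/non-speech classifier on the latent representations, (2) updating the encoder and task head to solve the task while making speech undetectable, and (3) periodically replacing the adversary with a classifier re-trained in a supervised manner on the current latents. All architectures, loss weights and the synthetic data are illustrative assumptions, not the published configuration.

```python
# Minimal PyTorch sketch of the scheme described above: an encoder learns
# latents from which a speech classifier cannot detect speech, and that
# classifier is periodically replaced by one re-trained in a supervised
# manner on the current latents. Everything here (architectures, loss
# weights, synthetic data) is an illustrative assumption.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(512, 128)                     # stand-in audio features
y_sed = torch.randint(0, 10, (512,))          # sound event labels
y_speech = torch.randint(0, 2, (512,))        # speech present / absent

encoder = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))
sed_head = nn.Linear(32, 10)                  # downstream sound event head
adversary = nn.Linear(32, 2)                  # speech / non-speech classifier

opt_main = torch.optim.Adam(list(encoder.parameters()) + list(sed_head.parameters()), lr=1e-3)
opt_adv = torch.optim.Adam(adversary.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()

def fresh_supervised_classifier(steps=200):
    """Train a new speech classifier, supervised, on the frozen encoder output."""
    clf = nn.Linear(32, 2)
    opt = torch.optim.Adam(clf.parameters(), lr=1e-2)
    with torch.no_grad():
        z = encoder(X)
    for _ in range(steps):
        opt.zero_grad()
        ce(clf(z), y_speech).backward()
        opt.step()
    return clf

for step in range(1000):
    idx = torch.randint(0, 512, (64,))
    z = encoder(X[idx])

    # (1) adversary learns to detect speech from the (detached) latents
    opt_adv.zero_grad()
    ce(adversary(z.detach()), y_speech[idx]).backward()
    opt_adv.step()

    # (2) encoder + task head: solve the task while fooling the adversary
    opt_main.zero_grad()
    task_loss = ce(sed_head(z), y_sed[idx])
    fool_loss = -ce(adversary(z), y_speech[idx])   # one common adversarial term
    (task_loss + 0.1 * fool_loss).backward()
    opt_main.step()

    # (3) periodically swap in a freshly supervised speech classifier
    if step % 250 == 249:
        adversary.load_state_dict(fresh_supervised_classifier().state_dict())
```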

The role of acoustic features of maternal infant-directed singing in enhancing infant sensorimotor, language and socioemotional development

R.-L. Punamäki, S. Y. Diab, K. Drosos, S. R. Qouta, and M. Vänskä, “The role of acoustic features of maternal infant-directed singing in enhancing infant sensorimotor, language and socioemotional development,” Infant Behavior and Development, vol. 74, p. 101908, 2024

The quality of infant-directed speech (IDS) and infant-directed singing (IDSi) is considered vital to children, but empirical studies on the protomusical qualities of IDSi that influence infant development are rare. The current prospective study examines the role of IDSi acoustic features, such as pitch variability, shape and movement, vocal amplitude vibration, timbre, and resonance, in infant sensorimotor, language, and socioemotional development at six and 18 months. The sample consists of 236 Palestinian mothers from the Gaza Strip singing a song of their own choice to their six-month-olds. Maternal IDSi was recorded and analyzed with the openSMILE tool to depict the main acoustic features of pitch frequencies, variations, and contours, vocal intensity, resonance formants, and power. The results are based on 219 completed maternal IDSi recordings. Mothers reported on their infants' sensorimotor, language-vocalization, and socioemotional skills at six months, and psychologists tested these skills with the Bayley Scales of Infant Development at 18 months. Results show that maternal IDSi characterized by wide pitch variability and rich and high vocal amplitude and vibration was associated with infants' optimal sensorimotor, language-vocalization, and socioemotional skills at six months, and rich and high vocal amplitude and vibration predicted these optimal developmental skills also at 18 months. High resonance and rhythmicity formants were associated with optimal language and vocalization skills at six months. To conclude, IDSi is considered important in enhancing the wellbeing of newborns and at-risk infants, and the current findings suggest that favorable acoustic singing qualities are crucial for optimal multidomain development across infancy.
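
The abstract names openSMILE as the feature extractor; the sketch below shows one plausible way to obtain comparable summary descriptors (pitch, loudness and formant functionals) with the openSMILE Python wrapper and its eGeMAPS feature set. The feature set, the synthetic test signal and the column filter are assumptions for illustration; the study's exact configuration is not specified here.

```python
# Hypothetical extraction of pitch-, loudness- and formant-related functionals
# with the openSMILE Python wrapper (eGeMAPS feature set). A synthetic tone
# stands in for a real IDSi recording; in practice one would call
# smile.process_file("<recording>.wav") instead.
import numpy as np
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)

sr = 16000
t = np.linspace(0, 3, 3 * sr, endpoint=False)
signal = 0.1 * np.sin(2 * np.pi * 220 * t)

feats = smile.process_signal(signal, sr)   # one row of summary functionals

# Inspect pitch (F0), loudness and first/second formant summaries.
cols = [c for c in feats.columns if any(k in c for k in ("F0", "loudness", "F1", "F2"))]
print(feats[cols].T)
```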

2023
Development of a speech emotion recognizer for large-scale child-centered audio recordings from a hospital environment

Einari Vaaras, Sari Ahlqvist-Björkroth, Konstantinos Drossos, Liisa Lehtonen, and Okko Räsänen, “Development of a speech emotion recognizer for large-scale child-centered audio recordings from a hospital environment,” Speech Communication, vol. 148, pp. 9-22, 2023

In order to study how early emotional experiences shape infant development, one approach is to analyze the emotional content of speech heard by infants, as captured by child-centered daylong recordings and analyzed by automatic speech emotion recognition (SER) systems. However, since large-scale daylong audio is initially unannotated and differs from typical speech corpora collected in controlled environments, there are no existing in-domain SER systems for the task. Based on the existing literature, it is also unclear what the best approach is for deploying an SER system in a new domain. Consequently, in this study, we investigated alternative strategies for deploying an SER system for large-scale child-centered audio recordings from a neonatal hospital environment, comparing cross-corpus generalization, active learning (AL), and domain adaptation (DA) methods in the process. We first conducted simulations with existing emotion-labeled speech corpora to find the best strategy for SER system deployment. We then tested how the findings generalize to our new, initially unannotated dataset. As a result, we found that the studied AL method provided overall the most consistent results, being less dependent on the specifics of the training corpora or speech features than the alternative methods. However, in situations without the possibility to annotate data, unsupervised DA proved to be the best approach. We also observed that deploying an SER system for real-world daylong child-centered audio recordings achieved a performance level comparable to those reported in the literature, and that the amount of human effort required for the system deployment was overall relatively modest.
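
For readers unfamiliar with active learning in this setting, the sketch below shows a generic uncertainty-sampling loop: train on a small labeled seed set, score the unlabeled pool by predictive entropy, and query the most uncertain clips for annotation. This is a common AL baseline shown only for illustration; the paper evaluates its own AL method, and the features and labels below are synthetic stand-ins.

```python
# Generic uncertainty-sampling active learning loop for a speech emotion
# recognizer. Features, labels and the classifier are synthetic/illustrative;
# this is not the specific AL method studied in the paper.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(2000, 88))          # e.g. eGeMAPS-like functionals
y_pool = rng.integers(0, 3, size=2000)        # negative / neutral / positive

labeled = list(rng.choice(len(X_pool), size=50, replace=False))
unlabeled = [i for i in range(len(X_pool)) if i not in set(labeled)]

for round_ in range(5):
    clf = LogisticRegression(max_iter=1000).fit(X_pool[labeled], y_pool[labeled])
    proba = clf.predict_proba(X_pool[unlabeled])
    entropy = -(proba * np.log(proba + 1e-12)).sum(axis=1)

    # Query the 50 most uncertain clips and "annotate" them.
    query = np.argsort(entropy)[-50:]
    newly = [unlabeled[i] for i in query]
    labeled += newly
    unlabeled = [i for i in unlabeled if i not in set(newly)]
    print(f"round {round_}: {len(labeled)} labeled clips")
```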

2021
Design Recommendations for a Collaborative Game of Bird Call Recognition Based on Internet of Sound Practices

E. Rovithis, N. Moustakas, K. Vogklis, K. Drossos, and A. Floros, "Design Recommendations for a Collaborative Game of Bird Call Recognition Based on Internet of Sound Practices," in Journal of the Audio Engineering Society, vol. 69, no. 12, pp. 956-966, 2021

Citizen Science aims to engage people in research activities on important issues related to their well-being. Smart Cities aim to provide them with services that improve the quality of their life. Both concepts have seen significant growth in recent years and can be further enhanced by combining their purposes with Internet of Things technologies that allow for dynamic and large-scale communication and interaction. However, exciting and retaining the interest of participants is a key factor for such initiatives. In this paper we suggest that engagement in Citizen Science projects applied to Smart City infrastructure can be enhanced through contextual and structural game elements realized through augmented audio interactive mechanisms. Our interdisciplinary framework is described through the paradigm of a collaborative bird call recognition game, in which users collect and submit audio data that are then classified and used for augmenting physical space. We discuss the Playful Learning, Internet of Audio Things, and Bird Monitoring principles that shaped the design of our paradigm and analyze the design issues of its potential technical implementation.

Enriched Music Representations with Multiple Cross-modal Contrastive Learning

A. Ferraro, X. Favory, K. Drossos, Y. Kim and D. Bogdanov, "Enriched Music Representations with Multiple Cross-modal Contrastive Learning," in IEEE Signal Processing Letters, doi: 10.1109/LSP.2021.3071082.

Modeling various aspects that make a music piece unique is a challenging task, requiring the combination of multiple sources of information. Deep learning is commonly used to obtain representations using various sources of information, such as the audio, interactions between users and songs, or associated genre metadata. Recently, contrastive learning has led to representations that generalize better compared to traditional supervised methods. In this paper, we present a novel approach that combines multiple types of information related to music using cross-modal contrastive learning, allowing us to learn an audio feature from heterogeneous data simultaneously. We align the latent representations obtained from playlist-track interactions, genre metadata, and the tracks' audio, by maximizing the agreement between these modality representations using a contrastive loss. We evaluate our approach in three tasks, namely, genre classification, playlist continuation and automatic tagging. We compare the performances with a baseline audio-based CNN trained to predict these modalities. We also study the importance of including multiple sources of information when training our embedding model. The results suggest that the proposed method outperforms the baseline in all three downstream tasks and achieves comparable performance to the state-of-the-art.
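
The core mechanism, aligning an audio embedding with embeddings from other modalities via a contrastive loss, can be sketched compactly. The PyTorch example below uses an InfoNCE-style loss to pull an audio encoder toward paired playlist and genre embeddings; the encoders, dimensions and the random batch are illustrative assumptions rather than the published architecture.

```python
# Minimal sketch of cross-modal contrastive alignment: matching rows across
# modalities are positives, all other rows in the batch are negatives.
# All modules, dimensions and data here are illustrative stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

def info_nce(a, b, temperature=0.1):
    """InfoNCE-style loss between two batches of paired embeddings."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0))
    return F.cross_entropy(logits, targets)

audio_enc = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 64))
playlist_proj = nn.Linear(300, 64)     # projects pre-computed playlist embeddings
genre_proj = nn.Linear(50, 64)         # projects genre metadata embeddings

opt = torch.optim.Adam(
    list(audio_enc.parameters()) + list(playlist_proj.parameters()) + list(genre_proj.parameters()),
    lr=1e-3,
)

# One illustrative training step on a random batch of paired data.
audio_x, playlist_x, genre_x = torch.randn(32, 128), torch.randn(32, 300), torch.randn(32, 50)
z_audio = audio_enc(audio_x)
loss = info_nce(z_audio, playlist_proj(playlist_x)) + info_nce(z_audio, genre_proj(genre_x))
opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```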

2019
Examining the Mapping Functions of Denoising Autoencoders in Music Source Separation

Stylianos Ioannis Mimilakis, Konstantinos Drossos, Estefanía Cano, and Gerald Schuller, “Examining the Mapping Functions of Denoising Autoencoders in Music Source Separation,” in IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 28, pp. 262-278, 2019.

The goal of this work is to investigate what singing voice separation approaches based on neural networks learn from the data. We examine the mapping functions of neural networks based on the denoising autoencoder (DAE) model that are conditioned on the mixture magnitude spectra. To approximate the mapping functions, we propose an algorithm inspired by knowledge distillation, denoted the neural couplings algorithm (NCA). The NCA yields a matrix that expresses the mapping of the mixture to the target source magnitude information. Using the NCA, we examine the mapping functions of three fundamental DAE-based models in music source separation: one with a single-layer encoder and decoder, one with a multi-layer encoder and a single-layer decoder, and one using skip-filtering connections (SF) with single-layer encoding and decoding. We first train these models with realistic data to estimate the singing voice magnitude spectra from the corresponding mixture. We then use the optimized models and test spectral data as input to the NCA. Our experimental findings show that approaches based on the DAE model learn scalar filtering operators, exhibiting a predominantly diagonal structure in their corresponding mapping functions and limiting the exploitation of the inter-frequency structure of music data. In contrast, skip-filtering connections are shown to assist the DAE model in learning filtering operators that exploit richer inter-frequency structures.
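
A crude way to probe the "diagonal structure" finding on a toy model is to compute the Jacobian of a DAE's output with respect to its input spectrum and measure how much of its mass lies on the diagonal. The sketch below does that on an untrained toy network; it is not the paper's neural couplings algorithm, and the model and data are synthetic.

```python
# Crude numerical probe: how "scalar filtering"-like is the input-output
# mapping of a toy DAE on one magnitude-spectrum frame? This is NOT the
# neural couplings algorithm; model and data are synthetic stand-ins.
import torch
import torch.nn as nn

F_BINS = 64
dae = nn.Sequential(nn.Linear(F_BINS, F_BINS), nn.ReLU(),
                    nn.Linear(F_BINS, F_BINS), nn.ReLU())

mixture = torch.rand(F_BINS)                            # one spectrum frame
J = torch.autograd.functional.jacobian(dae, mixture)    # shape (F_BINS, F_BINS)

diag_mass = J.diag().abs().sum()
total_mass = J.abs().sum()
print(f"fraction of |Jacobian| mass on the diagonal: {diag_mass / total_mass:.3f}")
```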

2016
On the Impact of the Semantic Content of Sound Events in Emotion Elicitation

Konstantinos Drossos, Maximos Kaliakatsos-Papakostas, Andreas Floros, and Tuomas Virtanen, “On the Impact of the Semantic Content of Sound Events in Emotion Elicitation,” Journal of the Audio Engineering Society, Vol. 64, No. 7/8, pp. 525–532, 2016

Sound events are proven to have an impact on the emotions of the listener. Recent works in the field of emotion recognition from sound events show, on one hand, the possibility of automatic emotional information retrieval from sound events and, on the other hand, the need for a deeper understanding of the significance of the sound events’ semantic content for the listener’s affective state. In this work we present a first, to the best of the authors’ knowledge, investigation of the relation between the semantic similarity of sound events and the elicited emotion. To that end we use two emotionally annotated sound datasets and the Wu-Palmer semantic similarity measure according to WordNet. Results indicate that the semantic content seems to have a limited role in the formation of the listener’s affective state. On the contrary, when the semantic content is matched to specific areas of the Arousal-Valence space, or when the source’s spatial position is also taken into account, the effect of the semantic content appears more important, especially for cases with medium-to-low valence and medium-to-high arousal, or when the sound source is at lateral positions relative to the listener’s head, respectively.
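
The Wu-Palmer measure used in the paper is available off the shelf in NLTK's WordNet interface. The snippet below computes it for an arbitrary pair of sound-event labels; the word pair is an example and not taken from the study's datasets.

```python
# Wu-Palmer semantic similarity between two sound-event labels via NLTK's
# WordNet interface. The word pair is an arbitrary illustrative example.
import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

dog = wn.synsets("dog")[0]          # dog.n.01
thunder = wn.synsets("thunder")[0]  # thunder.n.01
print(dog.wup_similarity(thunder))  # value in (0, 1]; higher = more similar
```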

2015
Evaluating the Impact of Sound Events’ Rhythm Characteristics to Listener’s Valence

Konstantinos Drossos, Andreas Floros, and Katia Kermanidis, “Evaluating the Impact of Sound Events’ Rhythm Characteristics to Listener’s Valence”, Journal of the Audio Engineering Society, Vol. 63, No. 3, pp. 139–153, 2015

While modern sound researchers generally focus on speech and music, mammalian hearing arose from the need to sense those events in the environment that produced sound waves. Such unorganized sound stimuli, referred to as Sound Events (SEs), can also produce an affective and emotional response. In this research, the investigators explore valence recognition of SEs utilizing rhythm-related acoustic cues. A well-known data set with emotionally annotated SEs was employed; various rhythm-related attributes were then extracted and several machine-learning experiments were conducted. The results portray that the rhythm of a SE can affect the listener’s valence to an extent and, combined with previous works on SEs, could lead to a comprehensive recognition of the rhythm’s effect on the emotional state of the listener.
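
The overall pipeline, extracting rhythm-related descriptors per sound event and feeding them to a machine-learning model that predicts valence, can be sketched as follows. The descriptors (tempo, onset rate, beat rate), the SVM classifier, and the synthetic clips and labels are illustrative assumptions, not the study's exact feature set or learners.

```python
# Illustrative pipeline: rhythm-related descriptors per sound event, then a
# classifier predicting (binarized) valence. Clips and labels are synthetic.
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def rhythm_features(y, sr):
    """Tempo, onset rate and beat rate for one sound event."""
    tempo, beats = librosa.beat.beat_track(y=y, sr=sr)
    onsets = librosa.onset.onset_detect(y=y, sr=sr)
    dur = len(y) / sr
    return [float(np.atleast_1d(tempo)[0]), len(onsets) / dur, len(beats) / dur]

# Synthetic stand-ins for emotionally annotated sound events: noise bursts
# with different pulse rates, and random low/high valence labels.
rng = np.random.default_rng(0)
sr, clips = 22050, []
for _ in range(40):
    y = 0.05 * rng.normal(size=sr * 2)
    period = int(rng.integers(sr // 8, sr // 2))
    y[::period] += 1.0                          # crude periodic "beats"
    clips.append(y)
labels = rng.integers(0, 2, size=len(clips))    # low / high valence (random here)

X = np.array([rhythm_features(y, sr) for y in clips])
print(cross_val_score(SVC(), X, labels, cv=5).mean())
```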

Investigating the Impact of Sound Angular Position on the Listener Affective State

Konstantinos Drossos, Andreas Floros, Andreas Giannakoulopoulos, and Nikolaos Kanellopoulos, “Investigating the Impact of Sound Angular Position on the Listener Affective State”, IEEE Transactions on Affective Computing, Vol. 6, No. 1, pp. 27–42, 2015

Emotion recognition from sound signals represents an emerging field of recent research. Although many existing works focus on emotion recognition from music, there seems to be a relative scarcity of research on emotion recognition from general sounds. One of the key characteristics of sound events is the sound source spatial position, i.e. the location of the source relatively to the acoustic receiver. Existing studies that aim to investigate the relation of the latter source placement and the elicited emotions are limited to distance, front and back spatial localization and/or specific emotional categories. In this paper we analytically investigate the effect of the source angular position on the listener’s emotional state, modeled in the well-established valence/arousal affective space. Towards this aim, we have developed an annotated sound events dataset using binaural processed versions of the available International Affective Digitized Sound (IADS) sound events library. All subjective affective annotations were obtained using the Self Assessment Manikin (SAM) approach. Preliminary results obtained by processing these annotation scores are likely to indicate a systematic change in the listener affective state as the sound source angular position changes. This trend is more obvious when the sound source is located outside of the visible field of the listener.
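
To make the notion of "angular position" concrete, the sketch below places a mono sound at a given azimuth using only interaural time and level differences (a Woodworth-style ITD plus a crude ILD). The study itself used binaural-processed versions of the IADS sounds; this simplified two-cue model and all its constants are illustrative assumptions.

```python
# Simplified ITD/ILD positioning of a mono sound at a given azimuth.
# A crude textbook-style illustration, not the study's binaural processing.
import numpy as np

def place_at_azimuth(mono, sr, azimuth_deg, head_radius=0.0875, c=343.0):
    theta = np.deg2rad(azimuth_deg)
    itd = (head_radius / c) * (theta + np.sin(theta))   # Woodworth approximation
    delay = int(round(abs(itd) * sr))                   # interaural delay in samples
    ild_db = 6.0 * abs(np.sin(theta))                   # crude level difference
    near = mono * 10 ** (ild_db / 40)                   # +ild/2 dB at the near ear
    far = np.concatenate([np.zeros(delay), mono])[: len(mono)] * 10 ** (-ild_db / 40)
    left, right = (near, far) if azimuth_deg < 0 else (far, near)
    return np.stack([left, right], axis=0)              # (2, n) stereo signal

sr = 44100
t = np.linspace(0, 1, sr, endpoint=False)
mono = 0.3 * np.sin(2 * np.pi * 440 * t)
stereo = place_at_azimuth(mono, sr, azimuth_deg=60)     # source to the right
print(stereo.shape)
```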

2011
Emotional Control and Visual Representation Using Advanced Audiovisual Interaction

Vassilis Psarras, Andreas Floros, Konstantinos Drossos, and Marianne Strapatsakis, “Emotional Control and Visual Representation Using Advanced Audiovisual Interaction”, International Journal of Arts and Technology, Vol. 4 (4), 2011, pp. 480-498

Modern interactive means combined with new digital media processing and representation technologies can provide a robust framework for enhancing user experience in multimedia entertainment systems and audiovisual artistic installations with non-traditional interaction/feedback paths based on user affective state. In this work, the ‘Elevator’ interactive audiovisual platform prototype is presented, which aims to provide a framework for signalling and expressing human behaviour related to emotions (such as anger) and finally produce a visual outcome of this behaviour, defined here as the emotional ‘thumbnail’ of the user. Optimised, real-time audio signal processing techniques are employed for monitoring the achieved anger-like behaviour, while the emotional elevation is attempted using appropriately selected combined audio/visual content reproduced using state-of-the-art audiovisual playback technologies that allow the creation of a realistic immersive audiovisual environment. The demonstration of the proposed prototype has shown that affective interaction is possible, allowing the further development of relative artistic and technological applications.
