
Publications


2023
Development of a speech emotion recognizer for large-scale child-centered audio recordings from a hospital environment [Journal]

Einari Vaaras, Sari Ahlqvist-Björkroth, Konstantinos Drossos, Liisa Lehtonen, and Okko Räsänen, “Development of a speech emotion recognizer for large-scale child-centered audio recordings from a hospital environment,” Speech Communication, vol. 148, pp. 9–22, 2023

In order to study how early emotional experiences shape infant development, one approach is to analyze the emotional content of speech heard by infants, as captured by child-centered daylong recordings and analyzed by automatic speech emotion recognition (SER) systems. However, since large-scale daylong audio is initially unannotated and differs from typical speech corpora collected in controlled environments, there are no existing in-domain SER systems for the task. Based on the existing literature, it is also unclear what the best approach is for deploying an SER system in a new domain. Consequently, in this study, we investigated alternative strategies for deploying an SER system for large-scale child-centered audio recordings from a neonatal hospital environment, comparing cross-corpus generalization, active learning (AL), and domain adaptation (DA) methods in the process. We first conducted simulations with existing emotion-labeled speech corpora to find the best strategy for SER system deployment. We then tested how the findings generalize to our new, initially unannotated dataset. We found that the studied AL method provided the most consistent results overall, being less dependent on the specifics of the training corpora or speech features than the alternative methods. However, in situations where annotating data is not possible, unsupervised DA proved to be the best approach. We also observed that the SER system deployed for real-world daylong child-centered audio recordings achieved a performance level comparable to those reported in the literature, and that the amount of human effort required for the deployment was relatively modest overall.
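The active-learning deployment strategy studied in the paper can be illustrated with a minimal pool-based uncertainty-sampling loop. Everything below is a synthetic stand-in: the features, the two-class labels, the simple logistic-regression model, and the query budget are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_logreg(X, y, lr=0.5, steps=500):
    """Minimal logistic regression via gradient descent (no bias term)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def predict_proba(X, w):
    return 1.0 / (1.0 + np.exp(-X @ w))

# Synthetic stand-in for emotion features with a two-class label
# (e.g. negative vs. non-negative valence).
n, d = 400, 8
X = rng.normal(size=(n, d))
y = (X[:, 0] > 0).astype(float)

# Start from a small labeled seed set; the rest is the unlabeled pool.
labeled = list(range(20))
unlabeled = list(range(20, n))

for _ in range(10):               # ten query rounds, ten labels each
    w = fit_logreg(X[labeled], y[labeled])
    p = predict_proba(X[unlabeled], w)
    # Uncertainty sampling: request labels where the model is least sure.
    order = np.argsort(np.abs(p - 0.5))[:10]
    queried = [unlabeled[i] for i in order]
    labeled.extend(queried)       # simulated human annotation
    unlabeled = [i for i in unlabeled if i not in queried]

w = fit_logreg(X[labeled], y[labeled])
acc = ((predict_proba(X, w) > 0.5) == (y > 0.5)).mean()
print(f"labels used: {len(labeled)}/{n}, accuracy: {acc:.2f}")
```

The key point matches the paper's finding in miniature: only a modest fraction of the pool is ever annotated, with the annotation effort steered toward the samples the current model finds hardest.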

Representation Learning for Audio Privacy Preservation Using Source Separation and Robust Adversarial Learning [Conference]

D. Luong, M. Tran, S. Gharib, K. Drossos and T. Virtanen, "Representation Learning for Audio Privacy Preservation Using Source Separation and Robust Adversarial Learning," IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), NY, USA, 2023

Privacy preservation has long been a concern in smart acoustic monitoring systems, where speech can be passively recorded along with a target signal in the system’s operating environment. In this study, we propose the integration of two commonly used approaches to privacy preservation: source separation and adversarial representation learning. The proposed system learns a latent representation of audio recordings that prevents differentiating between speech and non-speech recordings. Initially, the source separation network filters out some of the privacy-sensitive data, and during the adversarial learning process, the system learns a privacy-preserving representation of the filtered signal. We demonstrate the effectiveness of the proposed method by comparing it against systems without source separation, without adversarial learning, and without both. Overall, our results suggest that the proposed system can significantly improve speech privacy preservation compared to using source separation or adversarial learning alone, while maintaining good performance in the acoustic monitoring task.
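As a rough illustration of the adversarial-representation idea, the sketch below uses a crude linear analogue: a linear "adversary" is fit to recover the private speech/non-speech attribute, and the representation is then projected onto the null space of that adversary. This is a single projection step against a linear probe, not the paper's neural adversarial training, and all features and labels are synthetic assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_logreg(X, y, lr=0.5, steps=500):
    """Minimal logistic regression via gradient descent (no bias term)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def accuracy(X, y, w):
    return (((X @ w) > 0) == (y > 0.5)).mean()

# Synthetic features: dim 0 carries the monitoring task, dim 1 carries
# the private speech/non-speech attribute, the rest is noise.
n, d = 1000, 8
X = rng.normal(size=(n, d))
y_task = (X[:, 0] > 0).astype(float)
y_priv = (X[:, 1] > 0).astype(float)
train, test = slice(0, 600), slice(600, None)

# A linear adversary recovers the private attribute from raw features.
w_adv = fit_logreg(X[train], y_priv[train])
before = accuracy(X[test], y_priv[test], w_adv)

# Remove the direction the adversary exploits (null-space projection).
u = w_adv / np.linalg.norm(w_adv)
Z = X - np.outer(X @ u, u)

# A retrained adversary degrades, while the task classifier survives.
after = accuracy(Z[test], y_priv[test], fit_logreg(Z[train], y_priv[train]))
task = accuracy(Z[test], y_task[test], fit_logreg(Z[train], y_task[train]))
print(f"adversary: {before:.2f} -> {after:.2f}, task: {task:.2f}")
```

The trade-off shown here (private-attribute recovery drops toward chance while task accuracy is largely preserved) is the same objective the paper pursues with source separation plus adversarial training of a neural encoder.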

Privacy-preserving sound representation [Patents]

T. Virtanen, T. Heittola, S. Zhao, S. Gharib, and K. Drossos, “Privacy-preserving sound representation,” U.S. Patent US20230317086A1, filed Oct. 5, 2022; published Oct. 12, 2023

According to an example embodiment, a method (200) for audio-based monitoring is provided, the method (200) comprising: deriving (202), via usage of a predefined conversion model (M), based on audio data that represents sounds captured in a monitored space, one or more audio features that are descriptive of at least one characteristic of said sounds; identifying (204) respective occurrences of one or more predefined acoustic events in said space based on the one or more audio features; and carrying out (206), in response to identifying an occurrence of at least one of said one or more predefined acoustic events, one or more predefined actions associated with said at least one of said one or more predefined acoustic events, wherein said conversion model (M) is trained to provide said one or more audio features such that they include information that facilitates identification of respective occurrences of said one or more predefined acoustic events while preventing identification of speech characteristics.
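The claimed steps (202), (204), and (206) can be sketched as a toy pipeline. The stand-in conversion model, the framewise RMS-energy feature, the `loud_noise` event, and the `alert` action below are all illustrative assumptions, not details from the patent; in particular, the actual conversion model M is trained so that its features prevent recovery of speech characteristics.

```python
import numpy as np

def conversion_model(audio):
    """Stand-in for the trained conversion model M (step 202): maps raw
    audio to coarse audio features, here framewise RMS energy over
    160-sample frames."""
    frames = audio[: len(audio) // 160 * 160].reshape(-1, 160)
    return np.sqrt((frames ** 2).mean(axis=1))

def identify_events(features, threshold=0.5):
    """Step 204: flag a predefined acoustic event when any frame's
    energy exceeds a threshold (hypothetical event definition)."""
    return ["loud_noise"] if (features > threshold).any() else []

def monitor(audio):
    """Steps 202 -> 204 -> 206: derive features, identify events, and
    carry out the predefined action associated with each event."""
    actions = []
    for event in identify_events(conversion_model(audio)):
        actions.append(f"alert:{event}")  # predefined action per event
    return actions

quiet = np.zeros(1600)
loud = np.ones(1600)
print(monitor(quiet), monitor(loud))
```

Note that only the coarse features ever leave the conversion model, which is the structural point of the claim: event identification and actions operate downstream of a representation designed to withhold speech content.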