
Publications

Keyword: speech (3)

2025
Attractor-Based Speech Separation of Multiple Utterances by Unknown Number of Speakers [Conference]

Y. Wang, A. Politis, K. Drossos, and T. Virtanen, “Attractor-Based Speech Separation of Multiple Utterances by Unknown Number of Speakers,” in INTERSPEECH 2025, Rotterdam, Netherlands, 2025

This paper addresses the problem of single-channel speech separation, where the number of speakers is unknown and each speaker may speak multiple utterances. We propose a speech separation model that simultaneously performs separation, dynamically estimates the number of speakers, and detects individual speaker activities by integrating an attractor module. The proposed system outperforms existing methods by introducing an attractor-based architecture that effectively combines local and global temporal modeling for multi-utterance scenarios. To evaluate the method in reverberant and noisy conditions, a multi-speaker multi-utterance dataset was synthesized by combining LibriSpeech speech signals with WHAM! noise signals. The results demonstrate that the proposed system accurately estimates the number of sources, effectively detects source activities, and separates the corresponding utterances into the correct outputs in both known and unknown source-count scenarios.
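The abstract above describes estimating the speaker count with an attractor module. A minimal sketch of the general idea behind attractor-based counting (as popularized by encoder-decoder attractor approaches, not the paper's specific architecture): attractor vectors are generated in sequence, each is mapped to an existence probability, and counting stops at the first attractor whose probability falls below a threshold. The weight vector `w`, bias `b`, and threshold here are purely illustrative.

```python
import numpy as np

def count_speakers(attractors: np.ndarray, w: np.ndarray, b: float,
                   threshold: float = 0.5) -> int:
    """Estimate the number of active speakers from a sequence of
    attractor vectors. Each attractor is mapped to an existence
    probability via a sigmoid over a learned linear projection;
    counting stops at the first probability below the threshold."""
    probs = 1.0 / (1.0 + np.exp(-(attractors @ w + b)))  # sigmoid
    count = 0
    for p in probs:
        if p < threshold:
            break  # first "non-existent" attractor terminates counting
        count += 1
    return count

# Toy example: the first two attractors project to high logits,
# the third to a low one, so two speakers are counted.
attractors = np.array([[5.0], [5.0], [-5.0], [5.0]])
n = count_speakers(attractors, w=np.array([1.0]), b=0.0)  # → 2
```

Note that attractors after the stopping point are ignored, which is what lets the same model handle an unknown, variable number of speakers at inference time.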

2023
Development of a speech emotion recognizer for large-scale child-centered audio recordings from a hospital environment [Journal]

Einari Vaaras, Sari Ahlqvist-Björkroth, Konstantinos Drossos, Liisa Lehtonen, and Okko Räsänen, “Development of a speech emotion recognizer for large-scale child-centered audio recordings from a hospital environment,” Speech Communication, vol. 148, pp. 9-22, 2023

In order to study how early emotional experiences shape infant development, one approach is to analyze the emotional content of speech heard by infants, as captured by child-centered daylong recordings and analyzed by automatic speech emotion recognition (SER) systems. However, since large-scale daylong audio is initially unannotated and differs from typical speech corpora recorded in controlled environments, no existing in-domain SER systems are available for the task. Based on the existing literature, it is also unclear what the best approach is for deploying a SER system in a new domain. Consequently, in this study, we investigated alternative strategies for deploying a SER system for large-scale child-centered audio recordings from a neonatal hospital environment, comparing cross-corpus generalization, active learning (AL), and domain adaptation (DA) methods in the process. We first conducted simulations with existing emotion-labeled speech corpora to find the best strategy for SER system deployment. We then tested how the findings generalize to our new, initially unannotated dataset. As a result, we found that the studied AL method provided overall the most consistent results, being less dependent on the specifics of the training corpora or speech features than the alternative methods. However, in situations where annotating data is not possible, unsupervised DA proved to be the best approach. We also observed that the SER system deployed for real-world daylong child-centered audio recordings achieved a performance level comparable to those reported in the literature, and that the amount of human effort required for the deployment was overall relatively modest.

2021
Automatic analysis of the emotional content of speech in daylong child-centered recordings from a neonatal intensive care unit [Conference]

E. Vaaras, S. Ahlqvist-Björkroth, K. Drossos, and O. Räsänen, “Automatic Analysis of the Emotional Content of Speech in Daylong Child-Centered Recordings from a Neonatal Intensive Care Unit,” in Proceedings of Interspeech 2021, pp. 3380-3384, Brno, Czech Republic, 2021

Researchers have recently started to study how the emotional speech heard by young infants can affect their developmental outcomes. As a part of this research, hundreds of hours of daylong recordings from preterm infants' audio environments were collected from two hospitals in Finland and Estonia in the context of the so-called APPLE study. In order to analyze the emotional content of speech in such a massive dataset, an automatic speech emotion recognition (SER) system is required. However, there are no emotion labels or existing in-domain SER systems to be used for this purpose. In this paper, we introduce this initially unannotated large-scale real-world audio dataset and describe the development of a functional SER system for the Finnish subset of the data. We explore the effectiveness of alternative state-of-the-art techniques to deploy a SER system to a new domain, comparing cross-corpus generalization, WGAN-based domain adaptation, and active learning in the task. As a result, we show that the best-performing models are able to achieve a classification performance of 73.4% unweighted average recall (UAR) and 73.2% UAR for a binary classification for valence and arousal, respectively. The results also show that active learning achieves the most consistent performance compared to the two alternatives.
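The abstract reports results as unweighted average recall (UAR), the standard metric in SER when classes are imbalanced. UAR is the mean of the per-class recalls, so every class counts equally regardless of how many samples it has; a short sketch:

```python
import numpy as np

def unweighted_average_recall(y_true, y_pred) -> float:
    """UAR: average the recall of each class present in y_true,
    giving every class equal weight regardless of its frequency."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = []
    for c in np.unique(y_true):
        mask = y_true == c
        recalls.append(np.mean(y_pred[mask] == c))  # recall of class c
    return float(np.mean(recalls))

# Toy example with an imbalanced binary task:
# class 0 recall = 2/3, class 1 recall = 1/1, so UAR = 5/6 ≈ 0.833,
# while plain accuracy would be 3/4 = 0.75.
uar = unweighted_average_recall([0, 0, 0, 1], [0, 0, 1, 1])
```

For a perfectly balanced test set, UAR coincides with accuracy; the metric only differs, and matters, when class frequencies are skewed.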
