Raija-Leena Punamäki, Safwat Y. Diab, Konstantinos Drosos, and Samir R. Qouta, “The impact of mother’s mental health, infant characteristics and war trauma on the acoustic features of infant-directed singing,” Infant Mental Health Journal, 2025
Infant-directed singing (IDSi) is a natural means of dyadic communication that contributes to children's mental health by enhancing emotion expression, close relationships, exploration, and learning. It is therefore important to learn about the factors that influence IDSi. This study modeled mother-related (mental health), infant-related (emotional responses and health status), and environment-related (war trauma) factors influencing acoustic IDSi features, such as pitch (F0) variability, amplitude and vibration, and the shapes and movements of the F0 contour. The participants were 236 mothers and infants from Gaza, the Occupied Palestinian Territories. The mothers reported their own mental health problems, their infants’ emotionality and regulation skills, and, complemented by pediatric checkups, the infants’ illnesses and disorders, as well as traumatic war events, which were also photo-documented. The results showed that the mothers’ mental health problems and the infants’ poor health status were associated with IDSi characterized by narrow and lifeless amplitude and vibration, and poor health was also associated with limited and rigid shapes and movements of the F0 contours. Traumatic war events were associated with flat and narrow F0 variability and with monotonous and invariable resonance and rhythm of the IDSi formants. The infants’ emotional responses did not impact IDSi. The potential of protomusical singing to help war-affected dyads is discussed.
Y. Wang, A. Politis, K. Drossos, and T. Virtanen, “Attractor-Based Speech Separation of Multiple Utterances by Unknown Number of Speakers,” in INTERSPEECH 2025, Rotterdam, Netherlands, 2025
This paper addresses the problem of single-channel speech separation, where the number of speakers is unknown and each speaker may produce multiple utterances. We propose a speech separation model that simultaneously performs separation, dynamically estimates the number of speakers, and detects individual speaker activities by integrating an attractor module. The proposed system outperforms existing methods by introducing an attractor-based architecture that effectively combines local and global temporal modeling for multi-utterance scenarios. To evaluate the method in reverberant and noisy conditions, a multi-speaker multi-utterance dataset was synthesized by combining Librispeech speech signals with WHAM! noise signals. The results demonstrate that the proposed system accurately estimates the number of sources, effectively detects source activities, and separates the corresponding utterances into the correct outputs in both known and unknown source-count scenarios.
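To make the attractor idea concrete, below is a minimal sketch of an encoder-decoder attractor module (in the spirit of EDA-style diarization and separation systems): it derives per-speaker attractor vectors from frame-level embeddings, estimates the number of active speakers from attractor existence probabilities, and produces frame-wise speaker activities. The layer sizes, the fixed maximum speaker count, and the omission of the separation head are assumptions for illustration only, not the exact architecture proposed in the paper.

```python
# Minimal, hedged sketch of an encoder-decoder attractor mechanism.
# Not the authors' architecture; shapes and thresholds are assumptions.
import torch
import torch.nn as nn


class AttractorEstimator(nn.Module):
    def __init__(self, emb_dim: int = 128, max_speakers: int = 5):
        super().__init__()
        self.encoder = nn.LSTM(emb_dim, emb_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, emb_dim, batch_first=True)
        self.exist_prob = nn.Linear(emb_dim, 1)  # attractor existence head
        self.max_speakers = max_speakers

    def forward(self, embeddings: torch.Tensor):
        # embeddings: (batch, frames, emb_dim)
        _, state = self.encoder(embeddings)
        # Feed zero vectors to the decoder; each step yields one attractor.
        zeros = embeddings.new_zeros(
            embeddings.size(0), self.max_speakers, embeddings.size(2))
        attractors, _ = self.decoder(zeros, state)          # (B, S_max, D)
        probs = torch.sigmoid(self.exist_prob(attractors))  # (B, S_max, 1)
        # Estimated speaker count: attractors with existence prob > 0.5.
        num_speakers = (probs.squeeze(-1) > 0.5).sum(dim=1)
        # Frame-wise activity via dot product between frames and attractors.
        activity = torch.sigmoid(embeddings @ attractors.transpose(1, 2))
        return attractors, probs, num_speakers, activity


if __name__ == "__main__":
    frames = torch.randn(2, 200, 128)   # dummy frame embeddings
    att, p, n, act = AttractorEstimator()(frames)
    print(att.shape, p.shape, n, act.shape)
```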
E. Moliner, V. Välimäki, K. Drossos, and M. Hämäläinen, “Automatic Audio Equalization with Semantic Embeddings,” in AES International Conference on Artificial Intelligence and Machine Learning in Audio (AES AI-MLA), London, U.K., 2025
This paper presents a data-driven approach to automatic blind equalization of audio by predicting log-mel spectral features and deriving an inverse filter. The method uses a deep neural network, where a pre-trained model provides semantic embeddings as a backbone, and only a lightweight head is trained. This design is intended to enhance training efficiency and generalization. Trained on both music and speech, the model is robust to noise and reverberation. Objective evaluations confirm its effectiveness, and subjective tests show performance comparable to that of an oracle that uses true log-mel spectral features, indicating that the model accurately estimates the desired characteristics, with remaining limitations attributed to the filtering stage. Overall, the results highlight the potential of the method for real-world audio enhancement applications.
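A minimal sketch of the frozen-backbone-plus-lightweight-head setup described above: a stand-in embedding model provides semantic features, a small trainable head predicts the target log-mel spectrum, and a per-band equalization correction is derived from the gap between the predicted and the observed spectra. The backbone dimensionality, mel configuration, and gain clamping are illustrative assumptions; the paper's pre-trained model and inverse-filter design may differ.

```python
# Hedged sketch: semantic embedding -> predicted target log-mel -> EQ gains.
# The backbone is a stand-in; only the small head would be trained.
import torch
import torch.nn as nn


class EqualizerHead(nn.Module):
    def __init__(self, emb_dim: int = 512, n_mels: int = 64):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(emb_dim, 256), nn.ReLU(), nn.Linear(256, n_mels))

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        # embedding: (batch, emb_dim) -> predicted target log-mel (batch, n_mels)
        return self.head(embedding)


def eq_gains(predicted_logmel, observed_logmel, max_gain=12.0):
    # Per-band correction: boost where the observed spectrum falls short of
    # the predicted target, cut where it exceeds it; clamp to a safe range.
    return (predicted_logmel - observed_logmel).clamp(-max_gain, max_gain)


if __name__ == "__main__":
    backbone_emb = torch.randn(1, 512)       # frozen backbone output (stand-in)
    observed = torch.randn(1, 64)            # log-mel of the degraded input
    target = EqualizerHead()(backbone_emb)   # predicted target log-mel
    print(eq_gains(target, observed).shape)
```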
M. Heikkinen, A. Politis, K. Drossos, and T. Virtanen, “Gen-A: Generalizing Ambisonics Neural Encoding to Unseen Microphone Arrays,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 2025
Using deep neural networks (DNNs) for encoding of microphone array (MA) signals to the Ambisonics spatial audio format can surpass certain limitations of established conventional methods, but existing DNN-based methods need to be trained separately for each MA. This paper proposes a DNN-based method for Ambisonics encoding that can generalize to arbitrary MA geometries unseen during training. The method takes as inputs the MA geometry and MA signals and uses a multi-level encoder consisting of separate paths for geometry and signal data, where geometry features inform the signal encoder at each level. The method is validated in simulated anechoic and reverberant conditions with one and two sources. The results indicate improvement over conventional encoding across the whole frequency range for dry scenes, while for reverberant scenes the improvement is frequency-dependent.
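As a rough illustration of the two-path idea, the sketch below conditions each level of a signal encoder on features computed from the microphone geometry. The FiLM-style scale-and-shift modulation, layer shapes, and first-order Ambisonics output are assumptions for illustration; the paper's conditioning mechanism and architecture may differ.

```python
# Hedged sketch of a geometry-conditioned multi-level signal encoder.
import torch
import torch.nn as nn


class GeometryConditionedEncoder(nn.Module):
    def __init__(self, n_mics: int = 4, channels: int = 32, levels: int = 3):
        super().__init__()
        # Geometry path: mic positions (n_mics, 3) -> one feature vector per level.
        self.geom_mlps = nn.ModuleList(
            [nn.Sequential(nn.Linear(3, channels), nn.ReLU(),
                           nn.Linear(channels, 2 * channels))
             for _ in range(levels)])
        # Signal path: per-level 1-D conv blocks over the microphone array signals.
        self.sig_convs = nn.ModuleList(
            [nn.Conv1d(n_mics if i == 0 else channels, channels, 5, padding=2)
             for i in range(levels)])
        self.out = nn.Conv1d(channels, 4, 1)  # first-order Ambisonics (W, X, Y, Z)

    def forward(self, signals: torch.Tensor, geometry: torch.Tensor):
        # signals: (batch, n_mics, samples); geometry: (batch, n_mics, 3)
        x = signals
        for conv, mlp in zip(self.sig_convs, self.geom_mlps):
            x = torch.relu(conv(x))
            # Pool geometry features over mics, split into scale and shift.
            g = mlp(geometry).mean(dim=1)                  # (batch, 2*channels)
            scale, shift = g.chunk(2, dim=-1)
            x = x * (1 + scale.unsqueeze(-1)) + shift.unsqueeze(-1)
        return self.out(x)                                 # (batch, 4, samples)


if __name__ == "__main__":
    sig = torch.randn(1, 4, 16000)
    geo = torch.randn(1, 4, 3)
    print(GeometryConditionedEncoder()(sig, geo).shape)    # (1, 4, 16000)
```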
Diep Luong, Mikko Heikkinen, Konstantinos Drossos, and Tuomas Virtanen, “Knowledge Distillation for Speech Denoising by Latent Representation Alignment with Cosine Distance,” in 158th Audio Engineering Society Convention, Warsaw, Poland, May 22–24, 2025
Speech denoising is a prominent and widely used task that appears in many common use cases. Although very powerful machine learning methods have been published, most of them are too complex for deployment in everyday and/or low-resource computational environments, such as hand-held devices, smart glasses, hearing aids, and automotive platforms. Knowledge distillation (KD) is a prominent way of alleviating this complexity mismatch, by transferring the learned knowledge from a pre-trained complex model, the teacher, to a less complex one, the student. KD is implemented through minimization criteria (e.g., loss functions) between the learned representations of the teacher and the corresponding ones of the student. Existing KD methods for speech denoising hamper distillation by bounding the student's learning to the distribution learned by the teacher. Our work focuses on a method that alleviates this issue by exploiting properties of the cosine similarity used as the KD loss function. We use a publicly available dataset and a typical speech denoising architecture (e.g., a UNet) tuned for low-resource environments, and we conduct repeated experiments with different architectural variations between the teacher and the student, reporting the mean and standard deviation of metrics for our method and for a state-of-the-art method used as a baseline. Our results show that our method enables smaller speech denoising models, suitable for deployment on small devices and embedded systems, to perform better than when trained conventionally or with other KD methods.
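A minimal sketch of the latent-alignment idea with a cosine distance distillation loss: the student's intermediate representation is pulled towards that of a frozen teacher, alongside the usual denoising loss. The tiny stand-in models, the 1x1 projection used to match latent widths, and the loss weighting are illustrative assumptions, not the paper's exact setup.

```python
# Hedged sketch of knowledge distillation via cosine distance between
# teacher and student latent representations.
import torch
import torch.nn as nn
import torch.nn.functional as F


def cosine_distillation_loss(student_latent, teacher_latent):
    # 1 - cosine similarity, averaged over batch and time; sensitive to the
    # direction of the representations, not their magnitude.
    sim = F.cosine_similarity(student_latent, teacher_latent, dim=1)
    return (1.0 - sim).mean()


class TinyDenoiser(nn.Module):
    # Stand-in single-layer "encoder-decoder"; a real system would use a UNet.
    def __init__(self, channels: int):
        super().__init__()
        self.encoder = nn.Conv1d(1, channels, 9, padding=4)
        self.decoder = nn.Conv1d(channels, 1, 9, padding=4)

    def forward(self, noisy):
        latent = torch.relu(self.encoder(noisy))
        return self.decoder(latent), latent


if __name__ == "__main__":
    teacher, student = TinyDenoiser(64), TinyDenoiser(16)
    teacher.eval()
    proj = nn.Conv1d(16, 64, 1)          # match student latent width to teacher
    noisy = torch.randn(2, 1, 16000)
    clean = torch.randn(2, 1, 16000)
    with torch.no_grad():
        _, t_latent = teacher(noisy)     # frozen teacher latent
    est, s_latent = student(noisy)
    loss = F.l1_loss(est, clean) \
        + 0.5 * cosine_distillation_loss(proj(s_latent), t_latent)
    print(float(loss))
```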