Publications

Moving Speaker Separation Via Parallel Spectral-Spatial Processing [Journal]

Y. Wang, A. Politis, K. Drossos, and T. Virtanen, "Moving Speaker Separation Via Parallel Spectral-Spatial Processing," IEEE Transactions on Audio, Speech and Language Processing, 2026

Multi-channel speech separation in dynamic environments is challenging as time-varying spatial and spectral features evolve at different temporal scales. Existing methods typically employ sequential architectures, forcing a single network stream to simultaneously model both feature types, creating an inherent modeling conflict. In this paper, we propose a dual-branch parallel spectral-spatial (PS2) architecture that separately processes spectral and spatial features through parallel streams. The spectral branch uses a bi-directional long short-term memory (BLSTM)-based frequency module, a Mamba-based temporal module, and a self-attention module to model spectral features. The spatial branch employs bi-directional gated recurrent unit (BGRU) networks to process spatial features that encode the evolving geometric relationships between sources and microphones. Features from both branches are integrated through a cross-attention fusion mechanism that adaptively weights their contributions. Experimental results demonstrate that the PS2 outperforms existing state-of-the-art (SOTA) methods by 1.6-2.2 dB in scale-invariant signal-to-distortion ratio (SI-SDR) for moving speaker scenarios, with robust separation quality under different reverberation times (RT60), noise levels, and source movement speeds. Even with fast source movements, the proposed model maintains SI-SDR improvements of over 13 dB. These improvements are consistently observed across multiple datasets, including WHAMR! and our generated WSJ0-Demand-6ch-Move dataset.
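For readers unfamiliar with the metric quoted in the abstract above, the following is a minimal sketch of how SI-SDR is commonly computed for a single estimated source against its reference. It is not the paper's evaluation code. The function name `si_sdr` and the `eps` stabiliser are illustrative assumptions, and the signals are assumed to be equal-length, zero-mean sequences.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant signal-to-distortion ratio in dB (illustrative sketch)."""
    if len(estimate) != len(reference):
        raise ValueError("signals must have equal length")
    # Project the estimate onto the reference to find the optimal scaling.
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    alpha = dot / (ref_energy + eps)
    # Scaled reference is the "target"; everything else counts as distortion.
    target = [alpha * r for r in reference]
    distortion = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    distortion_energy = sum(d * d for d in distortion)
    return 10.0 * math.log10((target_energy + eps) / (distortion_energy + eps))


reference = [1.0, 0.0, -1.0, 0.0]
estimate = [1.1, 0.1, -0.9, 0.1]
print(round(si_sdr(estimate, reference), 2))
```

An SI-SDR improvement is usually reported as the SI-SDR of the separated output minus the SI-SDR of the unprocessed mixture, both measured against the same reference. Rescaling the estimate leaves the value unchanged, which is what makes the metric scale-invariant.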

https://ieeexplore.ieee.org/document/11425833

Paper (.pdf)
Updated: 11-03-2026 08:20 - Size: 3.8 MB

BibTex record (.bib)
Updated: 11-03-2026 08:20 - Size: 315 B

Automatic Contextual Audio Denoising [Conference]

D. Luong, K. Drossos, M. Heikkinen, and T. Virtanen, "Automatic Contextual Audio Denoising," in Proceedings of the 34th European Signal Processing Conference (EUSIPCO), Bruges, Belgium, 2026

Audio context determines which sound components and sources are relevant and which can be perceived as irrelevant (noise) by listeners. For example, traffic noise is informative in urban surveillance but is noise for a phone call at the same location. Most current audio denoising systems apply fixed target-noise definitions, often removing useful components in one context while failing to suppress irrelevant components. To address this, we introduce the concept of automatic contextual audio denoising (ACAD), which defines target and noise based on the inferred context. In this work, we restrict context to be associated with an acoustic scene class. We label sound events outside the event distribution of a scene class (noise) as out-of-context (OC) and events typical for that scene as in-context (IC). We implement a deep learning method that automatically infers the context of the audio signal and removes OC components, and benchmark it against variants: without context inference, with oracle context, and with separately provided uninformative context. On paired clean/noisy data across diverse contexts, where OC components in one context may be IC in another, our proposed method outperforms other approaches across standard objective metrics, indicating that the model can infer context and that context-dependent processing can enhance denoising.

https://arxiv.org/abs/2605.22262

Paper (.pdf)
Updated: 25-05-2026 19:03 - Size: 527.18 KB

BibTex record (.bib)
Updated: 25-05-2026 19:03 - Size: 251 B

Beyond Omnidirectional: Neural Ambisonics Encoding for Arbitrary Microphone Directivity Patterns using Cross-Attention [Conference]

M. Heikkinen, A. Politis, K. Drossos, and T. Virtanen, "Beyond Omnidirectional: Neural Ambisonics Encoding for Arbitrary Microphone Directivity Patterns using Cross-Attention," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2026

We present a deep neural network approach for encoding microphone array signals into Ambisonics that generalizes to arbitrary microphone array configurations with fixed microphone count but varying locations and frequency-dependent directional characteristics. Unlike previous methods that rely only on array geometry as metadata, our approach uses directional array transfer functions, enabling accurate characterization of real-world arrays. The proposed architecture employs separate encoders for audio and directional responses, combining them through cross-attention mechanisms to generate array-independent spatial audio representations. We evaluate the method on simulated data in two settings: a mobile phone with complex body scattering, and a free-field condition, both with varying numbers of sound sources in reverberant environments. Evaluations demonstrate that our approach outperforms both conventional digital signal processing-based methods and existing deep neural network solutions. Furthermore, using array transfer functions instead of geometry as metadata input improves accuracy on realistic arrays.

https://arxiv.org/abs/2601.23196

Paper (.pdf)
Updated: 11-03-2026 08:25 - Size: 1.79 MB

BibTex record (.bib)
Updated: 11-03-2026 08:25 - Size: 401 B

Discriminating Real And Synthetic Super-Resolved Audio Samples Using Embedding-Based Classifiers [Conference]

M. Silaev, K. Drossos, and T. Virtanen, "Discriminating Real And Synthetic Super-Resolved Audio Samples Using Embedding-Based Classifiers," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2026

Generative adversarial networks (GANs) and diffusion models have recently achieved state-of-the-art performance in audio super-resolution (ADSR), producing perceptually convincing wideband audio from narrowband inputs. However, existing evaluations primarily rely on signal-level or perceptual metrics, leaving open the question of how closely the distributions of synthetic super-resolved and real wideband audio match. Here we address this problem by analyzing the separability of real and super-resolved audio in various embedding spaces. We consider both middle-band (4→16 kHz) and full-band (16→48 kHz) upsampling tasks for speech and music, training linear classifiers to distinguish real from synthetic samples based on multiple types of audio embeddings. Comparisons with objective metrics and subjective listening tests reveal that embedding-based classifiers achieve near-perfect separation, even when the generated audio attains high perceptual quality and state-of-the-art metric scores. This behavior is consistent across datasets and models, including recent diffusion-based approaches, highlighting a persistent gap between perceptual quality and true distributional fidelity in ADSR models.

https://arxiv.org/abs/2601.03443

Paper (.pdf)
Updated: 08-01-2026 09:46 - Size: 404.91 KB

BibTex record (.bib)
Updated: 08-01-2026 09:46 - Size: 395 B

Method and apparatus for training and using a microphone geometry assisted encoder model to generate spatial audio signals [Patents]

M. O. Heikkinen, K. Drosos, A. Politis, and T. Virtanen, “Method and apparatus for training and using a microphone geometry assisted encoder model to generate spatial audio signals,” U.S. Patent US20260065918A1, filed Aug. 28, 2025; published Mar. 05, 2026

A system for training a microphone geometry assisted encoder model and then utilizing the trained model to generate spatial audio signals that have been captured by a plurality of microphones. In a method for generating spatial audio signals, the method includes receiving geometry data related to a plurality of microphones of an audio capturing device and audio signal data captured by the plurality of microphones. The method also includes generating a spatial audio signal based on an output of a trained microphone geometry assisted encoder model. The trained microphone geometry assisted encoder model includes a geometry encoder configured to encode the geometry data and a signal encoder configured to encode the audio signal data. The trained microphone geometry assisted encoder model further includes a signal decoder having a plurality of layers and configured to generate the output upon which the spatial audio signal is based.

https://patents.google.com/patent/US20260065918A1/en

Model for speech enhancement [Patents]

K. Drosos, M. O. Heikkinen, J. T. Vilkamo, and P. Tsiaflakis, “Model for speech enhancement,” U.S. Patent US20260065922A1, filed Aug. 15, 2025; published Mar. 05, 2026

Examples of the disclosure relate to a model that can be used for speech enhancement. The model comprises an encoder part comprising a sequence of encoding layers and caused to receive input data. The input data is based on a current frame of a noisy speech signal and one or more past frames of the noisy speech signal. The sequence of encoding layers is caused to process the input data so that output data of the encoder part comprises a reduced number of the multiple frequency positions and a single temporal position. The model also comprises a decoder part comprising a sequence of decoding layers caused to receive data from a prior decoding layer. The output data of the decoder part comprises multiple frequency positions and a single temporal position. The output data of the decoder part is for post processing to provide an output signal for speech enhancement.

https://patents.google.com/patent/US20260065922A1/en

Speech and noise disentanglement for acoustic echo cancellation [Patents]

K. Drosos, M. O. Heikkinen, S. Vesa, and M. T. Vilermo, “Speech and noise disentanglement for acoustic echo cancellation,” U.S. Patent US20260080885A1, filed Aug. 27, 2025; published Mar. 19, 2026

The present disclosure relates to an apparatus, that obtains a far-end signal and a near-end microphone signal, determines, based on at least the far-end signal, a far-end speech signal estimate and a far-end noise signal estimate, determines, based on at least the near-end microphone signal, a near-end microphone speech signal estimate and a near-end microphone noise signal estimate, determines, based on at least the far-end speech signal estimate and the near-end microphone speech signal estimate, a predicted near-end speech signal, determines, based on at least the far-end noise signal estimate and the near-end microphone noise signal estimate, a predicted near-end noise signal and outputs at least the predicted near-end speech signal and predicted near-end noise signal.

https://patents.google.com/patent/US20260080885A1/en

The impact of mother's mental health, infant characteristics and war trauma on the acoustic features of infant-directed singing [Journal]

Raija-Leena Punamäki, Safwat Y. Diab, Konstantinos Drosos, and Samir R. Qouta, “The impact of mother’s mental health, infant characteristics and war trauma on the acoustic features of infant‐directed singing,” Infant Mental Health Journal, 2025

Infant-directed singing (IDSi) is a natural means of dyadic communication that contributes to children's mental health by enhancing emotion expression, close relationships, exploration and learning. Therefore, it is important to learn about factors that impact the IDSi. This study modeled the mother- (mental health), infant- (emotional responses and health status) and environment (war trauma)-related factors influencing acoustic IDSi features, such as pitch (F0) variability, amplitude and vibration and the F0 contour of shapes and movements. The participants were 236 mothers and infants from Gaza, the Occupied Palestinian Territories. The mothers reported their mental health problems, infants’ emotionality and regulation skills, and, along with pediatric checkups, illnesses and disorders, as well as traumatic war events that were also photo documented. The results showed that the mothers’ mental health problems and infants’ poor health status were associated with IDSi, characterized by narrow and lifeless amplitude and vibration, and poor health was also associated with the limited and rigid shapes and movements of F0 contours. Traumatic war events were associated with flat and narrow F0 variability and the monotonous and invariable resonance and rhythm of IDSi formants. The infants’ emotional responses did not impact IDSi. The potential of protomusical singing to help war-affected dyads is discussed.

https://onlinelibrary.wiley.com/doi/full/10.1002/imhj.70036

Paper (.pdf)
Updated: 21-09-2025 17:17 - Size: 517.32 KB

BibTex record (.bib)
Updated: 21-09-2025 17:17 - Size: 363 B

Attractor-Based Speech Separation of Multiple Utterances by Unknown Number of Speakers [Conference]

Y. Wang, A. Politis, K. Drossos, and T. Virtanen, “Attractor-Based Speech Separation of Multiple Utterances by Unknown Number of Speakers,” in INTERSPEECH 2025, Rotterdam, Netherlands, 2025

This paper addresses the problem of single-channel speech separation, where the number of speakers is unknown, and each speaker may speak multiple utterances. We propose a speech separation model that simultaneously performs separation, dynamically estimates the number of speakers, and detects individual speaker activities by integrating an attractor module. The proposed system outperforms existing methods by introducing an attractor-based architecture that effectively combines local and global temporal modeling for multi-utterance scenarios. To evaluate the method in reverberant and noisy conditions, a multi-speaker multi-utterance dataset was synthesized by combining LibriSpeech speech signals with WHAM! noise signals. The results demonstrate that the proposed system accurately estimates the number of sources. The system effectively detects source activities and separates the corresponding utterances into correct outputs in both known and unknown source count scenarios.

https://www.isca-archive.org/interspeech_2025/wang25t_interspeech.html

Paper (.pdf)
Updated: 23-09-2025 10:45 - Size: 1.5 MB

BibTex record (.bib)
Updated: 23-09-2025 10:45 - Size: 276 B

Automatic Audio Equalization with Semantic Embeddings [Conference]

E. Moliner, V. Välimäki, K. Drossos, and M. Hämäläinen, “Automatic Audio Equalization with Semantic Embeddings,” in AES International Conference on Artificial Intelligence and Machine Learning in Audio (AES AI-MLA), London, U.K., 2025

This paper presents a data-driven approach to automatic blind equalization of audio by predicting log-mel spectral features and deriving an inverse filter. The method uses a deep neural network, where a pre-trained model provides semantic embeddings as a backbone, and only a lightweight head is trained. This design is intended to enhance training efficiency and generalization. Trained on both music and speech, the model is robust to noise and reverberation. Objective evaluations confirm its effectiveness, and subjective tests show performance comparable to that of an oracle that uses true log-mel spectral features, indicating that the model accurately estimates the desired characteristics, with remaining limitations attributed to the filtering stage. Overall, the results highlight the potential of the method for real-world audio enhancement applications.

Paper (.pdf)
Updated: 23-09-2025 10:52 - Size: 703.04 KB

BibTex record (.bib)
Updated: 23-09-2025 10:52 - Size: 344 B

Gen-A: Generalizing Ambisonics Neural Encoding to Unseen Microphone Arrays [Conference]

M. Heikkinen, A. Politis, K. Drossos and T. Virtanen, "Gen-A: Generalizing Ambisonics Neural Encoding to Unseen Microphone Arrays," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 2025

Using deep neural networks (DNNs) for encoding of microphone array (MA) signals to the Ambisonics spatial audio format can surpass certain limitations of established conventional methods, but existing DNN-based methods need to be trained separately for each MA. This paper proposes a DNN-based method for Ambisonics encoding that can generalize to arbitrary MA geometries unseen during training. The method takes as inputs the MA geometry and MA signals and uses a multi-level encoder consisting of separate paths for geometry and signal data, where geometry features inform the signal encoder at each level. The method is validated in simulated anechoic and reverberant conditions with one and two sources. The results indicate improvement over conventional encoding across the whole frequency range for dry scenes, while for reverberant scenes the improvement is frequency-dependent.

https://ieeexplore.ieee.org/document/10887869

Paper (.pdf)
Updated: 21-09-2025 17:04 - Size: 572.69 KB

BibTex record (.bib)
Updated: 21-09-2025 17:04 - Size: 398 B

Knowledge Distillation for Speech Denoising by Latent Representation Alignment with Cosine Distance [Conference]

Diep Luong, Mikko Heikkinen, Konstantinos Drossos, and Tuomas Virtanen, “Knowledge Distillation for Speech Denoising by Latent Representation Alignment with Cosine Distance,” 158th Audio Engineering Society Convention, May 22–24, Warsaw, Poland, 2025

Speech denoising is a prominent and widely utilized task, appearing in many common use-cases. Although there are very powerful published machine learning methods, most of those are too complex for deployment in everyday and/or low-resource computational environments, like hand-held devices, smart glasses, hearing aids, automotive platforms, etc. Knowledge distillation (KD) is a prominent way of alleviating this complexity mismatch, by transferring the learned knowledge from a pre-trained complex model, the teacher, to another less complex one, the student. KD is implemented by using minimization criteria (e.g. loss functions) between learned information of the teacher and the corresponding one from the student. Existing KD methods for speech denoising hamper the KD by bounding the learning of the student to the distribution learned by the teacher. Our work focuses on a method that tries to alleviate this issue, by exploiting properties of the cosine similarity used as the KD loss function. We use a publicly available dataset, a typical architecture for speech denoising (e.g. UNet) that is tuned for low-resource environments and conduct repeated experiments with different architectural variations between the teacher and the student, reporting mean and standard deviation of metrics of our method and another, state-of-the-art method that is used as a baseline. Our results show that with our method we can make smaller speech denoising models, capable of being deployed on small devices/embedded systems, perform better compared to when typically trained and when using other KD methods.
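As a rough illustration of the kind of loss the abstract above refers to, the sketch below computes a cosine distance between a student latent vector and a teacher latent vector. It is not the authors' implementation. The function name `cosine_distance`, the flat-vector inputs, and the assumption that both latents have the same dimensionality are all illustrative. One well-known property of cosine similarity is that it ignores vector magnitude, so the student is penalised only for direction mismatch. Whether this is the specific property the paper exploits is an assumption here.

```python
import math


def cosine_distance(student_latent, teacher_latent, eps=1e-12):
    """Return 1 - cosine similarity between two equal-length vectors (sketch)."""
    if len(student_latent) != len(teacher_latent):
        raise ValueError("latents must have the same dimensionality")
    dot = sum(s * t for s, t in zip(student_latent, teacher_latent))
    student_norm = math.sqrt(sum(s * s for s in student_latent))
    teacher_norm = math.sqrt(sum(t * t for t in teacher_latent))
    # eps guards against division by zero for all-zero latents.
    return 1.0 - dot / (student_norm * teacher_norm + eps)


teacher = [1.0, 2.0, 3.0]
# Same direction, different magnitude: the distance is (numerically) zero.
print(cosine_distance([2.0, 4.0, 6.0], teacher))
# Orthogonal vectors: the distance is one.
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
```

In a training loop, a term like this would typically be added to the ordinary denoising loss of the student, with the teacher's latent treated as a fixed target.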

Paper (.pdf)
Updated: 21-09-2025 17:10 - Size: 615.44 KB

BibTex record (.bib)
Updated: 21-09-2025 17:10 - Size: 310 B

Lightweight DNN for Full-Band Speech Denoising on Mobile Devices: Exploiting Long and Short Temporal Patterns [Conference]

K. Drossos, M. Heikkinen, and P. Tsiaflakis, "Lightweight DNN for Full-Band Speech Denoising on Mobile Devices: Exploiting Long and Short Temporal Patterns," in Proceedings of the 27th IEEE International Workshop on Multimedia Signal Processing (MMSP 2025), Tsinghua, China, 2025

Speech denoising (SD) is an important task of many, if not all, modern signal processing chains used in devices and for everyday-life applications. While there are many published and powerful deep neural network (DNN)-based methods for SD, few are optimized for resource-constrained platforms such as mobile devices. Additionally, most DNN-based methods for SD do not focus on full-band (FB) signals, i.e. signals with a 48 kHz sampling rate, and/or low latency cases. In this paper we present a causal, low latency, and lightweight DNN-based method for full-band SD, leveraging both short and long temporal patterns. The method is based on a modified UNet architecture employing look-back frames, temporal spanning of convolutional kernels, and recurrent neural networks for exploiting short and long temporal patterns in the signal and estimated denoising mask. The DNN operates on a causal frame-by-frame basis taking as an input the STFT magnitude, utilizes inverted bottlenecks inspired by MobileNet, employs causal instance normalization for channel-wise normalization, and achieves a real-time factor below 0.02 when deployed on a modern mobile phone. The proposed method is evaluated using established speech denoising metrics and publicly available datasets, demonstrating its effectiveness in achieving an (SI-)SDR value that outperforms existing FB and low latency SD methods.
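To put the real-time factor quoted above in concrete terms, the sketch below shows the usual definition (processing time divided by the duration of the audio processed) and what a given factor implies per frame. The helper names and the hop size of 480 samples are hypothetical values chosen for illustration. The abstract does not state the frame or hop size.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Ratio of compute time to audio duration; below 1.0 means faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def frame_duration_ms(hop_samples, sample_rate_hz):
    """Duration of audio covered by one hop, in milliseconds."""
    return 1000.0 * hop_samples / sample_rate_hz


# Hypothetical hop of 480 samples at the full-band rate of 48 kHz.
hop_ms = frame_duration_ms(480, 48000)
# At a real-time factor of 0.02, each hop would take about this long to process.
compute_ms_per_hop = 0.02 * hop_ms
print(hop_ms, compute_ms_per_hop)
```

Under these assumed numbers, each 10 ms hop would need roughly 0.2 ms of compute, leaving most of the frame period free for other processing on the device.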

https://ieeexplore.ieee.org/abstract/document/11324336

Paper (.pdf)
Updated: 06-01-2026 09:57 - Size: 370.95 KB

BibTex record (.bib)
Updated: 06-01-2026 09:57 - Size: 349 B

Multi-Utterance Speech Separation and Association Trained on Short Segments [Conference]

Y. Wang, A. Politis, K. Drossos, and T. Virtanen, "Multi-Utterance Speech Separation and Association Trained on Short Segments," in Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Tahoe City, CA, USA, 2025

Current deep neural network (DNN) based speech separation faces a fundamental challenge — while the models need to be trained on short segments due to computational constraints, real-world applications typically require processing significantly longer recordings with multiple utterances per speaker than seen during training. In this paper, we investigate how existing approaches perform in this challenging scenario and propose a frequency-temporal recurrent neural network (FTRNN) that effectively bridges this gap. Our FTRNN employs a full-band module to model frequency dependencies within each time frame and a sub-band module that models temporal patterns in each frequency band. Despite being trained on short fixed-length segments of 10 s, our model demonstrates robust separation when processing signals significantly longer than training segments (21-121 s) and preserves speaker association across utterance gaps exceeding those seen during training. Unlike the conventional segment-separation-stitch paradigm, our lightweight approach (0.9 M parameters) performs inference on long audio without segmentation, eliminating segment boundary distortions while simplifying deployment. Experimental results demonstrate the generalization ability of FTRNN for multi-utterance speech separation and speaker association.

https://ieeexplore.ieee.org/abstract/document/11230969

Paper (.pdf)
Updated: 26-05-2026 13:13 - Size: 2.53 MB

BibTex record (.bib)
Updated: 26-05-2026 13:13 - Size: 384 B

Apparatus, methods and computer programs for noise suppression [Patents]

P. Tsiaflakis, M. T. Tammi, and K. Drosos, “Apparatus, methods and computer programs for noise suppression,” U.S. Patent US20250210055A1, filed Dec. 20, 2024; published Jun. 26, 2025

Examples of the disclosure relate to noise suppression for audio signals in a communication setting. An apparatus obtains at least one audio signal for a current frame or one or more previous frames, based on at least two microphone signals for the current frame or one or more previous frames. The apparatus uses a program code to predict an output signal for a future frame based, at least in part, on the at least one audio signal for the current frame or one or more previous frames and uses the output signal for processing the future frame of the at least two microphone signals in a first audio signal process and uses the output signal for processing the future frame of an output of the first audio signal process in a second audio signal process to enable noise suppression.

https://patents.google.com/patent/US20250210055A1/en

Adversarial Representation Learning for Robust Privacy Preservation in Audio [Journal]

S. Gharib, M. Tran, D. Luong, K. Drossos and T. Virtanen, "Adversarial Representation Learning for Robust Privacy Preservation in Audio," in IEEE Open Journal of Signal Processing, vol. 5, pp. 294-302, 2024

Sound event detection systems are widely used in various applications such as surveillance and environmental monitoring where data is automatically collected, processed, and sent to a cloud for sound recognition. However, this process may inadvertently reveal sensitive information about users or their surroundings, hence raising privacy concerns. In this study, we propose a novel adversarial training method for learning representations of audio recordings that effectively prevents the detection of speech activity from the latent features of the recordings. The proposed method trains a model to generate invariant latent representations of speech-containing audio recordings that cannot be distinguished from non-speech recordings by a speech classifier. The novelty of our work is in the optimization algorithm, where the speech classifier's weights are regularly replaced with the weights of classifiers trained in a supervised manner. This increases the discrimination power of the speech classifier constantly during the adversarial training, motivating the model to generate latent representations in which speech is not distinguishable, even using new speech classifiers trained outside the adversarial training loop. The proposed method is evaluated against a baseline approach with no privacy measures and a prior adversarial training method, demonstrating a significant reduction in privacy violations compared to the baseline approach. Additionally, we show that the prior adversarial method is practically ineffective for this purpose.

https://ieeexplore.ieee.org/document/10379095

Paper (.pdf)
Updated: 21-09-2025 16:56 - Size: 4.97 MB

BibTex record (.bib)
Updated: 21-09-2025 16:56 - Size: 326 B

The role of acoustic features of maternal infant-directed singing in enhancing infant sensorimotor, language and socioemotional development [Journal]

R.-L. Punamäki, S. Y. Diab, K. Drosos, S. R. Qouta, and M. Vänskä, “The role of acoustic features of maternal infant-directed singing in enhancing infant sensorimotor, language and socioemotional development,” Infant Behavior and Development, vol. 74, p. 101908, 2024

The quality of infant-directed speech (IDS) and infant-directed singing (IDSi) are considered vital to children, but empirical studies on protomusical qualities of the IDSi influencing infant development are rare. The current prospective study examines the role of IDSi acoustic features, such as pitch variability, shape and movement, and vocal amplitude vibration, timbre, and resonance, in associating with infant sensorimotor, language, and socioemotional development at six and 18 months. The sample consists of 236 Palestinian mothers from the Gaza Strip singing to their six-month-olds a song of their own choice. Maternal IDSi was recorded and analyzed by the OpenSMILE tool to depict main acoustic features of pitch frequencies, variations, and contours, vocal intensity, resonance formants, and power. The results are based on 219 completed maternal IDSi recordings. Mothers reported about their infants’ sensorimotor, language-vocalization, and socioemotional skills at six months, and psychologists tested these skills by Bayley Scales for Infant Development at 18 months. Results show that maternal IDSi characterized by wide pitch variability and rich and high vocal amplitude and vibration were associated with infants’ optimal sensorimotor, language vocalization, and socioemotional skills at six months, and rich and high vocal amplitude and vibration predicted these optimal developmental skills also at 18 months. High resonance and rhythmicity formants were associated with optimal language and vocalization skills at six months. To conclude, the IDSi is considered important in enhancing newborn and risk infants’ wellbeing, and the current findings argue that favorable acoustic singing qualities are crucial for optimal multidomain development across infancy.

https://www.sciencedirect.com/science/article/pii/S0163638323001005

Paper (.pdf)
Updated: 21-09-2025 17:01 - Size: 699.5 KB

BibTex record (.bib)
Updated: 21-09-2025 17:01 - Size: 401 B

Development of a speech emotion recognizer for large-scale child-centered audio recordings from a hospital environment [Journal]

Einari Vaaras, Sari Ahlqvist-Björkroth, Konstantinos Drossos, Liisa Lehtonen, and Okko Räsänen, “Development of a speech emotion recognizer for large-scale child-centered audio recordings from a hospital environment,” Speech Communication, vol. 148, pp. 9-22, 2023

In order to study how early emotional experiences shape infant development, one approach is to analyze the emotional content of speech heard by infants, as captured by child-centered daylong recordings, and as analyzed by automatic speech emotion recognition (SER) systems. However, since large-scale daylong audio is initially unannotated and differs from typical speech corpora from controlled environments, there are no existing in-domain SER systems for the task. Based on existing literature, it is also unclear what is the best approach to deploy a SER system for a new domain. Consequently, in this study, we investigated alternative strategies for deploying a SER system for large-scale child-centered audio recordings from a neonatal hospital environment, comparing cross corpus generalization, active learning (AL), and domain adaptation (DA) methods in the process. We first conducted simulations with existing emotion-labeled speech corpora to find the best strategy for SER system deployment. We then tested how the findings generalize to our new initially unannotated dataset. As a result, we found that the studied AL method provided overall the most consistent results, being less dependent on the specifics of the training corpora or speech features compared to the alternative methods. However, in situations without the possibility to annotate data, unsupervised DA proved to be the best approach. We also observed that deployment of a SER system for real-world daylong child-centered audio recordings achieved a SER performance level comparable to those reported in literature, and that the amount of human effort required for the system deployment was overall relatively modest.

https://www.sciencedirect.com/science/article/pii/S0167639323000262

Paper (.pdf)
Updated: 21-09-2025 16:40 - Size: 1.25 MB

BibTex record (.bib)
Updated: 21-09-2025 16:45 - Size: 377 B

Representation Learning for Audio Privacy Preservation Using Source Separation and Robust Adversarial Learning [Conference]

D. Luong, M. Tran, S. Gharib, K. Drossos and T. Virtanen, "Representation Learning for Audio Privacy Preservation Using Source Separation and Robust Adversarial Learning," IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), NY, USA, 2023

Privacy preservation has long been a concern in smart acoustic monitoring systems, where speech can be passively recorded along with a target signal in the system’s operating environment. In this study, we propose the integration of two commonly used approaches in privacy preservation: source separation and adversarial representation learning. The proposed system learns the latent representation of audio recordings such that it prevents differentiating between speech and non speech recordings. Initially, the source separation network filters out some of the privacy-sensitive data, and during the adversarial learning process, the system will learn privacy-preserving representation on the filtered signal. We demonstrate the effectiveness of our proposed method by comparing our method against systems without source separation, without adversarial learning, and without both. Overall, our results suggest that the proposed system can significantly improve speech privacy preservation compared to that of using source separation or adversarial learning solely while maintaining good performance in the acoustic monitoring task.

https://ieeexplore.ieee.org/document/10248153

Paper (.pdf)
Updated: 21-09-2025 16:50 - Size: 1.6 MB

BibTex record (.bib)
Updated: 21-09-2025 16:50 - Size: 382 B

Privacy-preserving sound representation [Patents]

T. Virtanen, T. Heittola, S. Zhao, S. Gharib, and K. Drosos, “Privacy-preserving sound representation,” U.S. Patent US20230317086A1, filed Oct. 5, 2022; published Oct. 12, 2023

According to an example embodiment, a method (200) for audio-based monitoring is provided, the method (200) comprising: deriving (202), via usage of a predefined conversion model (M), based on audio data that represents sounds captured in a monitored space, one or more audio features that are descriptive of at least one characteristic of said sounds; identifying (204) respective occurrences of one or more predefined acoustic events in said space based on the one or more audio features; and carrying out (206), in response to identifying an occurrence of at least one of said one or more predefined acoustic events, one or more predefined actions associated with said at least one of said one or more predefined acoustic events, wherein said conversion model (M) is trained to provide said one or more audio features such that they include information that facilitates identification of respective occurrences of said one or more predefined acoustic events while preventing identification of speech characteristics.

https://patents.google.com/patent/US20230317086A1/en

Patent (pdf)
Updated: 06-01-2026 10:00 - Size: 1.56 MB

BibTex record (.bib)
Updated: 06-01-2026 10:03 - Size: 497 B

Clotho-AQA: A crowdsourced dataset for audio question answering [Conference]

S. Lipping, P. Sudarsanam, K. Drossos, and T. Virtanen, "Clotho-AQA: A crowdsourced dataset for audio question answering," in Proceedings of the 30th European Signal Processing Conference (EUSIPCO), pp. 1140-1144, Belgrade, Serbia, 2022

Audio question answering (AQA) is a multimodal translation task where a system analyzes an audio signal and a natural language question, to generate a desirable natural language answer. In this paper, we introduce Clotho-AQA, a dataset for audio question answering consisting of 1991 audio files, each between 15 and 30 seconds in duration, selected from the Clotho dataset. For each audio file, we collect six different questions and corresponding answers by crowdsourcing using Amazon Mechanical Turk. The questions and answers are produced by different annotators. Out of the six questions for each audio file, two questions each are designed to have ‘yes’ and ‘no’ as answers, while the remaining two questions have other single-word answers. For each question, we collect answers from three different annotators. We also present two baseline experiments to describe the usage of our dataset for the AQA task: a long short-term memory (LSTM) based multimodal binary classifier for ‘yes’ or ‘no’ type answers and an LSTM based multimodal multi-class classifier for 828 single-word answers. The binary classifier achieved an accuracy of 62.7% and the multi-class classifier achieved a top-1 accuracy of 54.2% and a top-5 accuracy of 93.7%. The Clotho-AQA dataset is freely available online at https://zenodo.org/record/6473207.

https://ieeexplore.ieee.org/document/9909680

Paper (.pdf)
Updated: 10-09-2025 12:46 - Size: 313.61 KB

BibTex record (.bib)
Updated: 10-09-2025 12:46 - Size: 309 B

Domestic activity clustering from audio via depthwise separable convolutional autoencoder network [Conference]

Y. Li, W. Cao, K. Drossos and T. Virtanen, "Domestic Activity Clustering from Audio via Depthwise Separable Convolutional Autoencoder Network," in Proceedings of the IEEE 24th International Workshop on Multimedia Signal Processing (MMSP), Shanghai, China, pp.1-6, 2022

Automatic estimation of domestic activities from audio can be used to solve many problems, such as reducing the labor cost of nursing elderly people. This study focuses on solving the problem of domestic activity clustering from audio. The target of domestic activity clustering is to cluster audio clips which belong to the same category of domestic activity into one cluster in an unsupervised way. In this paper, we propose a method of domestic activity clustering using a depthwise separable convolutional autoencoder network. In the proposed method, initial embeddings are learned by the depthwise separable convolutional autoencoder, and a clustering-oriented loss is designed to jointly optimize embedding refinement and cluster assignment. Different methods are evaluated on a public dataset (a derivative of the SINS dataset) used in the challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) in 2018. Our method obtains a normalized mutual information (NMI) score of 54.46% and a clustering accuracy (CA) score of 63.64%, and outperforms state-of-the-art methods in terms of NMI and CA. In addition, both the computational complexity and the memory requirement of our method are lower than those of previous deep-model-based methods.

https://ieeexplore.ieee.org/document/9949512

Paper (.pdf)
Updated: 10-09-2025 12:51 - Size: 1.42 MB

BibTex record (.bib)
Updated: 10-09-2025 12:51 - Size: 341 B

Unsupervised Audio-Caption Aligning Learns Correspondences between Individual Sound Events and Textual Phrases [Conference]

H. Xie, O. Räsänen, K. Drossos, and T. Virtanen, "Unsupervised Audio-Caption Aligning Learns Correspondences between Individual Sound Events and Textual Phrases," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 22-27, Singapore, Singapore, 2022

We investigate unsupervised learning of correspondences between sound events and textual phrases through aligning audio clips with textual captions describing the content of a whole audio clip. We align originally unaligned and unannotated audio clips and their captions by scoring the similarities between audio frames and words, as encoded by modality-specific encoders and using a ranking-loss criterion to optimize the model. After training, we obtain clip-caption similarity by averaging frame-word similarities and estimate event-phrase correspondences by calculating frame-phrase similarities. We evaluate the method with two cross-modal tasks: audio-caption retrieval, and phrase-based sound event detection (SED). Experimental results show that the proposed method can globally associate audio clips with captions as well as locally learn correspondences between individual sound events and textual phrases in an unsupervised manner.

https://arxiv.org/abs/2110.02939

Paper (.pdf)
Updated: 15-03-2022 09:14 - Size: 735.95 KB

Design Recommendations for a Collaborative Game of Bird Call Recognition Based on Internet of Sound Practices [Journal]

E. Rovithis, N. Moustakas, K. Vogklis, K. Drossos, and A. Floros, "Design Recommendations for a Collaborative Game of Bird Call Recognition Based on Internet of Sound Practices," in Journal of the Audio Engineering Society, vol. 69 (12), pp. 956-966, 2021, doi: 10.17743/jaes.2021.0043

Citizen Science aims to engage people in research activities on important issues related to their well-being. Smart Cities aim to provide them with services that improve the quality of their life. Both concepts have seen significant growth in the last years and can be further enhanced by combining their purposes with Internet of Things technologies that allow for dynamic and large-scale communication and interaction. However, exciting and retaining the interest of participants is a key factor for such initiatives. In this paper we suggest that engagement in Citizen Science projects applied on Smart Cities infrastructure can be enhanced through contextual and structural game elements realized through augmented audio interactive mechanisms. Our interdisciplinary framework is described through the paradigm of a collaborative bird call recognition game, in which users collect and submit audio data that are then classified and used for augmenting physical space. We discuss the Playful Learning, Internet of Audio Things, and Bird Monitoring principles that shaped the design of our paradigm and analyze the design issues of its potential technical implementation.

https://www.aes.org/e-lib/browse.cfm?elib=21544

Paper (.pdf)
Updated: 15-03-2022 08:53 - Size: 3.53 MB

BibTeX record (.bib)
Updated: 15-03-2022 08:55 - Size: 448 B

Enriched Music Representations with Multiple Cross-modal Contrastive Learning [Journal]

A. Ferraro, X. Favory, K. Drossos, Y. Kim and D. Bogdanov, "Enriched Music Representations with Multiple Cross-modal Contrastive Learning," in IEEE Signal Processing Letters, doi: 10.1109/LSP.2021.3071082.

Modeling various aspects that make a music piece unique is a challenging task, requiring the combination of multiple sources of information. Deep learning is commonly used to obtain representations using various sources of information, such as the audio, interactions between users and songs, or associated genre metadata. Recently, contrastive learning has led to representations that generalize better compared to traditional supervised methods. In this paper, we present a novel approach that combines multiple types of information related to music using cross-modal contrastive learning, allowing us to learn an audio feature from heterogeneous data simultaneously. We align the latent representations obtained from playlists-track interactions, genre metadata, and the tracks' audio, by maximizing the agreement between these modality representations using a contrastive loss. We evaluate our approach in three tasks, namely, genre classification, playlist continuation and automatic tagging. We compare the performances with a baseline audio-based CNN trained to predict these modalities. We also study the importance of including multiple sources of information when training our embedding model. The results suggest that the proposed method outperforms the baseline in all the three downstream tasks and achieves comparable performance to the state-of-the-art.

https://doi.org/10.1109/LSP.2021.3071082

Paper (.pdf)
Updated: 08-04-2021 11:35 - Size: 548.7 KB

BibTeX record (.bib)
Updated: 08-04-2021 11:36 - Size: 385 B

Assessment of Self-Attention on Learned Features For Sound Event Localization and Detection [Conference]

P. Sudarsanam, A. Politis, and K. Drossos, "Assessment of Self-Attention on Learned Features For Sound Event Localization and Detection," in Proceedings of the 6th Detection and Classification of Acoustic Scenes and Events (DCASE) Workshop, pp. 100-104, Barcelona, Spain, 2021

Joint sound event localization and detection (SELD) is an emerging audio signal processing task adding spatial dimensions to acoustic scene analysis and sound event detection. A popular approach to modeling SELD jointly is using convolutional recurrent neural network (CRNN) models, where CNNs learn high-level features from multi-channel audio input and the RNNs learn temporal relationships from these high-level features. However, RNNs have some drawbacks, such as a limited capability to model long temporal dependencies and slow training and inference times due to their sequential processing nature. Recently, a few SELD studies used multi-head self-attention (MHSA), among other innovations in their models. MHSA and the related transformer networks have shown state-of-the-art performance in various domains. While they can model long temporal dependencies, they can also be parallelized efficiently. In this paper, we study in detail the effect of MHSA on the SELD task. Specifically, we examined the effects of replacing the RNN blocks with self-attention layers. We studied the influence of stacking multiple self-attention blocks, using multiple attention heads in each self-attention block, and the effect of position embeddings and layer normalization. Evaluation on the DCASE 2021 SELD (task 3) development data set shows a significant improvement in all employed metrics compared to the baseline CRNN accompanying the task.

https://arxiv.org/abs/2107.09388

Paper (.pdf)
Updated: 15-03-2022 09:26 - Size: 249.15 KB

Automatic analysis of the emotional content of speech in daylong child-centered recordings from a neonatal intensive care unit [Conference]

E. Vaaras, S. Ahlqvist-Björkroth, K. Drossos, O. Räsänen, "Automatic Analysis of the Emotional Content of Speech in Daylong Child-Centered Recordings from a Neonatal Intensive Care Unit", in Proceedings of Interspeech 2021, pp. 3380-3384, Brno, Czech Republic, 2021

Researchers have recently started to study how the emotional speech heard by young infants can affect their developmental outcomes. As a part of this research, hundreds of hours of daylong recordings from preterm infants' audio environments were collected from two hospitals in Finland and Estonia in the context of the so-called APPLE study. In order to analyze the emotional content of speech in such a massive dataset, an automatic speech emotion recognition (SER) system is required. However, there are no emotion labels or existing in-domain SER systems to be used for this purpose. In this paper, we introduce this initially unannotated large-scale real-world audio dataset and describe the development of a functional SER system for the Finnish subset of the data. We explore the effectiveness of alternative state-of-the-art techniques to deploy a SER system to a new domain, comparing cross-corpus generalization, WGAN-based domain adaptation, and active learning in the task. As a result, we show that the best-performing models are able to achieve a classification performance of 73.4% unweighted average recall (UAR) and 73.2% UAR for a binary classification for valence and arousal, respectively. The results also show that active learning achieves the most consistent performance compared to the two alternatives.

https://arxiv.org/abs/2106.09539

BibTex record (.bib)
Updated: 08-09-2025 13:00 - Size: 383 B

Paper (.pdf)
Updated: 08-09-2025 13:00 - Size: 498.29 KB

Continual Learning for Automated Audio Captioning Using The Learning Without Forgetting Approach [Conference]

J. Berg and K. Drossos, "Continual Learning for Automated Audio Captioning Using The Learning Without Forgetting Approach," in Proceedings of the 6th Detection and Classification of Acoustic Scenes and Events (DCASE) Workshop, pp. 140-144, Barcelona, Spain, 2021

Automated audio captioning (AAC) is the task of automatically creating textual descriptions (i.e. captions) for the contents of a general audio signal. Most AAC methods are using existing datasets to optimize and/or evaluate upon. Given the limited information held by the AAC datasets, it is very likely that AAC methods learn only the information contained in the utilized datasets. In this paper we present a first approach for continuously adapting an AAC method to new information, using a continual learning method. In our scenario, a pre-optimized AAC method is used for some unseen general audio signals and can update its parameters in order to adapt to the new information, given a new reference caption. We evaluate our method using a freely available, pre-optimized AAC method and two freely available AAC datasets. We compare our proposed method with three scenarios, two of training on one of the datasets and evaluating on the other and a third of training on one dataset and fine-tuning on the other. Obtained results show that our method achieves a good balance between distilling new knowledge and not forgetting the previous one.

https://arxiv.org/abs/2107.08028

Paper (.pdf)
Updated: 15-03-2022 09:25 - Size: 337.28 KB

BibTex record (.bib)
Updated: 15-03-2022 11:28 - Size: 609 B

Evaluating Off-the-Shelf Machine Listening and Natural Language Models for Automated Audio Captioning [Conference]

B. Weck, X. Favory, K. Drossos, and X. Serra, "Evaluating Off-the-Shelf Machine Listening and Natural Language Models for Automated Audio Captioning," in Proceedings of the 6th Detection and Classification of Acoustic Scenes and Events (DCASE) Workshop, pp. 60-64, Barcelona, Spain, 2021

Automated audio captioning (AAC) is the task of automatically generating textual descriptions for general audio signals. A captioning system has to identify various information from the input signal and express it with natural language. Existing works mainly focus on investigating new methods and try to improve their performance measured on existing datasets. Having attracted attention only recently, very few works on AAC study the performance of existing pre-trained audio and natural language processing resources. In this paper, we evaluate the performance of off-the-shelf models with a Transformer-based captioning approach. We utilize the freely available Clotho dataset to compare four different pre-trained machine listening models, four word embedding models, and their combinations in many different settings. Our evaluation suggests that YAMNet combined with BERT embeddings produces the best captions. Moreover, in general, fine-tuning pre-trained word embeddings can lead to better performance. Finally, we show that sequences of audio embeddings can be processed using a Transformer encoder to produce higher-quality captions.

https://arxiv.org/abs/2110.07410

Paper (.pdf)
Updated: 15-03-2022 09:27 - Size: 426.4 KB

Fairness and underspecification in acoustic scene classification: The case for disaggregated evaluations [Conference]

A. Triantafyllopoulos, M. Milling, K. Drossos, and B. - W. Schuller, "Fairness and underspecification in acoustic scene classification: The case for disaggregated evaluations," in Proceedings of the 6th Detection and Classification of Acoustic Scenes and Events (DCASE) Workshop, pp. 70-74, Barcelona, Spain, 2021

Underspecification and fairness in machine learning (ML) applications have recently become two prominent issues in the ML community. Acoustic scene classification (ASC) applications have so far remained unaffected by this discussion, but are now becoming increasingly used in real-world systems where fairness and reliability are critical aspects. In this work, we argue for the need of a more holistic evaluation process for ASC models through disaggregated evaluations. This entails taking into account performance differences across several factors, such as city, location, and recording device. Although these factors play a well-understood role in the performance of ASC models, most works report single evaluation metrics taking into account all different strata of a particular dataset. We argue that metrics computed on specific sub-populations of the underlying data contain valuable information about the expected real-world behaviour of proposed systems, and their reporting could improve the transparency and trustability of such systems. We demonstrate the effectiveness of the proposed evaluation process in uncovering underspecification and fairness problems exhibited by several standard ML architectures when trained on two widely-used ASC datasets. Our evaluation shows that all examined architectures exhibit large biases across all factors taken into consideration, and in particular with respect to the recording location. Additionally, different architectures exhibit different biases even though they are trained with the same experimental configurations.

https://arxiv.org/abs/2110.01506

Paper (.pdf)
Updated: 15-03-2022 09:19 - Size: 254.81 KB

Learning Contextual Tag Embeddings for Cross-Modal Alignment of Audio and Tags [Conference]

X. Favory, K. Drossos, T. Virtanen, and X. Serra, "Learning Contextual Tag Embeddings for Cross-Modal Alignment of Audio and Tags," in proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Jun. 6-11, Toronto, Canada, 2021

Self-supervised audio representation learning offers an attractive alternative for obtaining generic audio embeddings, capable of being employed in various downstream tasks. Published approaches that consider both audio and words/tags associated with audio do not employ text processing models that are capable of generalizing to tags unknown during training. In this work we propose a method for learning audio representations using an audio autoencoder (AAE), a general word embeddings model (WEM), and a multi-head self-attention (MHA) mechanism. MHA attends on the output of the WEM, providing a contextualized representation of the tags associated with the audio, and we align the output of MHA with the output of the encoder of AAE using a contrastive loss. We jointly optimize AAE and MHA and we evaluate the audio representations (i.e. the output of the encoder of AAE) by utilizing them in three different downstream tasks, namely sound, music genre, and music instrument classification. Our results show that employing multi-head self-attention with multiple heads in the tag-based network can induce better learned audio representations.

https://arxiv.org/abs/2010.14171

Paper (.pdf)
Updated: 08-04-2021 12:13 - Size: 415.41 KB

BibTeX record (.bib)
Updated: 08-04-2021 12:13 - Size: 351 B

Towards Sonification in Multimodal and User-friendly Explainable Artificial Intelligence [Conference]

B. - W. Schuller, T. Virtanen, M. Riveiro, G. Rizos, J. Han, A. Mesaros, and K. Drossos, "Towards Sonification in Multimodal and User-friendly Explainable Artificial Intelligence," in Proceedings of the 23rd ACM International Conference on Multimodal Interaction, Oct 18-22, Montreal, Canada, 2021

We are largely used to hearing explanations. For example, if someone thinks you are sad today, they might reply to your “why?” with “because you were so Hmmmmm-mmm-mmm”. Today’s Artificial Intelligence (AI), however, is – if at all – largely providing explanations of decisions in a visual or textual manner. While such approaches are good for communication via visual media such as in research papers or screens of intelligent devices, they may not always be the best way to explain; especially when the end user is not an expert. In particular, when the AI’s task is about Audio Intelligence, visual explanations appear less intuitive than audible, sonified ones. Sonification has also great potential for explainable AI (XAI) in systems that deal with non-audio data – for example, because it does not require visual contact or active attention of a user. Hence, sonified explanations of AI decisions face a challenging, yet highly promising and pioneering task. That involves incorporating innovative XAI algorithms to allow pointing back at the learning data responsible for decisions made by an AI, and to include decomposition of the data to identify salient aspects. It further aims to identify the components of the preprocessing, feature representation, and learnt attention patterns that are responsible for the decisions. Finally, it targets decision-making at the model-level, to provide a holistic explanation of the chain of processing in typical pattern recognition problems from end-to-end. Sonified AI explanations will need to unite methods for sonification of the identified aspects that benefit decisions, decomposition and recomposition of audio to sonify which parts in the audio were responsible for the decision, and rendering attention patterns and salient feature representations audible. Benchmarking sonified XAI is challenging, as it will require a comparison against a backdrop of existing, state-of-the-art visual and textual alternatives, as well as synergistic complementation of all modalities in user evaluations. Sonified AI explanations will need to target different user groups to allow personalisation of the sonification experience for different user needs, to lead to a major breakthrough in comprehensibility of AI via hearing how decisions are made, hence supporting tomorrow’s humane AI’s trustability. Here, we introduce and motivate the general idea, and provide accompanying considerations including milestones of realisation of sonified XAI and foreseeable risks.

https://opus.bibliothek.uni-augsburg.de/opus4/frontdoor/deliver/index/docId/91495/file/91495.pdf

Paper (.pdf)
Updated: 15-03-2022 09:05 - Size: 485.65 KB

WaveTransformer: An Architecture for Audio Captioning Based on Learning Temporal and Time-Frequency Information [Conference]

A. Tran, K. Drossos, and T. Virtanen, "WaveTransformer: An Architecture for Audio Captioning Based on Learning Temporal and Time-Frequency Information," in proceedings of 29th European Signal Processing Conference (EUSIPCO), Aug. 23-27, Dublin, Ireland, 2021

Automated audio captioning (AAC) is a novel task, where a method takes as an input an audio sample and outputs a textual description (i.e. a caption) of its contents. Most AAC methods are adapted from image captioning or machine translation fields. In this work, we present a novel AAC method, explicitly focused on the exploitation of the temporal and time-frequency patterns in audio. We employ three learnable processes for audio encoding, two for extracting the local and temporal information, and one to merge the output of the previous two processes. To generate the caption, we employ the widely used Transformer decoder. We assess our method utilizing the freely available splits of the Clotho dataset. Our results increase previously reported highest SPIDEr to 17.3, from 16.2.

Paper (.pdf)
Updated: 06-05-2021 10:41 - Size: 230.2 KB

BibTeX record (.bib)
Updated: 06-05-2021 10:41 - Size: 336 B

Clotho: An Audio Captioning Dataset [Conference]

Konstantinos Drossos, Samuel Lipping, Tuomas Virtanen, “Clotho: An Audio Captioning Dataset,” in proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 04–08, Barcelona, Spain, 2020

Audio captioning is the novel task of general audio content description using free text. It is an intermodal translation task (not speech-to-text), where a system accepts as an input an audio signal and outputs the textual description (i.e. the caption) of that signal. In this paper we present Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds duration and 24 905 captions of eight to 20 words length, and a baseline method to provide initial results. Clotho is built with focus on audio content and caption diversity, and the splits of the data are not hampering the training or evaluation of methods. All sounds are from the Freesound platform, and captions are crowdsourced using Amazon Mechanical Turk and annotators from English speaking countries. Unique words, named entities, and speech transcription are removed with post-processing. Clotho is freely available online (https://zenodo.org/record/3490684).

https://ieeexplore.ieee.org/document/9052990/

Paper (.pdf)
Updated: 11-04-2020 19:51 - Size: 224.77 KB

BibTeX record (.bib)
Updated: 11-04-2020 19:56 - Size: 297 B

COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations [Conference]

Xavier Favory, Konstantinos Drossos, Tuomas Virtanen, and Xavier Serra, "COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations," in International Conference on Machine Learning (ICML), Workshop on Self-supervised learning in Audio and Speech, Jul. 17, virtually held, 2020

Audio representation learning based on deep neural networks (DNNs) emerged as an alternative approach to hand-crafted features. For achieving high performance, DNNs often need a large amount of annotated data which can be difficult and costly to obtain. In this paper, we propose a method for learning audio representations, aligning the learned latent representations of audio and associated tags. Aligning is done by maximizing the agreement of the latent representations of audio and tags, using a contrastive loss. The result is an audio embedding model which reflects acoustic and semantic characteristics of sounds. We evaluate the quality of our embedding model, measuring its performance as a feature extractor on three different tasks (namely, sound event recognition, and music genre and musical instrument classification), and investigate what type of characteristics the model captures. Our results are promising, sometimes on par with the state-of-the-art in the considered tasks, and the embeddings produced with our method are well correlated with some acoustic descriptors.

https://arxiv.org/abs/2006.08386

Paper (.pdf)
Updated: 28-08-2020 09:41 - Size: 483.83 KB

BibTex record (.bib)
Updated: 28-08-2020 09:49 - Size: 375 B

Depthwise Separable Convolutions Versus Recurrent Neural Networks for Monaural Singing Voice Separation [Conference]

Pyry Pyykkönen, Stylianos I. Mimilakis, Konstantinos Drossos, and Tuomas Virtanen, "Depthwise Separable Convolutions Versus Recurrent Neural Networks for Monaural Singing Voice Separation," in proceedings of the 22nd IEEE International Workshop on Multimedia Signal Processing (MMSP), Sep. 21-24, Tampere, Finland, 2020

Recent approaches for music source separation are almost exclusively based on deep neural networks, mostly employing recurrent neural networks (RNNs). Although RNNs are in many cases superior to other types of deep neural networks for sequence processing, they are known to have specific difficulties in training and parallelization, especially for the typically long sequences encountered in music source separation. In this paper we present a use-case of replacing RNNs with depth-wise separable (DWS) convolutions, which are a lightweight and faster variant of the typical convolutions. We focus on singing voice separation, employing an RNN architecture, and we replace the RNNs with DWS convolutions (DWS-CNNs). We conduct an ablation study and examine the effect of the number of channels and layers of DWS-CNNs on the source separation performance, by utilizing the standard metrics of signal-to-artifacts, signal-to-interference, and signal-to-distortion ratio. Our results show that replacing RNNs with DWS-CNNs yields an improvement of 1.20, 0.06, and 0.37 dB, respectively, while using only 20.57% of the number of parameters of the RNN architecture.

https://arxiv.org/abs/2007.02683

Paper (.pdf)
Updated: 28-08-2020 09:47 - Size: 147.6 KB

BibTex record (.bib)
Updated: 28-08-2020 09:47 - Size: 376 B

Memory Requirement Reduction of Deep Neural Networks Using Low-bit Quantization of Parameters [Conference]

Niccoló Nicodemo, Gaurav Naithani, Konstantinos Drossos, Tuomas Virtanen, Roberto Saletti, "Memory Requirement Reduction of Deep Neural Networks Using Low-bit Quantization of Parameters," in proceedings of the 28th European Signal Processing Conference (EUSIPCO), Aug. 24 - 28, Amsterdam, Netherlands, 2020

Effective employment of deep neural networks (DNNs) in mobile devices and embedded systems is hampered by requirements for memory and computational power. This paper presents a non-uniform quantization approach which allows for dynamic quantization of DNN parameters for different layers and within the same layer. A virtual bit shift (VBS) scheme is also proposed to improve the accuracy of the proposed scheme. Our method reduces the memory requirements, preserving the performance of the network. The performance of our method is validated in a speech enhancement application, where a fully connected DNN is used to predict the clean speech spectrum from the input noisy speech spectrum. A DNN is optimized and its memory footprint and performance are evaluated using the short-time objective intelligibility, STOI, metric. The application of the low-bit quantization allows a 50% reduction of the DNN memory footprint while the STOI performance drops only by 2.7%.

https://arxiv.org/abs/1911.00527

Paper (.pdf)
Updated: 31-05-2020 15:14 - Size: 342.82 KB

BibTeX record (.bib)
Updated: 31-05-2020 15:27 - Size: 361 B

Multi-task Regularization Based on Infrequent Classes for Audio Captioning [Conference]

Emre Çakır, Konstantinos Drossos, Tuomas Virtanen, "Multi-task Regularization Based on Infrequent Classes for Audio Captioning," in proceedings of the International Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), Nov. 2-3, Tokyo, Japan (full virtual), 2020

Audio captioning is a multi-modal task, focusing on using natural language for describing the contents of general audio. Most audio captioning methods are based on deep neural networks, employing an encoder-decoder scheme and a dataset with audio clips and corresponding natural language descriptions (i.e. captions). A significant challenge for audio captioning is the distribution of words in the captions: some words are very frequent but acoustically non-informative, i.e. the function words (e.g. "a", "the"), and other words are infrequent but informative, i.e. the content words (e.g. adjectives, nouns). In this paper we propose two methods to mitigate this class imbalance problem. First, in an autoencoder setting for audio captioning, we weigh each word's contribution to the training loss inversely proportional to its number of occurrences in the whole dataset. Secondly, in addition to the multi-class, word-level audio captioning task, we define a multi-label side task based on clip-level content word detection by training a separate decoder. We use the loss from the second task to regularize the jointly trained encoder for the audio captioning task. We evaluate our method using Clotho, a recently published, wide-scale audio captioning dataset, and our results show a 37% relative improvement in the SPIDEr metric over the baseline method.
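As a rough illustration of the first idea in the abstract above, per-word loss weights that are inversely proportional to word counts could be computed as in the sketch below. The function name, the toy captions, and the normalisation (the rarest word gets weight 1.0) are assumptions made for this example and are not taken from the paper.

```python
from collections import Counter

def inverse_frequency_weights(captions):
    """Map each word to a weight inversely proportional to its dataset count."""
    counts = Counter(word for caption in captions for word in caption.split())
    # Assumption for this sketch: normalise so the rarest word has weight 1.0.
    min_count = min(counts.values())
    return {word: min_count / count for word, count in counts.items()}

# Toy captions (hypothetical): "a" is frequent, the content words are rare.
captions = ["a dog barks", "a car passes", "a door closes"]
weights = inverse_frequency_weights(captions)
print(weights["a"], weights["dog"])
```

In this toy case the function word "a" receives a third of the weight of each content word, so it contributes less to a weighted training loss.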

https://arxiv.org/abs/2007.04660

Paper (.pdf)
Updated: 11-09-2020 10:47 - Size: 188.35 KB

BibTex record (.bib)
Updated: 11-09-2020 10:47 - Size: 415 B

Multichannel Singing Voice Separation by Deep Neural Network Informed DOA Constrained CNMF [Conference]

Antonio J. Muñoz-Montoro, Julio J. Carabias-Orti, Archontis Politis, and Konstantinos Drossos, "Multichannel Singing Voice Separation by Deep Neural Network Informed DOA Constrained CNMF," in proceedings of the 22nd IEEE International Workshop on Multimedia Signal Processing (MMSP), Sep. 21-24, Tampere, Finland, 2020

This work addresses the problem of multichannel source separation combining two powerful approaches, multichannel spectral factorization with recent monophonic deep-learning (DL) based spectrum inference. Individual source spectra at different channels are estimated with a Masker-Denoiser Twin Network (MaD TwinNet), able to model long-term temporal patterns of a musical piece. The monophonic source spectrograms are used within a spatial covariance mixing model based on Complex Non-Negative Matrix Factorization (CNMF) that predicts the spatial characteristics of each source. The proposed framework is evaluated on the task of singing voice separation with a large multichannel dataset. Experimental results show that our joint DL+CNMF method outperforms both the individual monophonic DL-based separation and the multichannel CNMF baseline methods.

https://arxiv.org/abs/2003.01162

Paper (.pdf)
Updated: 28-08-2020 09:52 - Size: 173.71 KB

BibTex record (.bib)
Updated: 28-08-2020 09:52 - Size: 378 B

Sound Event Detection via Dilated Convolutional Recurrent Neural Networks [Conference]

Yanxiong Li, Mingle Liu, Konstantinos Drossos, Tuomas Virtanen, “Sound event detection via dilated convolutional recurrent neural networks,” in proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 04–08, Barcelona, Spain, 2020

Convolutional recurrent neural networks (CRNNs) have achieved state-of-the-art performance for sound event detection (SED). In this paper, we propose to use a dilated CRNN, namely a CRNN with a dilated convolutional kernel, as the classifier for the task of SED. We investigate the effectiveness of dilation operations which provide a CRNN with expanded receptive fields to capture long temporal context without increasing the amount of CRNN's parameters. Compared to the classifier of the baseline CRNN, the classifier of the dilated CRNN obtains a maximum increase of 1.9%, 6.3% and 2.5% at F1 score and a maximum decrease of 1.7%, 4.1% and 3.9% at error rate (ER), on the publicly available audio corpora of the TUT-SED Synthetic 2016, the TUT Sound Event 2016 and the TUT Sound Event 2017, respectively.
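The claim that dilation expands the receptive field without adding parameters can be checked with simple arithmetic. The sketch below assumes a stack of 1-D convolutions with stride 1; the kernel size and dilation rates are illustrative and not the ones used in the paper.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in frames) of stacked stride-1 convolutions."""
    rf = 1
    for dilation in dilations:
        # Each layer widens the field by (kernel_size - 1) * dilation frames.
        rf += (kernel_size - 1) * dilation
    return rf

def weights_per_layer(kernel_size, in_channels, out_channels):
    """Weight count of one 1-D convolution; it does not depend on dilation."""
    return kernel_size * in_channels * out_channels

print(receptive_field(3, [1, 1, 1]))  # three undilated layers
print(receptive_field(3, [1, 2, 4]))  # same layers with growing dilation
```

With these illustrative settings the dilated stack covers 15 frames instead of 7, while every layer keeps the same number of weights.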

https://ieeexplore.ieee.org/document/9054433/

Paper (.pdf)
Updated: 11-04-2020 19:47 - Size: 365.57 KB

BibTeX record (.bib)
Updated: 11-04-2020 19:56 - Size: 345 B

Sound Event Detection with Depthwise Separable and Dilated Convolutions [Conference]

Konstantinos Drossos, Stylianos I. Mimilakis, Shayan Gharib, Yanxiong Li, Tuomas Virtanen, “Sound Event Detection with Depthwise Separable and Dilated Convolutions,” in proceedings of the IEEE World Congress on Computational Intelligence/International Joint Conference on Neural Networks (WCCI/IJCNN), Jul. 19–24, Glasgow, Scotland, 2020

State-of-the-art sound event detection (SED) methods usually employ a series of convolutional neural networks (CNNs) to extract useful features from the input audio signal, and then recurrent neural networks (RNNs) to model longer temporal context in the extracted features. The number of the channels of the CNNs and size of the weight matrices of the RNNs have a direct effect on the total amount of parameters of the SED method, which amounts to a couple of millions. Additionally, the usually long sequences that are used as an input to an SED method, along with the employment of an RNN, introduce implications like increased training time, difficulty at gradient flow, and impeding the parallelization of the SED method. To tackle all these problems, we propose the replacement of the CNNs with depthwise separable convolutions and the replacement of the RNNs with dilated convolutions. We compare the proposed method to a baseline convolutional neural network on a SED task, and achieve a reduction of the amount of parameters by 85% and average training time per epoch by 78%, and an increase of the average frame-wise F1 score and reduction of the average error rate by 4.6% and 3.8%, respectively.
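To see why depthwise separable convolutions reduce the parameter count, the weights of a single layer can be counted directly. The sketch below ignores biases, and the layer size (5x5 kernel, 256 input and output channels) is hypothetical, so the resulting ratio is not the 85% figure reported for the whole method.

```python
def standard_conv_params(kernel, in_channels, out_channels):
    """Weights of a standard 2-D convolution with a square kernel (no bias)."""
    return kernel * kernel * in_channels * out_channels

def depthwise_separable_params(kernel, in_channels, out_channels):
    """Weights of a depthwise convolution followed by a 1x1 pointwise one."""
    depthwise = kernel * kernel * in_channels  # one spatial filter per channel
    pointwise = in_channels * out_channels     # 1x1 mixing across channels
    return depthwise + pointwise

standard = standard_conv_params(5, 256, 256)
separable = depthwise_separable_params(5, 256, 256)
print(standard, separable, round(separable / standard, 3))
```

For this hypothetical layer the separable version needs roughly 4% of the weights of the standard convolution.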

Paper (.pdf)
Updated: 11-04-2020 19:57 - Size: 414.35 KB

BibTeX record (.bib)
Updated: 11-04-2020 19:57 - Size: 337 B

Temporal Sub-sampling of Audio Feature Sequences for Automated Audio Captioning [Conference]

Khoa Nguyen, Konstantinos Drossos, Tuomas Virtanen, "Temporal Sub-sampling of Audio Feature Sequences for Automated Audio Captioning," in proceedings of the International Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), Nov. 2-3, Tokyo, Japan (full virtual), 2020

Audio captioning is the task of automatically creating a textual description for the contents of a general audio signal. Typical audio captioning methods rely on deep neural networks (DNNs), where the target of the DNN is to map the input audio sequence to an output sequence of words, i.e. the caption. Though, the length of the textual description is considerably less than the length of the audio signal, for example 10 words versus some thousands of audio feature vectors. This clearly indicates that an output word corresponds to multiple input feature vectors. In this work we present an approach that focuses on explicitly taking advantage of this difference of lengths between sequences, by applying a temporal sub-sampling to the audio input sequence. We employ a sequence-to-sequence method, which uses a fixed-length vector as an output from the encoder, and we apply temporal sub-sampling between the RNNs of the encoder. We evaluate the benefit of our approach by employing the freely available dataset Clotho and we evaluate the impact of different factors of temporal sub-sampling. Our results show an improvement to all considered metrics.
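A minimal sketch of temporal sub-sampling between encoder layers is given below, assuming that sub-sampling simply keeps every n-th time step. The factor of 2, the two sub-sampling stages, and the sequence length are illustrative choices, not the configuration evaluated in the paper.

```python
def subsample(sequence, factor):
    """Keep every `factor`-th time step of a sequence of feature vectors."""
    return sequence[::factor]

# Stand-in for 1000 audio feature vectors (hypothetical length).
frames = list(range(1000))
after_first_layer = subsample(frames, 2)
after_second_layer = subsample(after_first_layer, 2)
print(len(frames), len(after_first_layer), len(after_second_layer))
```

Each stage halves the sequence length here, which brings the length of the encoded audio sequence closer to that of the short output caption.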

https://arxiv.org/abs/2007.02676

Paper (.pdf)
Updated: 11-09-2020 10:51 - Size: 349.69 KB

BibTeX record (.bib)
Updated: 11-09-2020 10:51 - Size: 414 B

Unsupervised Interpretable Representation Learning for Singing Voice Separation [Conference]

Stylianos I. Mimilakis, Konstantinos Drossos, Gerald Schuller, "Unsupervised Interpretable Representation Learning for Singing Voice Separation," in proceedings of the 28th European Signal Processing Conference (EUSIPCO), Jan. 18 - 22 (2021), Amsterdam, Netherlands, 2020

In this work, we present a method for learning interpretable music signal representations directly from waveform signals. Our method can be trained using unsupervised objectives and relies on the denoising auto-encoder model that uses a simple sinusoidal model as decoding functions to reconstruct the singing voice. To demonstrate the benefits of our method, we employ the obtained representations to the task of informed singing voice separation via binary masking, and measure the obtained separation quality by means of scale-invariant signal to distortion ratio. Our findings suggest that our method is capable of learning meaningful representations for singing voice separation, while preserving conveniences of the short-time Fourier transform like non-negativity, smoothness, and reconstruction subject to time-frequency masking, that are desired in audio and music source separation.
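The scale-invariant signal to distortion ratio mentioned above is commonly computed by projecting the estimate onto the reference and comparing the energy of that projection with the energy of the residual. The pure-Python sketch below follows this common definition; it omits the usual zero-mean normalisation, and the toy signals are invented for illustration.

```python
import math

def si_sdr(estimate, reference):
    """Scale-invariant SDR in dB (assumes the estimate is not a perfect match)."""
    dot = sum(e * r for e, r in zip(estimate, reference))
    reference_energy = sum(r * r for r in reference)
    scale = dot / reference_energy
    # Scaled reference: the part of the estimate explained by the target.
    target = [scale * r for r in reference]
    residual = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    return 10.0 * math.log10(target_energy / residual_energy)

reference = [1.0, 0.0, -1.0, 0.0]
estimate = [2.0, 0.2, -2.0, -0.2]
halved = [0.5 * e for e in estimate]
print(si_sdr(estimate, reference), si_sdr(halved, reference))
```

Both calls give about 20 dB, showing that rescaling the estimate does not change the score.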

https://arxiv.org/abs/2003.01567

Paper (.pdf)
Updated: 31-05-2020 15:26 - Size: 399.09 KB

BibTeX record (.bib)
Updated: 31-05-2020 15:26 - Size: 315 B

Examining the Mapping Functions of Denoising Autoencoders in Music Source Separation [Journal]

Stylianos Ioannis Mimilakis, Konstantinos Drossos, Estefanía Cano, and Gerald Schuller, “Examining the Mapping Functions of Denoising Autoencoders in Music Source Separation,” in IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 28, pp. 262-278, 2019.

The goal of this work is to investigate what singing voice separation approaches based on neural networks learn from the data. We examine the mapping functions of neural networks based on the denoising autoencoder (DAE) model that are conditioned on the mixture magnitude spectra. To approximate the mapping functions, we propose an algorithm inspired by the knowledge distillation, denoted the neural couplings algorithm (NCA). The NCA yields a matrix that expresses the mapping of the mixture to the target source magnitude information. Using the NCA, we examine the mapping functions of three fundamental DAE-based models in music source separation; one with single-layer encoder and decoder, one with multi-layer encoder and single-layer decoder, and one using skip-filtering connections (SF) with a single-layer encoding and decoding. We first train these models with realistic data to estimate the singing voice magnitude spectra from the corresponding mixture. We then use the optimized models and test spectral data as input to the NCA. Our experimental findings show that approaches based on the DAE model learn scalar filtering operators, exhibiting a predominant diagonal structure in their corresponding mapping functions, limiting the exploitation of inter-frequency structure of music data. In contrast, skip-filtering connections are shown to assist the DAE model in learning filtering operators that exploit richer inter-frequency structures.

https://ieeexplore.ieee.org/document/8892665

BibTeX record (.bib)
Updated: 29-11-2019 11:18 - Size: 331 B

Crowdsourcing a Dataset of Audio Captions [Conference]

Samuel Lipping, Konstantinos Drossos, and Tuomas Virtanen, “Crowdsourcing a Dataset of Audio Captions,” in proceedings of the International Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), Oct. 26–27, New York, NY, U.S.A., 2019

Audio captioning is a novel field of multi-modal translation and it is the task of creating a textual description of the content of an audio signal (e.g. "people talking in a big room"). The creation of a dataset for this task requires a considerable amount of work, rendering crowdsourcing a very attractive option. In this paper we present a three-step framework for crowdsourcing an audio captioning dataset, based on concepts and practises followed for the creation of widely used image captioning and machine translation datasets. During the first step initial captions are gathered. A grammatically corrected and/or rephrased version of each initial caption is obtained in the second step. Finally, the initial and edited captions are rated, keeping the top ones for the produced dataset. We objectively evaluate the impact of our framework during the process of creating an audio captioning dataset, in terms of diversity and amount of typographical errors in the obtained captions. The obtained results show that the resulting dataset has fewer typographical errors than the initial captions, and on average each sound in the produced dataset has captions with a Jaccard similarity of 0.24, roughly equivalent to two ten-word captions having in common four words with the same root, indicating that the captions are dissimilar while they still contain some of the same information.
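The comparison at the end of the abstract can be checked with the standard definition of Jaccard similarity, the size of the intersection divided by the size of the union. In the sketch below the two ten-word captions are placeholder tokens, assuming all words within a caption are distinct.

```python
def jaccard(words_a, words_b):
    """Jaccard similarity between two collections of words."""
    set_a, set_b = set(words_a), set(words_b)
    return len(set_a & set_b) / len(set_a | set_b)

# Two hypothetical ten-word captions sharing four words.
shared = [f"shared{i}" for i in range(4)]
caption_a = shared + [f"a{i}" for i in range(6)]
caption_b = shared + [f"b{i}" for i in range(6)]
print(jaccard(caption_a, caption_b))
```

Four shared words out of sixteen distinct words give 4 / 16 = 0.25, close to the reported average of 0.24.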

https://arxiv.org/abs/1907.09238

Paper (.pdf)
Updated: 12-11-2019 23:59 - Size: 274.06 KB

BibTeX record (.bib)
Updated: 19-06-2021 10:21 - Size: 407 B

Language Modelling for Sound Event Detection with Teacher Forcing and Scheduled Sampling [Conference]

Konstantinos Drossos, Shayan Gharib, Paul Magron, and Tuomas Virtanen, “Language Modelling for Sound Event Detection with Teacher Forcing and Scheduled Sampling,” in proceedings of the International Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), Oct. 26–27, New York, NY, U.S.A., 2019

A sound event detection (SED) method typically takes as an input a sequence of audio frames and predicts the activities of sound events in each frame. In real-life recordings, the sound events exhibit some temporal structure: for instance, a "car horn" will likely be followed by a "car passing by". While this temporal structure is widely exploited in sequence prediction tasks (e.g., in machine translation), where language models (LM) are exploited, it is not satisfactorily modeled in SED. In this work we propose a method which allows a recurrent neural network (RNN) to learn an LM for the SED task. The method conditions the input of the RNN with the activities of classes at the previous time step. We evaluate our method using F1 score and error rate (ER) over three different and publicly available datasets; the TUT-SED Synthetic 2016 and the TUT Sound Events 2016 and 2017 datasets. The obtained results show an increase of 6% and 3% at the F1 (higher is better) and a decrease of 3% and 2% at ER (lower is better) for the TUT Sound Events 2016 and 2017 datasets, respectively, when using our method. On the contrary, with our method there is a decrease of 10% at F1 score and an increase of 11% at ER for the TUT-SED Synthetic 2016 dataset.
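The choice between teacher forcing and the model's own output, which is the core of scheduled sampling, can be sketched as below. The linear decay schedule, the function names, and the toy activity vectors are assumptions for illustration; the abstract does not specify the schedule used in the paper.

```python
import random

def teacher_probability(epoch, total_epochs):
    """Hypothetical linear schedule: 1.0 at the first epoch, 0.0 at the last."""
    return max(0.0, 1.0 - epoch / (total_epochs - 1))

def previous_step_input(true_prev, predicted_prev, teacher_prob, rng):
    """Pick what the RNN receives as the previous-step class activities."""
    # With probability teacher_prob use the ground truth (teacher forcing),
    # otherwise feed back the model's own previous prediction.
    return true_prev if rng.random() < teacher_prob else predicted_prev

rng = random.Random(0)
true_prev, predicted_prev = [1, 0, 0], [0, 1, 0]
for epoch in (0, 5, 9):
    prob = teacher_probability(epoch, total_epochs=10)
    chosen = previous_step_input(true_prev, predicted_prev, prob, rng)
    print(epoch, round(prob, 2), chosen)
```

Early in training the RNN mostly sees the true previous activities; later it increasingly sees its own predictions, which is closer to the conditions at test time.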

https://arxiv.org/abs/1907.08506

Paper (.pdf)
Updated: 01-11-2019 22:12 - Size: 598.06 KB

BibTeX record (.bib)
Updated: 29-11-2019 11:19 - Size: 467 B

Unsupervised Adversarial Domain Adaptation Based On The Wasserstein Distance For Acoustic Scene Classification [Conference]

Konstantinos Drossos, Paul Magron, and Tuomas Virtanen, “Unsupervised Adversarial Domain Adaptation Based On The Wasserstein Distance For Acoustic Scene Classification,” accepted for publication at the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Oct. 20–23, New Paltz, NY, U.S.A., 2019

A challenging problem in the deep learning-based machine listening field is the degradation of the performance when using data from unseen conditions. In this paper we focus on the acoustic scene classification (ASC) task and propose an adversarial deep learning method to allow adapting an acoustic scene classification system to deal with a new acoustic channel resulting from data captured with a different recording device. We build upon the theoretical model of HΔH-distance and a previous adversarial discriminative deep learning method for ASC unsupervised domain adaptation, and we present an adversarial training based method using the Wasserstein distance. We improve the state-of-the-art mean accuracy on the data from the unseen conditions from 32% to 45%, using the TUT Acoustic Scenes dataset.

https://ieeexplore.ieee.org/abstract/document/8937231

Paper (.pdf)
Updated: 01-11-2019 20:57 - Size: 352.4 KB

BibTeX record (.bib)
Updated: 29-11-2019 11:19 - Size: 355 B

Examining the Perceptual Effect of Alternative Objective Functions for Deep Learning Based Music Source Separation [Conference]

Stylianos Ioannis Mimilakis, Estefanía Cano, Derry Fitzgerald, Konstantinos Drossos, and Gerald Schuller, “Examining the Perceptual Effect of Alternative Objective Functions for Deep Learning Based Music Source Separation,” in proceedings of the IEEE Asilomar Conference on Signals, Systems, and Computers, Oct. 28–31, Pacific Grove, CA, U.S.A., 2018

In this study, we examine the effect of various objective functions used to optimize the recently proposed deep learning architecture for singing voice separation MaD - Masker and Denoiser. The parameters of the MaD architecture are optimized using an objective function that contains a reconstruction criterion between predicted and true magnitude spectra of the singing voice, and a regularization term. We examine various reconstruction criteria such as the generalized Kullback-Leibler, mean squared error, and noise to mask ratio. We also explore recently proposed, for optimizing MaD, regularization terms such as sparsity and TwinNetwork regularization. Results from both objective assessment and listening tests suggest that the TwinNetwork regularization results in improved singing voice separation quality.

https://ieeexplore.ieee.org/document/8645257

Paper (.pdf)
Updated: 13-11-2019 00:14 - Size: 294.89 KB

BibTex record (.bib)
Updated: 01-02-2020 09:20 - Size: 435 B

Harmonic-Percussive Source Separation with Deep Neural Networks and Phase Recovery [Conference]

Konstantinos Drossos, Paul Magron, Stylianos Ioannis Mimilakis, and Tuomas Virtanen, “Harmonic-Percussive Source Separation with Deep Neural Networks and Phase Recovery,” in proceedings of the 16th International Workshop on Acoustic Signal Enhancement (IWAENC), Sep. 17–20, Tokyo, Japan, 2018

Harmonic/percussive source separation (HPSS) consists in separating the pitched instruments from the percussive parts in a music mixture. In this paper, we propose to apply the recently introduced Masker-Denoiser with twin networks (MaD TwinNet) system to this task. MaD TwinNet is a deep learning architecture that has reached state-of-the-art results in monaural singing voice separation. Herein, we propose to apply it to HPSS by using it to estimate the magnitude spectrogram of the percussive source. Then, we retrieve the complex-valued short-time Fourier transform of the sources by means of a phase recovery algorithm, which minimizes the reconstruction error and enforces the phase of the harmonic part to follow a sinusoidal phase model. Experiments conducted on realistic music mixtures show that this novel separation system outperforms the previous state-of-the art kernel additive model approach.

https://ieeexplore.ieee.org/document/8521371

Paper (.pdf)
Updated: 13-11-2019 00:34 - Size: 1.07 MB

BibTex record (.bib)
Updated: 01-02-2020 09:18 - Size: 369 B

MaD TwinNet: Masker-Denoiser Architecture with Twin Networks for Monaural Sound Source Separation [Conference]

Konstantinos Drossos, Stylianos Ioannis Mimilakis, Dmitry Serdyuk, Gerald Schuller, Tuomas Virtanen, and Yoshua Bengio, “MaD TwinNet: Masker-Denoiser Architecture with Twin Networks for Monaural Sound Source Separation,” in proceedings of the IEEE World Congress on Computational Intelligence/International Joint Conference on Neural Networks (WCCI/IJCNN), Jul. 8–13, Rio de Janeiro, Brazil, 2018

Monaural singing voice separation task focuses on the prediction of the singing voice from a single channel music mixture signal. Current state of the art (SOTA) results in monaural singing voice separation are obtained with deep learning based methods. In this work we present a novel recurrent neural approach that learns long-term temporal patterns and structures of a musical piece. We build upon the recently proposed Masker-Denoiser (MaD) architecture and we enhance it with the Twin Networks, a technique to regularize a recurrent generative network using a backward running copy of the network. We evaluate our method using the Demixing Secret Dataset and we obtain an increment to signal-to-distortion ratio (SDR) of 0.37 dB and to signal-to-interference ratio (SIR) of 0.23 dB, compared to previous SOTA results.

https://ieeexplore.ieee.org/document/8489565

Paper (.pdf)
Updated: 13-11-2019 00:43 - Size: 534.27 KB

BibTex record (.bib)
Updated: 01-02-2020 09:18 - Size: 436 B

Monaural Singing Voice Separation with Skip-Filtering Connections and Recurrent Inference of Time-Frequency Mask [Conference]

Stylianos Ioannis Mimilakis, Konstantinos Drossos, João F. Santos, Gerald Schuller, Tuomas Virtanen, and Yoshua Bengio, “Monaural Singing Voice Separation with Skip-Filtering Connections and Recurrent Inference of Time-Frequency Mask,” in proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 15–20, Calgary, Canada, 2018

Singing voice separation based on deep learning relies on the usage of time-frequency masking. In many cases the masking process is not a learnable function or is not encapsulated into the deep learning optimization. Consequently, most of the existing methods rely on a post processing step using the generalized Wiener filtering. This work proposes a method that learns and optimizes (during training) a source-dependent mask and does not need the aforementioned post processing step. We introduce a recurrent inference algorithm, a sparse transformation step to improve the mask generation process, and a learned denoising filter. Obtained results show an increase of 0.49 dB for the signal to distortion ratio and 0.30 dB for the signal to interference ratio, compared to previous state-of-the-art approaches for monaural singing voice separation.

https://ieeexplore.ieee.org/document/8461822

Paper (.pdf)
Updated: 13-11-2019 00:46 - Size: 319.89 KB

BibTex record (.bib)
Updated: 01-02-2020 08:54 - Size: 480 B

Reducing Interference with Phase Recovery in DNN-based Monaural Singing Voice Separation [Conference]

Paul Magron, Konstantinos Drossos, Stylianos Ioannis Mimilakis, and Tuomas Virtanen, “Reducing Interference with Phase Recovery in DNN-based Monaural Singing Voice Separation,” in proceedings of the INTERSPEECH 2018, Sep. 2–6, Hyderabad, India, 2018

State-of-the-art methods for monaural singing voice separation consist in estimating the magnitude spectrum of the voice in the short-time Fourier transform (STFT) domain by means of deep neural networks (DNNs). The resulting magnitude estimate is then combined with the mixture's phase to retrieve the complex-valued STFT of the voice, which is further synthesized into a time-domain signal. However, when the sources overlap in time and frequency, the STFT phase of the voice differs from the mixture's phase, which results in interference and artifacts in the estimated signals. In this paper, we investigate recent phase recovery algorithms that tackle this issue and can further enhance the separation quality. These algorithms exploit phase constraints that originate from a sinusoidal model or from consistency, a property that is a direct consequence of the STFT redundancy. Experiments conducted on real music songs show that those algorithms are efficient for reducing interference in the estimated voice compared to the baseline approach.
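The baseline step described in the abstract, combining the estimated voice magnitude with the phase of the mixture, can be sketched per time-frequency bin as follows. The function name and the two toy bins are hypothetical; a real system would apply this to full STFT matrices before the inverse transform.

```python
import cmath

def apply_mixture_phase(voice_magnitudes, mixture_stft):
    """Give each estimated magnitude the phase of the matching mixture bin."""
    return [
        magnitude * cmath.exp(1j * cmath.phase(mixture_bin))
        for magnitude, mixture_bin in zip(voice_magnitudes, mixture_stft)
    ]

# Two toy time-frequency bins of a mixture and estimated voice magnitudes.
mixture_stft = [1 + 1j, -2 + 0j]
voice_magnitudes = [1.0, 0.5]
voice_stft = apply_mixture_phase(voice_magnitudes, mixture_stft)
print(voice_stft)
```

The resulting bins have the estimated magnitudes but exactly the mixture's phase, which is the source of the interference that phase recovery algorithms aim to reduce.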

https://www.isca-speech.org/archive/Interspeech_2018/abstracts/1845.html

Paper (.pdf)
Updated: 13-11-2019 00:41 - Size: 4.7 MB

BibTex record (.bib)
Updated: 01-02-2020 08:48 - Size: 412 B

Unsupervised Adversarial Domain Adaptation for Acoustic Scene Classification [Conference]

Shayan Gharib, Konstantinos Drossos, Emre Çakir, Dmitry Serdyuk, and Tuomas Virtanen, “Unsupervised adversarial domain adaptation for acoustic scene classification,” in proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE), Nov. 19–20, Surrey, U.K., 2018

A general problem in acoustic scene classification task is the mismatched conditions between training and testing data, which significantly reduces the performance of the developed methods on classification accuracy. As a countermeasure, we present the first method of unsupervised adversarial domain adaptation for acoustic scene classification. We employ a model pre-trained on data from one set of conditions and by using data from other set of conditions, we adapt the model in order that its output cannot be used for classifying the set of conditions that input data belong to. We use a freely available dataset from the DCASE 2018 challenge Task 1, subtask B, that contains data from mismatched recording devices. We consider the scenario where the annotations are available for the data recorded from one device, but not for the rest. Our results show that with our model agnostic method we can achieve ∼10% increase at the accuracy on an unseen and unlabeled dataset, while keeping almost the same performance on the labeled dataset.

https://arxiv.org/abs/1808.05777

Paper (.pdf)
Updated: 13-11-2019 00:05 - Size: 1.24 MB

BibTex record (.bib)
Updated: 01-02-2020 08:47 - Size: 511 B

A Recurrent Encoder-Decoder Approach with Skip-Filtering Connections for Monaural Singing Voice Separation [Conference]

Stylianos Ioannis Mimilakis, Konstantinos Drossos, Tuomas Virtanen, and Gerald Schuller, “A Recurrent Encoder-Decoder Approach with Skip-Filtering Connections for Monaural Singing Voice Separation,” in proceedings of the 27th IEEE International Workshop on Machine Learning for Signal Processing (MLSP), Sep. 25–28, Tokyo, Japan, 2017

The objective of deep learning methods based on encoder-decoder architectures for music source separation is to approximate either ideal time-frequency masks or spectral representations of the target music source(s). The spectral representations are then used to derive time-frequency masks. In this work we introduce a method to directly learn time-frequency masks from an observed mixture magnitude spectrum. We employ recurrent neural networks and train them using prior knowledge only for the magnitude spectrum of the target source. To assess the performance of the proposed method, we focus on the task of singing voice separation. The results from an objective evaluation show that our proposed method provides comparable results to deep learning based methods which operate over complicated signal representations. Compared to previous methods that approximate time-frequency masks, our method has increased performance of signal to distortion ratio by an average of 3.8 dB.

https://ieeexplore.ieee.org/document/8168117

Paper (.pdf)
Updated: 13-11-2019 00:51 - Size: 1.75 MB

BibTex record (.bib)
Updated: 01-02-2020 08:42 - Size: 414 B

Automated Audio Captioning with Recurrent Neural Networks [Conference]

Konstantinos Drossos, Sharath Adavanne, and Tuomas Virtanen, “Automated Audio Captioning with Recurrent Neural Networks,” in proceedings of the 11th IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Oct. 15–18, New Paltz, N.Y. U.S.A., 2017.

We present the first approach to automated audio captioning. We employ an encoder-decoder scheme with an alignment model in between. The input to the encoder is a sequence of log mel-band energies calculated from an audio file, while the output is a sequence of words, i.e. a caption. The encoder is a multi-layered, bi-directional gated recurrent unit (GRU) and the decoder a multi-layered GRU with a classification layer connected to the last GRU of the decoder. The classification layer and the alignment model are fully connected layers with shared weights between timesteps. The proposed method is evaluated using data drawn from a commercial sound effects library, ProSound Effects. The resulting captions were rated through metrics utilized in machine translation and image captioning fields. Results from metrics show that the proposed method can predict words appearing in the original caption, but not always correctly ordered.

https://ieeexplore.ieee.org/document/8170058

Paper (.pdf)
Updated: 13-11-2019 00:48 - Size: 205.6 KB

BibTex record (.bib)
Updated: 01-02-2020 08:40 - Size: 368 B

Close Miking Empirical Practice Verification: A Source Separation Approach [Conference]

Konstantinos Drossos, Stylianos Ioannis Mimilakis, Andreas Floros, Tuomas Virtanen, and Gerald Schuller, “Close Miking Empirical Practice Verification: A Source Separation Approach”, in proceedings of the 142nd Audio Engineering Society (AES) Convention, May 20–23, Berlin, Germany, 2017

Close miking represents a widely employed practice of placing a microphone very near to the sound source in order to capture more direct sound and minimize any pickup of ambient sound, including other, concurrently active sources. It has been used by the audio engineering community for decades for audio recording, based on a number of empirical rules that evolved during the recording practice itself. But can this empirical knowledge and close miking practice be systematically verified? In this work we aim to address this question based on an analytic methodology that employs techniques and metrics originating from the sound source separation evaluation field. In particular, we apply a quantitative analysis of the source separation capabilities of the close miking technique. The analysis is applied on a recording dataset obtained at multiple positions of a typical musical hall, multiple distances between the microphone and the sound source, multiple microphone types, and multiple level differences between the sound source and the ambient acoustic component. For all the above cases we calculate the Source to Interference Ratio (SIR) metric. The results obtained clearly demonstrate an optimum close-miking performance that matches the current empirical knowledge of professional audio recording.

http://www.aes.org/e-lib/browse.cfm?elib=18636

Paper (.pdf)
Updated: 02-12-2019 13:04 - Size: 171.9 KB

BibTeX record (.bib)
Updated: 02-12-2019 13:04 - Size: 396 B

Convolutional Recurrent Neural Networks for Bird Audio Detection [Conference]

Emre Çakir, Sharath Adavanne, Giambattista Parascandolo, Konstantinos Drossos, and Tuomas Virtanen, “Convolutional Recurrent Neural Networks for Bird Audio Detection,” in proceedings of the 25th European Signal Processing Conference (EUSIPCO), Aug. 28–Sep. 2, Kos, Greece, 2017

Bird sounds possess distinctive spectral structure which may exhibit small shifts in spectrum depending on the bird species and environmental conditions. In this paper, we propose using convolutional recurrent neural networks on the task of automated bird audio detection in real-life environments. In the proposed method, convolutional layers extract high dimensional, local frequency shift invariant features, while recurrent layers capture longer term dependencies between the features extracted from short time frames. This method achieves 88.5% Area Under ROC Curve (AUC) score on the unseen evaluation data and obtains the second place in the Bird Audio Detection challenge.

https://ieeexplore.ieee.org/document/8081508

Paper (.pdf)
Updated: 13-11-2019 00:56 - Size: 271.39 KB

BibTex record (.bib)
Updated: 01-02-2020 08:40 - Size: 393 B

Stacked Convolutional and Recurrent Neural Networks for Bird Audio Detection [Conference]

Sharath Adavanne, Konstantinos Drossos, Emre Çakir, and Tuomas Virtanen, “Stacked Convolutional and Recurrent Neural Networks for Bird Audio Detection,” in proceedings of the 25th European Signal Processing Conference (EUSIPCO), Aug. 28–Sep. 2, Kos, Greece, 2017

This paper studies the detection of bird calls in audio segments using stacked convolutional and recurrent neural networks. Data augmentation by blocks mixing and domain adaptation using a novel method of test mixing are proposed and evaluated in regard to making the method robust to unseen data. The contributions of two kinds of acoustic features (dominant frequency and log mel-band energy) and their combinations are studied in the context of bird audio detection. Our best achieved AUC measure on five cross-validations of the development data is 95.5% and 88.1% on the unseen evaluation data.

https://ieeexplore.ieee.org/document/8081505

Paper (.pdf)
Updated: 13-11-2019 00:59 - Size: 143.49 KB

BibTex record (.bib)
Updated: 01-02-2020 08:33 - Size: 457 B

Stacked Convolutional and Recurrent Neural Networks for Music Emotion Recognition [Conference]

Miroslav Malik, Sharath Adavanne, Konstantinos Drossos, Tuomas Virtanen, Dasa Ticha, and Roman Jarina, “Stacked Convolutional and Recurrent Neural Networks for Music Emotion Recognition”, in proceedings of the 14th Sound and Music Computing (SMC) conference, Jul. 5–8, Helsinki, Finland, 2017

This paper studies emotion recognition from musical tracks in the 2-dimensional valence-arousal (V-A) emotional space. We propose a method based on convolutional (CNN) and recurrent neural networks (RNN), having significantly fewer parameters compared with the state-of-the-art method for the same task. We utilize one CNN layer followed by two branches of RNNs trained separately for arousal and valence. The method was evaluated using the “MediaEval2015 emotion in music” dataset. We achieved an RMSE of 0.202 for arousal and 0.268 for valence, which is the best result reported on this dataset.

http://smc2017.aalto.fi/proceedings.html

Paper (.pdf)
Updated: 02-12-2019 13:07 - Size: 137.37 KB

BibTeX record (.bib)
Updated: 02-12-2019 13:07 - Size: 406 B

On the Impact of the Semantic Content of Sound Events in Emotion Elicitation [Journal]

Konstantinos Drossos, Maximos Kaliakatsos-Papakostas, Andreas Floros, and Tuomas Virtanen, “On the Impact of the Semantic Content of Sound Events in Emotion Elicitation,” Journal of Audio Engineering Society, Vol. 64, No. 7/8, pp. 525–532, 2016

Sound events are proven to have an impact on the emotions of the listener. Recent works on the field of emotion recognition from sound events show, on one hand, the possibility of automatic emotional information retrieval from sound events and, on the other hand, the need for deeper understanding of the significance of the sound events’ semantic content on listener’s affective state. In this work we present a first, to the best of authors’ knowledge, investigation of the relation between the semantic similarity of the sound events and the elicited emotion. For that cause we use two emotionally annotated sound datasets and the Wu-Palmer semantic similarity measure according to WordNet. Results indicate that the semantic content seems to have a limited role in the conformation of the listener’s affective states. On the contrary, when the semantic content is matched to specific areas in the Arousal-Valence space or also the source’s spatial position is taken into account, it is exhibited that the importance of the semantic content effect is higher, especially for the cases with medium to low valence and medium to high arousal or when the sound source is at the lateral positions of the listener’s head, respectively.

http://www.aes.org/e-lib/browse.cfm?elib=18338

Paper (.pdf)
Updated: 02-11-2019 12:29 - Size: 876.85 KB

BibTeX record (.bib)
Updated: 29-11-2019 11:18 - Size: 393 B

Deep Neural Networks for Dynamic Range Compression in Mastering Applications [Conference]

Stylianos Ioannis Mimilakis, Konstantinos Drossos, Tuomas Virtanen, and Gerald Schuller, “Deep Neural Networks for Dynamic Range Compression in Mastering Applications”, in proceedings of the 140th Audio Engineering Society (AES) Convention, Jul. 4–7, Paris, France, 2016

The process of audio mastering often, if not always, includes various audio signal processing techniques such as frequency equalization and dynamic range compression. With respect to the genre and style of the audio content, the parameters of these techniques are controlled by a mastering engineer, in order to process the original audio material. This operation relies on musical and perceptually pleasing facets of the perceived acoustic characteristics, transmitted from the audio material under the mastering process. Modeling such dynamic operations, which involve adaptation regarding the audio content, becomes vital in automated applications since it significantly affects the overall performance. In this work we present a system capable of modelling such behavior focusing on the automatic dynamic range compression. It predicts frequency coefficients that allow the dynamic range compression, via a trained deep neural network, and applies them to unmastered audio signal served as input. Both dynamic range compression and the prediction of the corresponding frequency coefficients take place inside the time-frequency domain, using magnitude spectra acquired from a critical band filter bank, similar to humans’ peripheral auditory system. Results from conducted listening tests, incorporating professional music producers and audio mastering engineers, demonstrate on average an equivalent performance compared to professionally mastered audio content. Improvements were also observed when compared to relevant and commercial software.

http://www.aes.org/e-lib/browse.cfm?elib=18237

Paper (.pdf)
Updated: 02-12-2019 13:02 - Size: 295.27 KB

BibTeX record (.bib)
Updated: 02-12-2019 13:02 - Size: 379 B

Affective Audio Synthesis for Sound Experience Enhancement [Book Chapter]

Konstantinos Drossos, Maximos Kaliakatsos-Papakostas, and Andreas Floros, “Affective Audio Synthesis for Sound Experience Enhancement”, Experimental Multimedia Systems for Interactivity and Strategic Innovation, I. Deliyannis, P. Kostagiolas (Eds), IGI-Global, 2016

With the advances of technology, multimedia tend to be a recurring and prominent component in almost all forms of communication. Although their content spans in various categories, there are two protuberant channels that are used for information conveyance, i.e. audio and visual. The former can transfer numerous content, ranging from low-level characteristics (e.g. spatial location of source and type of sound producing mechanism) to high and contextual (e.g. emotion). Additionally, recent results of published works depict the possibility for automated synthesis of sounds, e.g. music and sound events. Based on the above, in this chapter the authors propose the integration of emotion recognition from sound with automated synthesis techniques. Such a task will enhance, on one hand, the process of computer driven creation of sound content by adding an anthropocentric factor (i.e. emotion) and, on the other, the experience of the multimedia user by offering an extra constituent that will intensify the immersion and the overall user experience level.

Paper (.pdf)
Updated: 12-11-2019 23:55 - Size: 1.04 MB

BibTeX record (.bib)
Updated: 29-11-2019 11:19 - Size: 562 B

Evaluating the Impact of Sound Events’ Rhythm Characteristics to Listener’s Valence [Journal]

Konstantinos Drossos, Andreas Floros, and Katia Kermanidis, “Evaluating the Impact of Sound Events’ Rhythm Characteristics to Listener’s Valence”, Journal of Audio Engineering Society, Vol. 63, No. 3, pp. 139–153, 2015

While modern sound researchers generally focus on speech and music, mammalian hearing arose from the need to sense those events in the environment that produced sound waves. Such unorganized sound stimuli, referred to as Sound Events (SEs), can also produce an affective and emotional response. In this research, the investigators explore valence recognition of SEs utilizing rhythm-related acoustic cues. A well-known data set with emotionally annotated SEs was employed; various rhythm-related attributes were then extracted and several machine-learning experiments were conducted. The results portray that the rhythm of an SE can affect the listener’s valence to an extent and, combined with previous works on SEs, could lead to a comprehensive recognition of the rhythm’s effect on the emotional state of the listener.

http://www.aes.org/e-lib/browse.cfm?elib=17573

Paper (.pdf)
Updated: 02-11-2019 12:24 - Size: 685.04 KB

BibTeX record (.bib)
Updated: 29-11-2019 11:18 - Size: 380 B

Investigating the Impact of Sound Angular Position on the Listener Affective State [Journal]

Konstantinos Drossos, Andreas Floros, Andreas Giannakoulopoulos, and Nikolaos Kanellopoulos, “Investigating the Impact of Sound Angular Position on the Listener Affective State”, IEEE Transactions on Affective Computing, Vol. 6, No. 1, pp. 27–42, 2015

Emotion recognition from sound signals represents an emerging field of recent research. Although many existing works focus on emotion recognition from music, there seems to be a relative scarcity of research on emotion recognition from general sounds. One of the key characteristics of sound events is the sound source spatial position, i.e. the location of the source relative to the acoustic receiver. Existing studies that aim to investigate the relation of the latter source placement and the elicited emotions are limited to distance, front and back spatial localization and/or specific emotional categories. In this paper we analytically investigate the effect of the source angular position on the listener’s emotional state, modeled in the well-established valence/arousal affective space. Towards this aim, we have developed an annotated sound events dataset using binaural processed versions of the available International Affective Digitized Sound (IADS) sound events library. All subjective affective annotations were obtained using the Self Assessment Manikin (SAM) approach. Preliminary results obtained by processing these annotation scores are likely to indicate a systematic change in the listener affective state as the sound source angular position changes. This trend is more obvious when the sound source is located outside of the visible field of the listener.

https://ieeexplore.ieee.org/document/7010884

Paper (.pdf)
Updated: 02-11-2019 12:11 - Size: 919.79 KB

BibTeX record (.bib)
Updated: 29-11-2019 11:18 - Size: 408 B

Accessible Games for Blind Children, Empowered by Binaural Sound [Conference]

Konstantinos Drossos, Nikolaos Zormpas, George Giannakopoulos, and Andreas Floros, “Accessible Games for Blind Children, Empowered by Binaural Sound,” in proceedings of the 8th Pervasive Technologies Related to Assistive Environments (PETRA) Conference, Jul. 1–3, Corfu, Greece, 2015

Accessible games have been researched and developed for many years; however, blind people still have very limited access and knowledge of them. This can pose a serious limitation, especially for blind children, since in recent years electronic games have become one of the most common and widespread means of entertainment and socialization. For our implementation we use binaural technology which allows the player to hear and navigate the game space by adding localization information to the game sounds. With our implementation and user studies we provide insight on what constitutes an accessible game for blind people as well as a functional game engine for such games. The game engine developed allows the quick development of games for the visually impaired. Our work provides a good starting point for future developments on the field and, as the user studies show, was very well perceived by the visually impaired children that tried it.

https://dl.acm.org/citation.cfm?id=2769546

Paper (.pdf)
Updated: 02-12-2019 12:58 - Size: 285.54 KB

BibTeX record (.bib)
Updated: 02-12-2019 12:58 - Size: 663 B

A Loudness-based Adaptive Equalization Technique for Subjectively Improved Sound Reproduction [Conference]

Konstantinos Drossos, Andreas Floros, and Nikolaos Kanellopoulos, “A Loudness-based Adaptive Equalization Technique for Subjectively Improved Sound Reproduction,” in proceedings of the Audio Engineering Society (AES) 136th convention, Apr. 26–29, Berlin, Germany, 2014.

Sound equalization is a common approach for objectively or subjectively defining the reproduction level at specific frequency bands. It is also well-known that the human auditory system demonstrates an inner process of sound-weighting. Due to this, the perceived loudness changes with the frequency and the user-defined sound reproduction gain, resulting in a deviation between the intended and the perceived equalization scheme as the sound level changes. In this work we introduce a novel equalization approach that takes into account the above perceptual loudness effect in order to achieve subjectively constant equalization. A series of listening tests shows that the proposed equalization technique is an efficient and listener-preferred alternative for both professional and home audio reproduction applications.

http://www.aes.org/e-lib/online/browse.cfm?elib=17231

Paper (.pdf)
Updated: 02-12-2019 11:59 - Size: 388.1 KB

BibTeX record (.bib)
Updated: 02-12-2019 11:59 - Size: 380 B

A socially-intelligent multirobot service team for in-home monitoring [Conference]

Konstantinos Drossos, Andreas Floros, Stelios Potirakis, Nikolas-Alexander Tatlas, and Gurkan Tuna, “A socially-intelligent multirobot service team for in-home monitoring,” in proceedings of the 5th IEEE International Conference on Information, Intelligence, Systems and Applications (IISA), Jul. 7–9, Chania, Greece, 2014.

The objective of this study is to develop a socially-intelligent service team comprised of multiple robots with sophisticated sonic interaction capabilities that aims to transparently collaborate towards efficient and robust monitoring by close interaction. In the distributed scenario proposed in this study, the robots share any acoustic data extracted from the environment and act in-sync with the events occurring in their living environment in order to provide potential means for efficient monitoring and decision-making within a typical home enclosure. Although each robot acts as an individual recognizer using a novel emotionally-enriched word recognition system, the final decision is social in nature and is followed by all. Moreover, the social decision stage triggers actions that are algorithmically distributed among the robots' population and enhances the overall approach with the potential advantages of the team work within specific communities through collaboration.

https://ieeexplore.ieee.org/document/6878763

Paper (.pdf)
Updated: 02-12-2019 12:46 - Size: 285.12 KB

BibTeX record (.bib)
Updated: 02-12-2019 12:46 - Size: 440 B

BEADS: A Dataset of Binaural Emotionally Annotated Digital Sounds [Conference]

Konstantinos Drossos, Andreas Floros, and Andreas Giannakoulopoulos, “BEADS: A Dataset of Binaural Emotionally Annotated Digital Sounds,” in proceedings of the 5th IEEE International Conference on Information, Intelligence, Systems and Applications (IISA), Jul. 7–9, Chania, Greece, 2014.

Emotion recognition from generalized sounds is an interdisciplinary and emerging field of research. A vital requirement for this kind of investigation is the availability of ground truth datasets. Currently, there are 2 freely available datasets of emotionally annotated sounds, which, however, do not include sound events (SEs) with manifestation of the spatial location of the source. The latter is an inherent natural component of SEs, since all sound sources in real-world conditions are physically located and perceived somewhere in the listener's surrounding space. In this work we present a novel emotionally annotated sounds dataset consisting of 32 SEs that are spatially rendered using appropriate binaural processing. All SEs in the dataset are available in 5 spatial positions corresponding to source/receiver angles equal to 0, 45, 90, 135 and 180 degrees. We have used the IADS dataset as the initial collection of SEs prior to binaural processing. The annotation measures obtained for the novel binaural dataset demonstrate a significant accordance with the existing IADS dataset, while small ratings declinations illustrate a perceptual adaptation imposed by the more realistic SEs spatial representation.

https://ieeexplore.ieee.org/document/6878749

Paper (.pdf)
Updated: 02-12-2019 12:43 - Size: 617.35 KB

BibTeX record (.bib)
Updated: 02-12-2019 12:43 - Size: 393 B

BEADS annotations.
Updated: 02-12-2019 12:43 - Size: 21.35 KB

Swarm Lake: A Game of Swarm Intelligence, Human Interaction and Collaborative Music Composition [Conference]

Maximos Kaliakatsos-Papakostas, Andreas Floros, Konstantinos Drossos, Konstantinos Koukoudis, Manolis Kuzalas, and Achileas Kalantzis, “Swarm Lake: A Game of Swarm Intelligence, Human Interaction and Collaborative Music Composition,” in proceedings of the Joint Conference ICMC/SMC 2014, Sep. 14–20, Athens, Greece, 2014.

In this work we aim to combine a game platform with the concept of collaborative music synthesis. We use bio-inspired intelligence for developing a world - the Lake - where multiple tribes of artificial, autonomous agents live within, having survival as their ultimate goal. The tribes exhibit primitive social swarm-based behavior and intelligence, which is used for taking actions that will potentially allow to dominate the game world. Tribes’ populations also demonstrate a number of physical properties that restrict their ability to act illimitably. Multiuser intervention is employed in parallel, affecting the automated decisions and the physical parameters of the tribes, thus infusing the gaming orientation of the application context. Finally, sound synthesis is achieved through a complex mapping scheme established between the events occurring in the Lake and the rhythmic, harmonic and dynamic-range parameters of an advanced, collaborative sound composition engine. This complex mapping scheme allows the production of interesting and complicated sonic patterns that follow the performance evolution in both objective and conceptual levels. The overall synthesis process is controlled by the conductor, a virtual entity that determines the synthesis evolution in a way that is very similar to directing an ensemble performance in real world.

https://zenodo.org/record/850571

Paper (.pdf)
Updated: 02-12-2019 12:53 - Size: 693.2 KB

BibTeX record (.bib)
Updated: 02-12-2019 12:53 - Size: 456 B

Automated Tonal Balance Enhancement for Audio Mastering Applications [Conference]

Stylianos Ioannis Mimilakis, Konstantinos Drossos, Andreas Floros, and Dionisios Katerelos, “Automated Tonal Balance Enhancement for Audio Mastering Applications”, in proceedings of the 134th Audio Engineering Society Convention, May 4–7, Rome, Italy, 2013

Modern audio mastering procedures are involved with the selective enhancement or attenuation of specific frequency bands. The main reason is the tonal enhancement of the original / unmastered audio material. The aforementioned process is mostly based on the musical information and the mode of the audio material. This information can be retrieved from a listening procedure of the original stimuli, or the correspondent musical key notes. The current work presents an adaptive and automated equalization system that performs the aforementioned mastering procedure, based on a novel method of fundamental frequency tracking. In addition to this, the overall system is being evaluated with objective PEAQ analysis and subjective listening tests in real mastering audio conditions.

http://www.aes.org/e-lib/browse.cfm?elib=16737

Paper (.pdf)
Updated: 02-12-2019 11:34 - Size: 563.17 KB

BibTeX record (.bib)
Updated: 02-12-2019 11:34 - Size: 317 B

Gestural User Interface for Audio Multitrack Real-time Stereo Mixing [Conference]

Konstantinos Drossos, Konstantinos Koukoudis, and Andreas Floros, “Gestural User Interface for Audio Multitrack Real-time Stereo Mixing,” in proceedings of the 8th Conference on Interaction with Sound - Audio Mostly 2013, Sep. 18–20, Piteå, Sweden, 2013

Sound mixing is a well-established task applied (directly or indirectly) in many fields of music and sound production. For example, in the case of classical music orchestras, their conductors perform sound mixing by specifying the reproduction gain of specific groups of musical instruments or of the entire orchestra. Moreover, modern sound artists and performers also employ sound mixing when they compose music or improvise in real-time. In this work a system is presented that incorporates a gestural interface for real-time multitrack sound mixing. The proposed gestural sound mixing control scheme is implemented on an open hardware micro-controller board, using common sensor modules. The gestures employed are as close as possible to the ones particularly used by the orchestra conductors. The system overall performance is also evaluated in terms of the achieved user experience through subjective tests.

https://dl.acm.org/citation.cfm?id=2544123

Paper (.pdf)
Updated: 02-12-2019 11:56 - Size: 3.27 MB

BibTeX record (.bib)
Updated: 02-12-2019 11:56 - Size: 581 B

Investigating Auditory Human-Machine Interaction: Analysis and Classification of Sounds Commonly Used by Consumer Devices [Conference]

Konstantinos Drossos, Rigas Kotsakis, Panos Pappas, George Kalliris, and Andreas Floros, “Investigating Auditory Human-Machine Interaction: Analysis and Classification of Sounds Commonly Used by Consumer Devices”, in proceedings of the 134th Audio Engineering Society Convention, May 4–7, Rome, Italy, 2013

Many common consumer devices use a short sound indication for declaring various modes of their functionality, such as the start and the end of their operation. This is likely to result in an intuitive auditory human-machine interaction, imputing a semantic content to the sounds used. In this paper we investigate sound patterns mapped to "Start" and "End" of operation manifestations and explore whether the perception of such semantics is based on users’ prior auditory training or on sound patterns that naturally convey appropriate information. To this aim, listening and machine learning tests were conducted. The obtained results indicate a strong relation between acoustic cues and semantics, along with no need for prior knowledge for message conveyance.

http://www.aes.org/e-lib/browse.cfm?elib=16713

Paper (.pdf)
Updated: 02-12-2019 11:37 - Size: 438.26 KB

BibTeX record (.bib)
Updated: 02-12-2019 11:37 - Size: 431 B

Sound Events and Emotions: Investigating the Relation of Rhythmic Characteristics and Arousal [Conference]

Konstantinos Drossos, Rigas Kotsakis, George Kalliris, and Andreas Floros, “Sound Events and Emotions: Investigating the Relation of Rhythmic Characteristics and Arousal”, in proceedings of the 4th IEEE International Conference on Information, Intelligence, Systems and Applications (IISA 2013), Jul. 10–12, Piraeus, Greece, 2013

A variety of recent research in Audio Emotion Recognition (AER) reports high performance and retrieval accuracy results. However, in most works music is considered as the original sound content that conveys the identified emotions. Among the characteristics of music, rhythm-related acoustic cues are found to be a fundamental means of conveying emotions. Although music is an important aspect of everyday life, there are numerous non-linguistic and nonmusical sounds surrounding humans, generally defined as sound events (SEs). Despite this enormous impact of SEs on humans, a scarcity of investigations regarding AER from SEs is observed. There are only a few recent investigations concerned with SEs and AER, presenting a semantic connection between the former and the listener's triggered emotion. In this work we analytically investigate the connection of rhythm-related characteristics of a wide range of common SEs with the arousal of the listener using sound events with semantic content. To this aim, several feature evaluation and classification tasks are conducted using different ranking and classification algorithms. High accuracy results are obtained, demonstrating a significant relation of SEs rhythmic characteristics to the elicited arousal.

https://ieeexplore.ieee.org/document/6623709

Paper (.pdf)
Updated: 02-12-2019 11:51 - Size: 173.64 KB

BibTeX record (.bib)
Updated: 02-12-2019 11:51 - Size: 382 B

Affective Acoustic Ecology: Towards Emotionally Enhanced Sound Events [Conference]

Konstantinos Drossos, Andreas Floros, and Nikolaos Kanellopoulos, “Affective Acoustic Ecology: Towards Emotionally Enhanced Sound Events”, in proceedings of the 7th Conference on Interaction with Sound - Audio Mostly 2012, Sep. 26 – 28, Corfu, Greece, 2012

Sound events can carry multiple information, related to the sound source and to ambient environment. However, it is well-known that sound evokes emotions, a fact that is verified by works in the disciplines of Music Emotion Recognition and Music Information Retrieval that focused on the impact of music to emotions. In this work we introduce the concept of affective acoustic ecology that extends the above relation to the general concept of sound events. Towards this aim, we define sound event as a novel audio structure with multiple components. We further investigate the application of existing emotion models employed for music affective analysis to sonic, non-musical, content. The obtained results indicate that although such application is feasible, no significant trends and classification outcomes are observed that would allow the definition of an analytic relation between the technical characteristics of a sound event waveform and raised emotions.

https://dl.acm.org/citation.cfm?id=2371474

Paper (.pdf)
Updated: 02-12-2019 11:23 - Size: 574.5 KB

BibTeX record (.bib)
Updated: 02-12-2019 11:23 - Size: 693 B

Emergency Voice/Stress - level Combined Recognition for Intelligent House Applications [Conference]

Konstantinos Drossos, Andreas Floros, Kyriakos Agavanakis, Nikolas-Alexander Tatlas and Nikolaos Kanellopoulos, “Emergency Voice/Stress - level Combined Recognition for Intelligent House Applications”, in proceedings of the 132nd Audio Engineering Society Convention, Apr. 26–29, Budapest, Hungary, 2012

Legacy technologies for word recognition can benefit from emerging affective voice retrieval, potentially leading to intelligent applications for smart houses enhanced with new features. In this work we introduce the implementation of a system capable of reacting to common spoken words, taking into account the estimated vocal stress level, thus allowing the realization of a prioritized, affective aural interaction path. Upon the successful word recognition and the corresponding stress level estimation, the system triggers particular affective-prioritized actions, defined within the application scope of an intelligent home environment. Application results show that the established affective interaction path significantly improves the ambient intelligence provided by an affective vocal sensor that can be easily integrated with any sensor-based home monitoring system.

http://www.aes.org/e-lib/browse.cfm?elib=16253

Paper (.pdf)
Updated: 02-12-2019 11:12 - Size: 606.54 KB

BibTeX record (.bib)
Updated: 02-12-2019 11:12 - Size: 431 B

iReflectors – Smart Acoustical Composite Reflectors [Conference]

Dionisios Katerelos, Konstantinos Drossos, Anastasios Kokkinos, and Stylianos Ioannis Mimilakis, “iReflectors - Intelligent Reflectors from Composite Materials”, in proceedings of the 6th Greek National Conference Acoustics 2012, Oct. 8–10, Corfu, Greece, 2012

The use of reflectors for optimal sound diffusion is a major issue in Room Acoustics. Up to now, the applied reflectors have been rigid, with a fixed shape, and made of conventional materials. The present work studies the possibility of replacing the conventional reflectors with new ones manufactured from composite materials. The aim is to design flexible “intelligent” reflectors that will adapt their shape depending on the particular acoustical needs of a room. This change is planned to be actuated using embedded shape memory alloy (SMA) wires. The adaptation process will be controlled automatically by an electronic system. In order to control damage initiation and growth within the composite panel, an optical fibre network will be applied.

https://conferences.helina.gr/2012/en/

Paper (.pdf)
Updated: 02-12-2019 11:29 - Size: 843.49 KB

BibTeX record (.bib)
Updated: 02-12-2019 11:29 - Size: 369 B

Smart microphone sensor system platform [Conference]

Elias Kokkinis, Konstantinos Drossos, Nikolas-Alexander Tatlas, Andreas Floros, Alexandros Tsilfidis and Kyriakos Agavanakis, “Smart microphone sensor system platform”, in proceedings of the 132nd Audio Engineering Society Convention, Apr. 26–29, Budapest, Hungary, 2012

A platform for a flexible, smart microphone system using available hardware components is presented. Three subsystems are employed, specifically: (a) a set of digital MEMS microphones, with a one-bit serial output; (b) a preprocessing/digital-to-digital converter; and (c) a CPU/DSP-based embedded system with I2S connectivity. Basic preprocessing functions, such as noise gating and filtering, can be performed in the preprocessing stage, while application-specific algorithms such as word spotting, beam-forming, and reverberation suppression can be handled by the embedded system. Widely used high-level operating systems are supported, including drivers for a number of peripheral devices. Finally, an employment scenario for a wireless home automation speech activated front-end sensor system using the platform is analyzed.

http://www.aes.org/e-lib/browse.cfm?elib=16604

Paper (.pdf)
Updated: 02-12-2019 11:09 - Size: 734.14 KB

BibTeX record (.bib)
Updated: 02-12-2019 11:09 - Size: 388 B

Stereo Goes Mobile: Spatial Enhancement for Short-distance Loudspeaker Setups [Conference]

Konstantinos Drossos, Stylianos Ioannis Mimilakis, Andreas Floros, and Nikolaos Kanellopoulos, “Stereo Goes Mobile: Spatial Enhancement for Short-distance Loudspeaker Setups”, in proceedings of the 8th International Conference on Intelligent Information Hiding and Multimedia Signal Processing (IIHMSP), Jul. 18–20, Piraeus, Greece, 2012

Modern mobile, hand-held devices offer enhanced capabilities for video and sound reproduction. Nevertheless, major restrictions imposed by their limited size render them inconvenient for headset-free stereo sound reproduction, since the corresponding short-distance loudspeaker placement physically narrows the perceived stereo sound localization potential. In this work, we aim at evaluating a spatial enhancement technique for small-size mobile devices. This technique extracts the original panning information from an original stereo recording and spatially extends it using appropriate binaural rendering. A sequence of subjective tests performed shows that the derived spatial perceptual impression is significantly improved in all test cases considered, rendering the proposed technique an attractive approach towards headset-free mobile audio reproduction.

https://ieeexplore.ieee.org/document/6274275

Paper (.pdf)
Updated: 02-12-2019 11:17 - Size: 164.88 KB

BibTeX record (.bib)
Updated: 02-12-2019 11:17 - Size: 476 B

Emotional Control and Visual Representation Using Advanced Audiovisual Interaction [Journal]

Vassilis Psarras, Andreas Floros, Konstantinos Drossos, and Marianne Strapatsakis, “Emotional Control and Visual Representation Using Advanced Audiovisual Interaction”, International Journal of Arts and Technology, Vol. 4 (4), 2011, pp. 480-498

Modern interactive means combined with new digital media processing and representation technologies can provide a robust framework for enhancing user experience in multimedia entertainment systems and audiovisual artistic installations with non-traditional interaction/feedback paths based on user affective state. In this work, the ‘Elevator’ interactive audiovisual platform prototype is presented, which aims to provide a framework for signalling and expressing human behaviour related to emotions (such as anger) and finally produce a visual outcome of this behaviour, defined here as the emotional ‘thumbnail’ of the user. Optimised, real-time audio signal processing techniques are employed for monitoring the achieved anger-like behaviour, while the emotional elevation is attempted using appropriately selected combined audio/visual content reproduced using state-of-the-art audiovisual playback technologies that allow the creation of a realistic immersive audiovisual environment. The demonstration of the proposed prototype has shown that affective interaction is possible, allowing the further development of relative artistic and technological applications.

https://www.inderscience.com/info/inarticle.php?artid=43446

Paper (.pdf)
Updated: 02-11-2019 12:02 - Size: 950.71 KB

BibTeX record (.bib)
Updated: 29-11-2019 11:18 - Size: 354 B

Binaural Mixing Using Gestural Control Interaction [Conference]

Nikolas Grigoriou, Andreas Floros, and Konstantinos Drossos, “Binaural Mixing Using Gestural Control Interaction”, in proceedings of the 5th Conference on Interaction with Sound - Audio Mostly 2010, Sep. 15–17, Piteå, Sweden, 2010

In this work, a novel audio binaural mixing platform is presented which employs advanced gesture-based interaction techniques for controlling the mixing parameters. State-of-the-art binaural technology algorithms are used for producing the final two-channel binaural signal. These algorithms are optimized for real-time operation and are able to manipulate high-quality audio (typically 24-bit / 96 kHz) for an arbitrary number of fixed-position or moving sound sources in closed acoustic enclosures. Simple gestural rules are employed, which aim to provide the complete functionality required for the mixing process, using low-cost equipment. It is shown that the proposed platform can be efficiently used for general audio mixing / mastering purposes, providing an attractive alternative to legacy hardware control designs and software-based mixing user interfaces.

https://dl.acm.org/citation.cfm?id=1859803

Paper (.pdf)
Updated: 02-12-2019 10:57 - Size: 1.13 MB

BibTeX record (.bib)
Updated: 02-12-2019 10:57 - Size: 649 B

Towards an Interactive e-Learning System Based on Emotion and Affective Cognition [Conference]

Panagiotis Vlamos, Andreas Floros, Michail Giannakos, and Konstantinos Drossos, “Towards an Interactive e-Learning System Based on Emotion and Affective Cognition”, in proceedings of the International Conference on Information Communication Technologies and Education (ICICTE), Jul. 8–10, Corfu, Greece, 2012, pp. 367–376

In order to promote more dynamic and flexible communication between the learner and the system, we present the structure of a new interactive e-learning system which implements recognition of emotion and level of cognition. The system takes as inputs the emotional and cognitive state of the user, re-organises the content, and adjusts the flow of the course. Our concept aims to increase the learning efficiency of intelligent tutoring systems by using a combination of characteristics, such as content customization and recognition of the user’s emotion, and adapting all these features into a learner-centered educational system.

Paper (.pdf)
Updated: 02-12-2019 10:50 - Size: 222.48 KB

BibTeX record (.bib)
Updated: 02-12-2019 10:50 - Size: 349 B

On The Adsorption - Desorption Relaxation Time Of Carbon In Very Narrow Ducts [Conference]

Timothy Mellow, Olga Umnova, Konstantinos Drossos, Keith Holland, Andrew Flewitt and Leo Kärkkäinen, “On The Adsorption - Desorption Relaxation Time Of Carbon In Very Narrow Ducts”, in proceedings of the Acoustics 08 conference, Jun. 29–Jul. 04, Paris, France, 2008

Loudspeakers generally have boxes to prevent rear wave cancellation at low frequencies. However, the stiffness of the air in a small box reduces the diaphragm’s excursion at low frequencies. Hence the box size is generally a compromise between low frequency performance and practicality. Activated carbon has been found to increase the apparent size of a given box through adsorption of the air molecules when the pressure increases and likewise desorption when it decreases. However, the exact viscous effects in the granular structure are difficult to model. Thus it is impossible to determine the high frequency limit due to the natural adsorption/desorption relaxation time in the absence of viscous losses. In this study, a tube model is presented which takes into account viscous and thermal losses with boundary slip together with adsorption. Impedance measurements are performed on an array of 12 million holes, each 2 micrometers in diameter, etched in a 0.25 mm thick silicon wafer so that the viscous and thermal losses can be verified against the model without adsorption. Impedance measurements are then performed on an array of holes coated with graphite in order to create an activated carbon-like structure, thus enabling the adsorption/desorption relaxation time to be evaluated.

http://www.conforg.fr/acoustics2008/cdrom/data/articles/000871.pdf

Paper (.pdf)
Updated: 01-11-2019 20:28 - Size: 484.5 KB

BibTeX record (.bib)
Updated: 29-11-2019 11:19 - Size: 500 B

Konstantinos Drossos®
