Publications

Clotho-AQA: A crowdsourced dataset for audio question answering

S. Lipping, P. Sudarsanam, K. Drossos, and T. Virtanen, "Clotho-AQA: A crowdsourced dataset for audio question answering," in Proceedings of the 30th European Signal Processing Conference (EUSIPCO), pp. 1140-1144, Belgrade, Serbia, 2022

Audio question answering (AQA) is a multimodal translation task where a system analyzes an audio signal and a natural language question to generate a desirable natural language answer. In this paper, we introduce Clotho-AQA, a dataset for audio question answering consisting of 1991 audio files, each between 15 and 30 seconds in duration, selected from the Clotho dataset. For each audio file, we collect six different questions and corresponding answers by crowdsourcing using Amazon Mechanical Turk. The questions and answers are produced by different annotators. Of the six questions for each audio file, two are designed to have ‘yes’ as the answer and two to have ‘no’, while the remaining two have other single-word answers. For each question, we collect answers from three different annotators. We also present two baseline experiments to describe the usage of our dataset for the AQA task: a long short-term memory (LSTM) based multimodal binary classifier for ‘yes’ or ‘no’ type answers and an LSTM-based multimodal multi-class classifier for 828 single-word answers. The binary classifier achieved an accuracy of 62.7% and the multi-class classifier achieved a top-1 accuracy of 54.2% and a top-5 accuracy of 93.7%. The Clotho-AQA dataset is freely available online at https://zenodo.org/record/6473207.
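
The top-1 and top-5 figures reported for the multi-class baseline are top-k accuracies. As a minimal sketch of how such a metric is commonly computed (the function name and the toy scores are illustrative assumptions, not the paper's evaluation code or Clotho-AQA data):

```python
def top_k_accuracy(scores, targets, k):
    """Fraction of examples whose target class is among the k highest-scoring classes."""
    hits = 0
    for row, target in zip(scores, targets):
        # Class indices sorted by descending score; ties keep the lower index first.
        ranked = sorted(range(len(row)), key=lambda i: -row[i])
        if target in ranked[:k]:
            hits += 1
    return hits / len(targets)


# Toy scores over 4 hypothetical answer classes (made up for illustration).
scores = [
    [0.1, 0.6, 0.2, 0.1],
    [0.5, 0.1, 0.3, 0.1],
    [0.2, 0.2, 0.5, 0.1],
]
targets = [1, 2, 3]

top1 = top_k_accuracy(scores, targets, k=1)
top2 = top_k_accuracy(scores, targets, k=2)
```

With k equal to the number of classes the metric is always 1.0, which is why top-5 over 828 answer classes is still an informative number.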

https://ieeexplore.ieee.org/document/9909680

Paper (.pdf)
Updated: 10-09-2025 12:46 - Size: 313.61 KB

BibTeX record (.bib)
Updated: 10-09-2025 12:46 - Size: 309 B

Domestic activity clustering from audio via depthwise separable convolutional autoencoder network

Y. Li, W. Cao, K. Drossos, and T. Virtanen, "Domestic Activity Clustering from Audio via Depthwise Separable Convolutional Autoencoder Network," in Proceedings of the IEEE 24th International Workshop on Multimedia Signal Processing (MMSP), Shanghai, China, pp. 1-6, 2022

Automatic estimation of domestic activities from audio can be used to solve many problems, such as reducing the labor cost of caring for elderly people. This study focuses on solving the problem of domestic activity clustering from audio. The goal of domestic activity clustering is to group audio clips that belong to the same category of domestic activity into one cluster in an unsupervised way. In this paper, we propose a method of domestic activity clustering using a depthwise separable convolutional autoencoder network. In the proposed method, initial embeddings are learned by the depthwise separable convolutional autoencoder, and a clustering-oriented loss is designed to jointly optimize embedding refinement and cluster assignment. Different methods are evaluated on a public dataset (a derivative of the SINS dataset) used in the challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) in 2018. Our method obtains a normalized mutual information (NMI) score of 54.46% and a clustering accuracy (CA) score of 63.64%, outperforming state-of-the-art methods in terms of NMI and CA. In addition, both the computational complexity and the memory requirement of our method are lower than those of previous deep-model-based methods.
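
As a rough illustration of why depthwise separable convolutions tend to reduce memory requirements, the sketch below compares weight counts for a standard 2-D convolution and a depthwise separable one. The layer sizes (3x3 kernel, 64 input channels, 128 output channels) are hypothetical and not taken from the paper, and biases are ignored.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard 2-D convolution (biases ignored)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    """Weights in a depthwise conv followed by a 1x1 pointwise conv (biases ignored)."""
    depthwise = kernel * kernel * c_in  # one k x k filter per input channel
    pointwise = c_in * c_out            # 1x1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
kernel, c_in, c_out = 3, 64, 128
standard = standard_conv_params(kernel, c_in, c_out)
separable = depthwise_separable_params(kernel, c_in, c_out)
ratio = standard / separable  # how many times fewer weights the separable layer has
```

For these assumed sizes the separable layer needs 8768 weights against 73728 for the standard layer, roughly an 8.4-fold reduction.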

https://ieeexplore.ieee.org/document/9949512

Paper (.pdf)
Updated: 10-09-2025 12:51 - Size: 1.42 MB

BibTeX record (.bib)
Updated: 10-09-2025 12:51 - Size: 341 B

Unsupervised Audio-Caption Aligning Learns Correspondences between Individual Sound Events and Textual Phrases

H. Xie, O. Räsänen, K. Drossos, and T. Virtanen, "Unsupervised Audio-Caption Aligning Learns Correspondences between Individual Sound Events and Textual Phrases," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 22-27, Singapore, Singapore, 2022

We investigate unsupervised learning of correspondences between sound events and textual phrases through aligning audio clips with textual captions describing the content of a whole audio clip. We align originally unaligned and unannotated audio clips and their captions by scoring the similarities between audio frames and words, as encoded by modality-specific encoders, and use a ranking-loss criterion to optimize the model. After training, we obtain clip-caption similarity by averaging frame-word similarities and estimate event-phrase correspondences by calculating frame-phrase similarities. We evaluate the method with two cross-modal tasks: audio-caption retrieval and phrase-based sound event detection (SED). Experimental results show that the proposed method can globally associate audio clips with captions as well as locally learn correspondences between individual sound events and textual phrases in an unsupervised manner.
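
The scoring described above can be sketched as follows. This is a minimal illustration under stated assumptions: the dot product as the frame-word similarity, averaging over a phrase's words to get a frame-phrase score, and all function names and toy embeddings are assumptions made for this example, not details confirmed by the abstract.

```python
def dot(u, v):
    """Dot product of two equal-length vectors (assumed similarity measure)."""
    return sum(a * b for a, b in zip(u, v))


def frame_word_similarities(frames, words):
    """Similarity matrix with one row per audio frame and one column per word."""
    return [[dot(f, w) for w in words] for f in frames]


def clip_caption_similarity(frames, words):
    """Clip-caption score: the average of all frame-word similarities."""
    sims = frame_word_similarities(frames, words)
    total = sum(sum(row) for row in sims)
    return total / (len(frames) * len(words))


def frame_phrase_similarities(frames, words, phrase_indices):
    """Per-frame score for a phrase, assumed here to be the mean over its words."""
    sims = frame_word_similarities(frames, words)
    return [sum(row[j] for j in phrase_indices) / len(phrase_indices) for row in sims]


# Toy 2-D embeddings standing in for encoder outputs (made up for illustration).
frames = [[1.0, 0.0], [0.0, 1.0]]
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

clip_score = clip_caption_similarity(frames, words)
phrase_scores = frame_phrase_similarities(frames, words, phrase_indices=[0, 2])
```

Thresholding the per-frame phrase scores over time would give a simple form of phrase-based sound event detection, although the abstract does not state how the detections are derived.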

https://arxiv.org/abs/2110.02939

Paper (.pdf)
Updated: 15-03-2022 09:14 - Size: 735.95 KB

BibTeX record

Konstantinos Drossos®
