Publications

2021

Evaluating Off-the-Shelf Machine Listening and Natural Language Models for Automated Audio Captioning [Conference]

B. Weck, X. Favory, K. Drossos, and X. Serra, "Evaluating Off-the-Shelf Machine Listening and Natural Language Models for Automated Audio Captioning," in Proceedings of the 6th Detection and Classification of Acoustic Scenes and Events (DCASE) Workshop, pp. 60-64, Barcelona, Spain, 2021

Automated audio captioning (AAC) is the task of automatically generating textual descriptions for general audio signals. A captioning system has to identify various information from the input signal and express it with natural language. Existing works mainly focus on investigating new methods and try to improve their performance measured on existing datasets. Having attracted attention only recently, very few works on AAC study the performance of existing pre-trained audio and natural language processing resources. In this paper, we evaluate the performance of off-the-shelf models with a Transformer-based captioning approach. We utilize the freely available Clotho dataset to compare four different pre-trained machine listening models, four word embedding models, and their combinations in many different settings. Our evaluation suggests that YAMNet combined with BERT embeddings produces the best captions. Moreover, in general, fine-tuning pre-trained word embeddings can lead to better performance. Finally, we show that sequences of audio embeddings can be processed using a Transformer encoder to produce higher-quality captions.

https://arxiv.org/abs/2110.07410

Konstantinos Drossos®
