Abstract

Audio captioning is a novel machine learning task that involves generating a textual description of an audio signal. For example, an audio captioning method must be able to generate descriptions such as “two people talking about football” or “college clock striking” from the corresponding audio signals. Audio captioning is one of the tasks in the Detection and Classification of Acoustic Scenes and Events 2020 (DCASE2020) challenge. Most audio captioning methods use the encoder-decoder deep neural network architecture as a function that maps the features extracted from the input audio sequence to the output caption. However, the output caption is considerably shorter than the input audio sequence, for example, 10 words versus 2000 audio feature vectors. This thesis reports an attempt to take advantage of this difference in length by employing temporal sub-sampling in the encoder-decoder neural network. The method is evaluated using the Clotho audio captioning dataset and the DCASE2020 evaluation metrics. Experimental results show that temporal sub-sampling of the audio feature sequence improves all considered metrics, as well as the memory and time complexity of training and inference.
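As a rough illustration of the idea, the sketch below shows a recurrent encoder that keeps only every second time step of its hidden sequence between layers, so that the sequence passed on to the decoder is much shorter than the input. This is a minimal PyTorch sketch; the layer sizes, sub-sampling factor, and module names are assumptions made for illustration and are not the exact configuration used in the thesis.

# Illustrative sketch only: a recurrent encoder that sub-samples its hidden
# sequence in time between layers. Sizes and the factor are assumed values.
import torch
import torch.nn as nn

class SubsamplingEncoder(nn.Module):
    def __init__(self, feat_dim: int = 64, hidden: int = 256, factor: int = 2):
        super().__init__()
        self.factor = factor  # keep every `factor`-th time step between layers
        self.gru1 = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.gru2 = nn.GRU(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.gru3 = nn.GRU(2 * hidden, hidden, batch_first=True, bidirectional=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feat_dim), e.g. ~2000 audio feature vectors per clip
        h, _ = self.gru1(x)
        h = h[:, ::self.factor, :]   # temporal sub-sampling after layer 1
        h, _ = self.gru2(h)
        h = h[:, ::self.factor, :]   # temporal sub-sampling after layer 2
        h, _ = self.gru3(h)
        return h                     # shortened sequence handed to the decoder

if __name__ == "__main__":
    enc = SubsamplingEncoder()
    out = enc(torch.randn(4, 2000, 64))
    print(out.shape)  # torch.Size([4, 500, 512]) after two /2 sub-samplings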

Associated publications

  1. K. Nguyen, K. Drossos, and T. Virtanen, "Temporal Sub-sampling of Audio Feature Sequences for Automated Audio Captioning," in Proceedings of the International Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), Tokyo, Japan (fully virtual), Nov. 2-3, 2020.


Online document

https://trepo.tuni.fi/handle/10024/123920