Automated audio captioning is a multi-modal task in which a system receives an audio sample as input and generates a textual description (a caption) of the information in the audio. The system must not only detect sound events and acoustic scenes, but also learn the spatio-temporal relationships between those events and scenes. Though only recently introduced, audio captioning has received considerable attention and was featured as a task in the Detection and Classification of Acoustic Scenes and Events 2020 (DCASE 2020) competition.

This thesis proposes an architecture for audio captioning called WaveTransformer. WaveTransformer employs an encoder-decoder scheme, where the encoder exploits temporal and local time-frequency features, and the decoder aligns the encoder output with the text embeddings. The encoder is a fusion of three learnable processes: a WaveNet-based model that extracts temporal information, a depth-wise separable convolution that exploits spatial (time-frequency) information, and a convolution followed by fully connected layers that merges the outputs of the first two processes. The decoder employs the Transformer, a state-of-the-art architecture in language modeling. The thesis also examines different post-processing techniques that improve the caption generation procedure and yield better results.
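To illustrate why a depth-wise separable convolution is an economical way to exploit local time-frequency patterns, the sketch below compares its parameter count with that of a standard 2D convolution. The channel counts and kernel size are hypothetical values chosen for illustration, not the settings used in WaveTransformer, and bias terms are ignored.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k 2D convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a depth-wise separable convolution (bias ignored)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes the channels
    return depthwise + pointwise


# Hypothetical layer sizes, for illustration only.
C_IN, C_OUT, K = 64, 128, 5

standard = standard_conv_params(C_IN, C_OUT, K)
separable = depthwise_separable_conv_params(C_IN, C_OUT, K)
ratio = separable / standard  # equals 1 / C_OUT + 1 / K**2

print(f"standard: {standard}, separable: {separable}, ratio: {ratio:.4f}")
```

Under these assumed sizes the separable layer needs roughly one twentieth of the weights of the standard layer, which is the usual motivation for the factorization.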

The proposed model is trained and evaluated on the publicly available splits of the Clotho dataset: the development and evaluation splits, which together amount to 4981 audio samples and 24 905 captions (5 captions for each sample). The generated captions are rated with ten COCO Caption metrics, including BLEU(1-4), ROUGE(L), METEOR, SPICE, CIDEr, and SPIDEr, of which SPIDEr is the primary metric.
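As a brief aid to reading the figures above, the sketch below checks the caption count and shows how SPIDEr is commonly computed, namely as the arithmetic mean of the SPICE and CIDEr scores. The example scores passed to the function are hypothetical and are not results from this thesis.

```python
def spider(spice: float, cider: float) -> float:
    """SPIDEr as the arithmetic mean of SPICE and CIDEr (same scale for both)."""
    return 0.5 * (spice + cider)


# Dataset figures quoted in the text.
NUM_AUDIO_SAMPLES = 4981
CAPTIONS_PER_SAMPLE = 5
total_captions = NUM_AUDIO_SAMPLES * CAPTIONS_PER_SAMPLE

# Hypothetical metric values, for illustration only.
example_spider = spider(spice=0.10, cider=0.30)

print(total_captions)
print(round(example_spider, 4))
```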

Experimental results show that both temporal and spatial features are essential to the quality of the generated captions. In terms of SPIDEr, WaveTransformer is also one of the highest-scoring architectures in audio captioning, with SPIDEr scores of 17.27 (without post-processing) and 18.16 (with post-processing).

Associated publications

  1. A. Tran, K. Drossos, and T. Virtanen, “WaveTransformer: An Architecture for Audio Captioning Based on Learning Temporal and Time-Frequency Information,” in Proceedings of the European Signal Processing Conference (EUSIPCO), Aug. 23–27, Dublin, Ireland, 2021

