Creating large multimodal datasets for machine learning tasks can be difficult. Annotating large amounts of data is costly and time-consuming when participants have to be found and hired directly. This thesis outlines a method for gathering multimodal annotations with the crowdsourcing platform Amazon Mechanical Turk (AMT). Specifically, the method annotates each audio file with five captions and, for each caption, subjective scores for description accuracy and fluency. The durations of the audio files used in this thesis are uniformly distributed between 15 and 30 seconds. The method divides the annotation process into three separate tasks: audio description, description editing, and description scoring. The editing and scoring tasks were introduced to correct errors made in the preceding tasks.

The inputs for the audio description task are the audio files to be annotated. The inputs for the description editing task are the descriptions produced in the audio description task, and the inputs for the description scoring task are the descriptions from both previous tasks, that is, the original and the edited descriptions. Each audio file is described five times, each description is edited once, and each set of descriptions is scored three times. At the end of the process, each audio file has ten descriptions, and each description has three scores for accuracy and three for fluency. The scores are used to sort the descriptions, and the top five descriptions are kept as the final captions for the file. Using this method, this thesis creates an audio captioning dataset for 5,000 audio files.
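The final selection step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not say how the accuracy and fluency scores are combined when sorting, so the sketch ranks by mean accuracy and uses mean fluency as a tie-breaker. The field names (`text`, `accuracy`, `fluency`), the function name `select_captions`, and the score values are hypothetical.

```python
from statistics import mean

def select_captions(descriptions, top_k=5):
    """Return the texts of the top_k descriptions for one audio file.

    Assumption (not stated in the abstract): descriptions are ranked by
    mean accuracy first, with mean fluency breaking ties.
    """
    def ranking_key(description):
        return (mean(description["accuracy"]), mean(description["fluency"]))

    # sorted() is stable, so equally scored descriptions keep their input order.
    ranked = sorted(descriptions, key=ranking_key, reverse=True)
    return [description["text"] for description in ranked[:top_k]]

# Synthetic example: ten descriptions for one file (five original, five edited),
# each with three accuracy scores and three fluency scores.
# The score values and their range are made up for illustration.
candidates = [
    {
        "text": f"caption {i}",
        "accuracy": [i % 4 + 1] * 3,
        "fluency": [(i * 3) % 4 + 1] * 3,
    }
    for i in range(10)
]

final_captions = select_captions(candidates)
print(final_captions)
```

A different combination rule, such as summing the two mean scores, only requires changing `ranking_key`.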

Associated publications

  1. S. Lipping, K. Drossos, and T. Virtanen, “Crowdsourcing a Dataset of Audio Captions,” in Proceedings of the International Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), Oct. 26–27, New York, NY, U.S.A., 2019
  2. K. Drossos, S. Lipping, and T. Virtanen, “Clotho: An Audio Captioning Dataset,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 04–08, Barcelona, Spain, 2020

