Konstantinos Drossos, Samuel Lipping, Tuomas Virtanen, “Clotho: An Audio Captioning Dataset,” in proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 04–08, Barcelona, Spain, 2020
Audio captioning is the novel task of describing general audio content using free text. It is an intermodal translation task (not speech-to-text), where a system accepts an audio signal as input and outputs a textual description (i.e., a caption) of that signal. In this paper we present Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds duration and 24,905 captions of eight to 20 words in length, together with a baseline method to provide initial results. Clotho is built with a focus on audio content and caption diversity, and the splits of the data do not hamper the training or evaluation of methods. All sounds are from the Freesound platform, and captions are crowdsourced using Amazon Mechanical Turk with annotators from English-speaking countries. Unique words, named entities, and speech transcription are removed through post-processing. Clotho is freely available online (https://zenodo.org/record/3490684).
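As a minimal usage sketch, the captions in the Zenodo release can be read as follows; the file name and the column layout (a file_name column plus five caption columns) are assumed from the dataset's CSV releases and should be verified against the downloaded archive:

    # Minimal sketch: iterate over Clotho's development captions.
    # Assumes the CSV layout of the Zenodo release (file_name, caption_1..caption_5).
    import csv

    with open("clotho_captions_development.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            audio_file = row["file_name"]
            captions = [row[f"caption_{i}"] for i in range(1, 6)]
            print(audio_file, captions[0])
            break  # show the first entry only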
Niccolò Nicodemo, Gaurav Naithani, Konstantinos Drossos, Tuomas Virtanen, Roberto Saletti, “Memory Requirement Reduction of Deep Neural Networks Using Low-bit Quantization of Parameters,” in proceedings of the 28th European Signal Processing Conference (EUSIPCO), Jan. 18–22, 2021, Amsterdam, Netherlands
Effective deployment of deep neural networks (DNNs) in mobile devices and embedded systems is hampered by their memory and computational requirements. This paper presents a non-uniform quantization approach that allows DNN parameters to be quantized dynamically, both across different layers and within the same layer. A virtual bit shift (VBS) scheme is also proposed to improve the accuracy of the quantization. Our method reduces the memory requirements while preserving the performance of the network. The method is validated in a speech enhancement application, where a fully connected DNN is used to predict the clean speech spectrum from the noisy input speech spectrum. A DNN is optimized and its memory footprint and performance are evaluated using the short-time objective intelligibility (STOI) metric. The low-bit quantization yields a 50% reduction of the DNN memory footprint while STOI performance drops by only 2.7%.
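To illustrate the general idea (not the authors' exact VBS scheme), the sketch below quantizes the weights of each layer of a small fully connected network to a per-layer bit width, using a power-of-two scale so that de-quantization amounts to a bit shift; the per-layer bit widths are assumed for illustration:

    # Sketch of per-layer low-bit weight quantization. The power-of-two scale
    # loosely mimics a "bit shift" of the quantization grid; it is a simplified
    # stand-in for, not a reproduction of, the paper's VBS scheme.
    import math
    import torch

    def quantize_weights(w: torch.Tensor, n_bits: int) -> torch.Tensor:
        """Symmetric uniform quantization of a weight tensor to n_bits."""
        q_max = 2 ** (n_bits - 1) - 1                  # e.g. 7 for 4 bits
        # Power-of-two scale so de-quantization is a pure (virtual) bit shift.
        scale = 2.0 ** math.ceil(math.log2(w.abs().max().item() / q_max))
        q = torch.clamp(torch.round(w / scale), -q_max - 1, q_max)
        return q * scale                               # de-quantized weights

    # Small fully connected DNN in the spirit of the speech-enhancement setup.
    model = torch.nn.Sequential(
        torch.nn.Linear(257, 1024), torch.nn.ReLU(),
        torch.nn.Linear(1024, 257),
    )
    bits_per_layer = {0: 4, 2: 6}                      # assumed, not from the paper
    with torch.no_grad():
        for idx, n_bits in bits_per_layer.items():
            layer = model[idx]
            layer.weight.copy_(quantize_weights(layer.weight, n_bits))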
Yanxiong Li, Mingle Liu, Konstantinos Drossos, Tuomas Virtanen, “Sound event detection via dilated convolutional recurrent neural networks,” in proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 04–08, Barcelona, Spain, 2020
Convolutional recurrent neural networks (CRNNs) have achieved state-of-the-art performance in sound event detection (SED). In this paper, we propose to use a dilated CRNN, namely a CRNN with dilated convolutional kernels, as the classifier for SED. We investigate the effectiveness of dilation operations, which give a CRNN expanded receptive fields that capture long temporal context without increasing the number of the CRNN's parameters. Compared to the baseline CRNN classifier, the dilated CRNN classifier obtains a maximum increase of 1.9%, 6.3%, and 2.5% in F1 score and a maximum decrease of 1.7%, 4.1%, and 3.9% in error rate (ER) on the publicly available audio corpora TUT-SED Synthetic 2016, TUT Sound Event 2016, and TUT Sound Event 2017, respectively.
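The following PyTorch sketch shows the kind of dilated CRNN block the abstract describes: a 3x3 convolution with dilation 2 covers a 5x5 receptive field with no additional parameters, followed by a GRU and frame-wise classification. Layer sizes and class counts are illustrative assumptions, not the paper's configuration:

    # Minimal dilated CRNN sketch for SED, assuming log-mel spectrogram input
    # of shape (batch, 1, time, mel). Sizes are illustrative only.
    import torch
    import torch.nn as nn

    class DilatedCRNN(nn.Module):
        def __init__(self, n_mels: int = 40, n_classes: int = 6):
            super().__init__()
            # Dilation 2 on a 3x3 kernel: 5x5 receptive field, same parameter count.
            self.conv = nn.Sequential(
                nn.Conv2d(1, 64, kernel_size=3, dilation=2, padding=2),
                nn.BatchNorm2d(64), nn.ReLU(),
                nn.MaxPool2d((1, 4)),                  # pool frequency only
            )
            self.rnn = nn.GRU(64 * (n_mels // 4), 128,
                              batch_first=True, bidirectional=True)
            self.fc = nn.Linear(2 * 128, n_classes)

        def forward(self, x):                          # x: (batch, 1, T, n_mels)
            h = self.conv(x)                           # (batch, 64, T, n_mels//4)
            h = h.permute(0, 2, 1, 3).flatten(2)       # (batch, T, features)
            h, _ = self.rnn(h)
            return torch.sigmoid(self.fc(h))           # frame-wise class activity

    out = DilatedCRNN()(torch.randn(2, 1, 100, 40))    # -> (2, 100, 6)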
Konstantinos Drossos, Stylianos I. Mimilakis, Shayan Gharib, Yanxiong Li, Tuomas Virtanen, “Sound Event Detection with Depthwise Separable and Dilated Convolutions,” in proceedings of the IEEE World Congress on Computational Intelligence/International Joint Conference on Neural Networks (WCCI/IJCNN), Jul. 19–24, Glasgow, Scotland, 2020
State-of-the-art sound event detection (SED) methods usually employ a series of convolutional neural networks (CNNs) to extract useful features from the input audio signal, followed by recurrent neural networks (RNNs) to model longer temporal context in the extracted features. The number of channels of the CNNs and the size of the weight matrices of the RNNs directly determine the total number of parameters of the SED method, which can amount to a couple of millions. Additionally, the usually long input sequences, combined with the use of an RNN, bring complications such as increased training time, difficulties in gradient flow, and impeded parallelization of the SED method. To tackle these problems, we propose replacing the CNNs with depthwise separable convolutions and the RNNs with dilated convolutions. We compare the proposed method to a baseline convolutional neural network on an SED task, and achieve an 85% reduction in the number of parameters and a 78% reduction in average training time per epoch, together with a 4.6% increase in the average frame-wise F1 score and a 3.8% reduction in the average error rate.
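The two replacements can be illustrated with a short PyTorch sketch: a depthwise separable convolution that keeps a standard convolution's interface with far fewer parameters, and a stack of dilated 1-D convolutions over time whose receptive field grows exponentially with depth. Channel counts and dilation rates are illustrative assumptions, not the paper's configuration:

    # Standard vs. depthwise separable convolution, and a dilated temporal stack.
    import torch
    import torch.nn as nn

    def depthwise_separable(c_in: int, c_out: int, k: int = 3) -> nn.Module:
        """Depthwise (per-channel) convolution followed by a 1x1 pointwise mix."""
        return nn.Sequential(
            nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in),  # depthwise
            nn.Conv2d(c_in, c_out, 1),                              # pointwise
        )

    standard = nn.Conv2d(128, 256, 3, padding=1)
    separable = depthwise_separable(128, 256)
    count = lambda m: sum(p.numel() for p in m.parameters())
    print(count(standard), count(separable))  # ~295k vs ~34k parameters

    # Dilated 1-D convolutions over the time axis as an RNN replacement:
    # dilations 1, 2, 4 give an exponentially growing receptive field.
    temporal = nn.Sequential(*[
        nn.Conv1d(256, 256, 3, dilation=d, padding=d) for d in (1, 2, 4)
    ])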
Stylianos I. Mimilakis, Konstantinos Drossos, Gerald Schuller, “Unsupervised Interpretable Representation Learning for Singing Voice Separation,” in proceedings of the 28th European Signal Processing Conference (EUSIPCO), Jan. 18–22, 2021, Amsterdam, Netherlands
In this work, we present a method for learning interpretable music signal representations directly from waveform signals. Our method can be trained using unsupervised objectives and relies on a denoising auto-encoder model that uses a simple sinusoidal model as the decoding function to reconstruct the singing voice. To demonstrate the benefits of our method, we apply the obtained representations to the task of informed singing voice separation via binary masking, and measure the separation quality by means of the scale-invariant signal-to-distortion ratio. Our findings suggest that our method is capable of learning meaningful representations for singing voice separation, while preserving conveniences of the short-time Fourier transform, such as non-negativity, smoothness, and reconstruction subject to time-frequency masking, that are desired in audio and music source separation.
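A heavily simplified sketch of the core idea follows: an encoder predicts non-negative per-frame sinusoid amplitudes, and decoding is a fixed bank of sinusoids, so the learned representation behaves like an interpretable, non-negative analogue of a magnitude spectrogram. The dimensions, the fixed frequency grid, and the loss are assumptions for illustration, not the authors' exact model:

    # Denoising auto-encoder sketch with a sinusoidal decoder (simplified).
    import math
    import torch
    import torch.nn as nn

    n_sines, frame_len, sr = 128, 1024, 44100
    t = torch.arange(frame_len) / sr
    freqs = torch.linspace(80.0, 8000.0, n_sines)              # fixed, assumed grid
    basis = torch.sin(2 * math.pi * freqs[:, None] * t[None])  # (n_sines, frame_len)

    class SinusoidalDAE(nn.Module):
        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Linear(frame_len, 512), nn.ReLU(),
                nn.Linear(512, n_sines), nn.ReLU(),            # non-negative amplitudes
            )

        def forward(self, noisy_frames):                       # (batch, frame_len)
            amps = self.encoder(noisy_frames)                  # interpretable representation
            return amps, amps @ basis                          # sinusoidal reconstruction

    # Denoising objective: encode the mixture, reconstruct the clean voice.
    model = SinusoidalDAE()
    mixture, voice = torch.randn(8, frame_len), torch.randn(8, frame_len)
    amps, recon = model(mixture)
    loss = torch.mean((recon - voice) ** 2)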