Stylianos Ioannis Mimilakis, Konstantinos Drossos, Tuomas Virtanen, and Gerald Schuller, “A Recurrent Encoder-Decoder Approach with Skip-Filtering Connections for Monaural Singing Voice Separation, ” in proceedings of the 27th IEEE International Workshop on Machine Learning for Signal Processing (MLSP), Sep. 25–28, Tokyo, Japan, 2017
The objective of deep learning methods based on encoder-decoder architectures for music source separation is to approximate either ideal time-frequency masks or spectral representations of the target music source(s). The spectral representations are then used to derive time-frequency masks. In this work we introduce a method to directly learn time-frequency masks from an observed mixture magnitude spectrum. We employ recurrent neural networks and train them using prior knowledge only for the magnitude spectrum of the target source. To assess the performance of the proposed method, we focus on the task of singing voice separation. The results from an objective evaluation show that our proposed method provides comparable results to deep learning based methods which operate over complicated signal representations. Compared to previous methods that approximate time-frequency masks, our method has increased performance of signal to distortion ratio by an average of 3.8 dB.
Konstantinos Drossos, Sharath Adavanne, and Tuomas Virtanen, “Automated Audio Captioning with Recurrent Neural Networks,” in proceedings of the 11th IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Oct. 15–18, New Paltz, N.Y. U.S.A., 2017.
We present the first approach to automated audio captioning. We employ an encoder-decoder scheme with an alignment model in between. The input to the encoder is a sequence of log mel-band energies calculated from an audio file, while the output is a sequence of words, i.e. a caption. The encoder is a multi-layered, bi-directional gated recurrent unit (GRU) and the decoder a multi-layered GRU with a classification layer connected to the last GRU of the decoder. The classification layer and the alignment model are fully connected layers with shared weights between timesteps. The proposed method is evaluated using data drawn from a commercial sound effects library, ProSound Effects. The resulting captions were rated through metrics utilized in machine translation and image captioning fields. Results from metrics show that the proposed method can predict words appearing in the original caption, but not always correctly ordered.
K. Drossos, S. Mimilakis, A. Floros, T. Virtanen, and G. Schuller, “Close Miking Empirical Practice Verification: A Source Separation Approach”, in proceedings of the 142nd Audio Engineering Society (AES) Convention, May 20–23, Berlin, Germany, 2017
Close miking represents a widely employed practice of placing a microphone very near to the sound source in order to capture more direct sound and minimize any pickup of ambient sound, including other, concurrently active sources. It is used by the audio engineering community for decades for audio recording, based on a number of empirical rules that were evolved during the recording practice itself. But can this empirical knowledge and close miking practice be systematically verified? In this work we aim to address this question based on an analytic methodology that employs techniques and metrics originating from the sound source separation evaluation field. In particular, we apply a quantitative analysis of the source separation capabilities of the close miking technique. The analysis is applied on a recording dataset obtained at multiple positions of a typical musical hall, multiple distances between the microphone and the sound source multiple microphone types and multiple level differences between the sound source and the ambient acoustic component. For all the above cases we calculate the Source to Interference Ratio (SIR) metric. The results obtained clearly demonstrate an optimum close-miking performance that matches the current empirical knowledge of professional audio recording.
Emre Çakir, Sharath Adavanne, Giambattista Parascandolo, Konstantinos Drossos, and Tuomas Virtanen, “Convolutional Recurrent Neural Networks for Bird Audio Detection,” in proceedings of the 25th European Signal Processing Conference (EUSIPCO), Aug. 28–Sep. 2, Kos, Greece, 2017
Bird sounds possess distinctive spectral structure which may exhibit small shifts in spectrum depending on the bird species and environmental conditions. In this paper, we propose using convolutional recurrent neural networks on the task of automated bird audio detection in real-life environments. In the proposed method, convolutional layers extract high dimensional, local frequency shift invariant features, while recurrent layers capture longer term dependencies between the features extracted from short time frames. This method achieves 88.5% Area Under ROC Curve (AUC) score on the unseen evaluation data and obtains the second place in the Bird Audio Detection challenge.
Sharath Adavanne, Konstantinos Drossos, Emre Çakir, and Tuomas Virtanen, “Stacked Convolutional and Recurrent Neural Networks for Bird Audio Detection,” in proceedings of the 25th European Signal Processing Conference (EUSIPCO), Aug. 28–Sep. 2, Kos, Greece, 2017
This paper studies the detection of bird calls in audio segments using stacked convolutional and recurrent neural networks. Data augmentation by blocks mixing and domain adaptation using a novel method of test mixing are proposed and evaluated in regard to making the method robust to unseen data. The contributions of two kinds of acoustic features (dominant frequency and log mel-band energy) and their combinations are studied in the context of bird audio detection. Our best achieved AUC measure on five cross-validations of the development data is 95.5% and 88.1% on the unseen evaluation data.
M. Malik, S. Adavanne, K. Drossos, T. Virtanen, D. Ticha, and R. Jarina, “Stacked Convolutional and Recurrent Neural Networks for Music Emotion Recognition”, in proceedings of the 14th Sound and Music Computing (SMC) conference, Jul. 5–8, Helsinki, Finland, 2017
This paper studies the emotion recognition from musical tracks in the 2-dimensional valence-arousal (V-A) emotional space. We propose a method based on convolutional (CNN) and recurrent neural networks (RNN), having significantly fewer parameters compared with state-of-the-art method for the same task. We utilize one CNN layer followed by two branches of RNNs trained separately for arousal and valence. The method was evaluated using the “MediaEval2015 emotion in music” dataset. We achieved an RMSE of 0.202 for arousal and 0.268 for valence, which is the best result reported on this dataset.