Stylianos Ioannis Mimilakis, Estefanía Cano, Derry Fitzgerald, Konstantinos Drossos, and Gerald Schuller, “Examining the Perceptual Effect of Alternative Objective Functions for Deep Learning Based Music Source Separation,” in proceedings of the IEEE Asilomar Conference on Signals, Systems, and Computers, Oct. 28–31, Pacific Grove, CA, U.S.A., 2018
In this study, we examine the effect of various objective functions used to optimize the recently proposed Masker and Denoiser (MaD) deep learning architecture for singing voice separation. The parameters of the MaD architecture are optimized using an objective function that contains a reconstruction criterion between the predicted and true magnitude spectra of the singing voice, and a regularization term. We examine various reconstruction criteria such as the generalized Kullback-Leibler divergence, mean squared error, and noise-to-mask ratio. We also explore regularization terms recently proposed for optimizing MaD, such as sparsity and TwinNetwork regularization. Results from both objective assessment and listening tests suggest that the TwinNetwork regularization yields improved singing voice separation quality.
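A minimal sketch of two of the reconstruction criteria named above (mean squared error and the generalized Kullback-Leibler divergence), computed between predicted and true magnitude spectra. This is an illustration in PyTorch, not the paper's implementation; the tensor shapes are assumed for the example, and the noise-to-mask ratio is omitted since it requires a psychoacoustic model.

```python
import torch

def mse_loss(pred_mag, true_mag):
    # Mean squared error between predicted and true magnitude spectra
    return torch.mean((pred_mag - true_mag) ** 2)

def generalized_kl(pred_mag, true_mag, eps=1e-8):
    # Generalized Kullback-Leibler divergence for non-negative spectrograms
    p, t = pred_mag + eps, true_mag + eps
    return torch.mean(t * torch.log(t / p) - t + p)

# Hypothetical usage with random tensors standing in for (frames, frequency bins)
pred, true = torch.rand(60, 1025), torch.rand(60, 1025)
loss = generalized_kl(pred, true)  # or mse_loss(pred, true)
```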
Konstantinos Drossos, Paul Magron, Stylianos Ioannis Mimilakis, and Tuomas Virtanen, “Harmonic-Percussive Source Separation with Deep Neural Networks and Phase Recovery,” in proceedings of the 16th International Workshop on Acoustic Signal Enhancement (IWAENC), Sep. 17–20, Tokyo, Japan, 2018
Harmonic/percussive source separation (HPSS) consists of separating the pitched instruments from the percussive parts in a music mixture. In this paper, we propose to apply the recently introduced Masker-Denoiser with twin networks (MaD TwinNet) system to this task. MaD TwinNet is a deep learning architecture that has reached state-of-the-art results in monaural singing voice separation. Herein, we propose to apply it to HPSS by using it to estimate the magnitude spectrogram of the percussive source. Then, we retrieve the complex-valued short-time Fourier transform of the sources by means of a phase recovery algorithm, which minimizes the reconstruction error and enforces the phase of the harmonic part to follow a sinusoidal phase model. Experiments conducted on realistic music mixtures show that this novel separation system outperforms the previous state-of-the-art kernel additive model approach.
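As a sketch of what a phase recovery algorithm that minimizes the reconstruction error can look like in practice, the following shows a basic MISI-style iteration, a standard consistency-based baseline; the sinusoidal phase model constraint on the harmonic part described in the paper is not reproduced, and the function name, STFT parameters, and magnitude inputs are assumptions for the example.

```python
import numpy as np
import librosa

def misi(mixture, est_mags, n_fft=2048, hop=512, n_iter=20):
    """Refine source phases so that the time-domain estimates sum to the mixture.
    `est_mags` is a list of estimated magnitude spectrograms (e.g., harmonic and
    percussive), assumed to come from a network such as MaD TwinNet."""
    X = librosa.stft(mixture, n_fft=n_fft, hop_length=hop)
    # Initialise every source with the mixture phase
    S = [m * np.exp(1j * np.angle(X)) for m in est_mags]
    for _ in range(n_iter):
        s = [librosa.istft(Si, hop_length=hop, length=len(mixture)) for Si in S]
        err = (mixture - sum(s)) / len(S)  # distribute the reconstruction error
        S = [m * np.exp(1j * np.angle(librosa.stft(si + err, n_fft=n_fft, hop_length=hop)))
             for m, si in zip(est_mags, s)]
    return [librosa.istft(Si, hop_length=hop, length=len(mixture)) for Si in S]
```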
Konstantinos Drossos, Stylianos Ioannis Mimilakis, Dmitry Serdyuk, Gerald Schuller, Tuomas Virtanen, and Yoshua Bengio, “MaD TwinNet: Masker-Denoiser Architecture with Twin Networks for Monaural Sound Source Separation,” in proceedings of the IEEE World Congress on Computational Intelligence/International Joint Conference on Neural Networks (WCCI/IJCNN), Jul. 8–13, Rio de Janeiro, Brazil, 2018
The monaural singing voice separation task focuses on predicting the singing voice from a single-channel music mixture signal. Current state-of-the-art (SOTA) results in monaural singing voice separation are obtained with deep learning based methods. In this work we present a novel recurrent neural approach that learns long-term temporal patterns and structures of a musical piece. We build upon the recently proposed Masker-Denoiser (MaD) architecture and enhance it with Twin Networks, a technique for regularizing a recurrent generative network using a backward-running copy of the network. We evaluate our method on the Demixing Secret Dataset and obtain an increase of 0.37 dB in signal-to-distortion ratio (SDR) and 0.23 dB in signal-to-interference ratio (SIR), compared to previous SOTA results.
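The following is a minimal PyTorch sketch of the Twin Networks idea described above, i.e., regularizing a forward recurrent network with a backward-running copy of it; the layer sizes, affine mapping, and loss handling are illustrative assumptions, and this is not the MaD TwinNet code.

```python
import torch
import torch.nn as nn

class TwinRNN(nn.Module):
    def __init__(self, in_dim=1025, hidden=512):
        super().__init__()
        self.fwd = nn.GRU(in_dim, hidden, batch_first=True)
        self.bwd = nn.GRU(in_dim, hidden, batch_first=True)
        self.affine = nn.Linear(hidden, hidden)  # maps forward states to backward targets

    def forward(self, x):
        h_f, _ = self.fwd(x)                     # (batch, time, hidden)
        h_b, _ = self.bwd(torch.flip(x, dims=[1]))
        h_b = torch.flip(h_b, dims=[1])          # re-align backward states in time
        # Penalise forward states whose affine map diverges from the backward states
        twin_loss = ((self.affine(h_f) - h_b.detach()) ** 2).mean()
        return h_f, h_b, twin_loss

# Hypothetical usage: the twin penalty is added (weighted) to the main separation loss
model = TwinRNN()
mix = torch.rand(4, 60, 1025)                    # (batch, frames, frequency bins)
h_f, h_b, twin_loss = model(mix)
```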
Stylianos Ioannis Mimilakis, Konstantinos Drossos, João F. Santos, Gerald Schuller, Tuomas Virtanen, and Yoshua Bengio, “Monaural Singing Voice Separation with Skip-Filtering Connections and Recurrent Inference of Time-Frequency Mask,” in proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 15–20, Calgary, Canada, 2018
Singing voice separation based on deep learning relies on the usage of time-frequency masking. In many cases the masking process is not a learnable function or is not encapsulated into the deep learning optimization. Consequently, most of the existing methods rely on a post-processing step using generalized Wiener filtering. This work proposes a method that learns and optimizes (during training) a source-dependent mask and does not need the aforementioned post-processing step. We introduce a recurrent inference algorithm, a sparse transformation step to improve the mask generation process, and a learned denoising filter. The obtained results show an increase of 0.49 dB in signal-to-distortion ratio and 0.30 dB in signal-to-interference ratio, compared to previous state-of-the-art approaches for monaural singing voice separation.
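A minimal sketch of the skip-filtering idea, in which the predicted time-frequency mask is applied to the network's own input inside the forward pass, so that the masking operation is part of the optimized computation graph rather than a post-processing step. The module names and sizes are assumptions, and this is not the paper's architecture (the recurrent inference and denoising filter are omitted).

```python
import torch
import torch.nn as nn

class SkipFilteringMasker(nn.Module):
    def __init__(self, n_bins=1025, hidden=512):
        super().__init__()
        self.rnn = nn.GRU(n_bins, hidden, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, n_bins), nn.ReLU())

    def forward(self, mix_mag):
        h, _ = self.rnn(mix_mag)
        m = self.mask(h)           # source-dependent, learnable mask
        return m * mix_mag         # skip-filtering connection: filter the input directly

# Hypothetical usage on a random magnitude spectrogram (batch, frames, frequency bins)
voice_est = SkipFilteringMasker()(torch.rand(4, 60, 1025))
```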
Paul Magron, Konstantinos Drossos, Stylianos Ioannis Mimilakis, and Tuomas Virtanen, “Reducing Interference with Phase Recovery in DNN-based Monaural Singing Voice Separation,” in proceedings of the INTERSPEECH 2018, Sep. 2–6, Hyderabad, India, 2018
State-of-the-art methods for monaural singing voice separation consist in estimating the magnitude spectrum of the voice in the short-time Fourier transform (STFT) domain by means of deep neural networks (DNNs). The resulting magnitude estimate is then combined with the mixture's phase to retrieve the complex-valued STFT of the voice, which is further synthesized into a time-domain signal. However, when the sources overlap in time and frequency, the STFT phase of the voice differs from the mixture's phase, which results in interference and artifacts in the estimated signals. In this paper, we investigate recent phase recovery algorithms that tackle this issue and can further enhance the separation quality. These algorithms exploit phase constraints that originate from a sinusoidal model or from consistency, a property that is a direct consequence of the STFT redundancy. Experiments conducted on real music songs show that these algorithms are effective at reducing interference in the estimated voice compared to the baseline approach.
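A minimal sketch of the baseline approach described above, assuming a magnitude estimate `voice_mag` produced by a separation DNN: the estimate is combined with the mixture's phase and inverted back to the time domain. The STFT parameters and function name are illustrative.

```python
import numpy as np
import librosa

def synthesize_with_mixture_phase(mixture, voice_mag, n_fft=2048, hop=512):
    X = librosa.stft(mixture, n_fft=n_fft, hop_length=hop)
    voice_stft = voice_mag * np.exp(1j * np.angle(X))  # reuse the mixture phase
    return librosa.istft(voice_stft, hop_length=hop, length=len(mixture))
```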
Shayan Gharib, Konstantinos Drossos, Emre Çakir, Dmitry Serdyuk, and Tuomas Virtanen, “Unsupervised adversarial domain adaptation for acoustic scene classification,” in proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE), Nov. 19–20, Surrey, U.K., 2018
A general problem in the acoustic scene classification task is the mismatch between training and testing conditions, which significantly reduces the classification accuracy of the developed methods. As a countermeasure, we present the first method of unsupervised adversarial domain adaptation for acoustic scene classification. We employ a model pre-trained on data from one set of conditions and, using data from another set of conditions, we adapt the model so that its output cannot be used to classify which set of conditions the input data belong to. We use a freely available dataset from the DCASE 2018 challenge Task 1, subtask B, which contains data from mismatched recording devices. We consider the scenario where annotations are available for the data recorded with one device, but not for the rest. Our results show that with our model-agnostic method we achieve a ∼10% increase in accuracy on an unseen and unlabeled dataset, while keeping almost the same performance on the labeled dataset.
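A minimal sketch of one unsupervised adversarial adaptation step under simplified assumptions (a feed-forward encoder and a binary domain discriminator); the network shapes, optimizers, and update scheme are illustrative and this is not the paper's exact training procedure.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 128))       # hypothetical feature extractor
discriminator = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))    # predicts the recording device
opt_e = torch.optim.Adam(encoder.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def adaptation_step(x_source, x_target):
    # 1) Train the discriminator to tell source-device features from target-device features
    d_loss = bce(discriminator(encoder(x_source).detach()), torch.ones(len(x_source), 1)) \
           + bce(discriminator(encoder(x_target).detach()), torch.zeros(len(x_target), 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # 2) Update the encoder so that target features become indistinguishable from source features
    e_loss = bce(discriminator(encoder(x_target)), torch.ones(len(x_target), 1))
    opt_e.zero_grad(); e_loss.backward(); opt_e.step()
    return d_loss.item(), e_loss.item()

# Hypothetical usage with random features standing in for the two devices
adaptation_step(torch.rand(8, 64), torch.rand(8, 64))
```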