Abstract
In recent years, machine learning algorithms based on deep neural networks (DNNs) have achieved state-of-the-art results across a variety of tasks. However, the performance of DNNs suffers from mismatched conditions between training and test datasets. This is a general machine learning problem affecting applications such as machine vision, natural language processing, and audio processing. For instance, in audio classification tasks, the use of different recording devices and the presence of ambient noise are among the factors that cause mismatched conditions between training and test datasets. Due to such mismatches, a DNN model that performs well during training suffers a drop in performance when evaluated on unseen data. To compensate for this performance reduction, domain adaptation methods have been employed to adapt DNNs to the conditions of the test dataset.
The objective of this thesis is to study unsupervised domain adaptation using adversarial training for acoustic scene classification. We first pre-train a model using data from one set of conditions. Subsequently, we retrain the model using data from another set of conditions, adapting it so that its output is a condition-invariant representation of the input. More specifically, we pit a discriminator against the model. The aim of the discriminator is to distinguish between data coming from different conditions, while the goal of the model is to confuse the discriminator so that it cannot differentiate between data from different conditions. The data that we use to optimize our models, i.e. the training data, were recorded using different recording devices than those used for the dataset on which the model is evaluated, i.e. the test dataset. The training data were recorded using a high-quality recording device, while the audio recordings in the test set were collected using two handheld consumer devices of mediocre quality. This introduces a mismatched condition that negatively affects the performance of the optimized DNN. In this thesis, we simulate a scenario in which we do not have access to the annotations of the dataset to which we adapt our model. Therefore, our method can be used as an unsupervised domain adaptation technique. In addition, we present results using two different DNN architectures, namely the Kaggle and DCASE models, to show that our method is model-agnostic and works regardless of the model used. The results show a significant improvement for the adapted Kaggle and DCASE models compared to the non-adapted ones, with approximate increases of 11% and 6%, respectively, in performance on unseen test data, while maintaining the same performance on unseen samples from the training-data conditions.
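As a concrete illustration of the adversarial training described above, the following is a minimal sketch of one adaptation step, assuming PyTorch; the network shapes, learning rates, and optimizers are illustrative placeholders rather than the exact configuration used in the thesis.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: "model" maps audio features (e.g. log-mel
# spectrograms, here assumed to be 40 x 500) to embeddings, and
# "discriminator" predicts which domain (recording device) an
# embedding came from.
model = nn.Sequential(nn.Flatten(), nn.Linear(40 * 500, 128), nn.ReLU())
discriminator = nn.Linear(128, 1)

bce = nn.BCEWithLogitsLoss()
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
opt_m = torch.optim.Adam(model.parameters(), lr=1e-4)

def adaptation_step(x_source, x_target):
    """One adversarial step on a batch from each recording condition."""
    # 1) Train the discriminator to tell the two domains apart
    #    (embeddings are detached so only the discriminator updates).
    z_s = model(x_source).detach()
    z_t = model(x_target).detach()
    d_loss = bce(discriminator(z_s), torch.ones(len(z_s), 1)) + \
             bce(discriminator(z_t), torch.zeros(len(z_t), 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # 2) Train the model to confuse the discriminator: target-domain
    #    embeddings are pushed to be classified as source-domain.
    z_t = model(x_target)
    m_loss = bce(discriminator(z_t), torch.ones(len(z_t), 1))
    opt_m.zero_grad()
    m_loss.backward()
    opt_m.step()

    return d_loss.item(), m_loss.item()
```

This sketch uses the label-flipping formulation of adversarial adaptation (the model is updated against inverted domain labels); a gradient-reversal layer is a common alternative realization of the same minimax objective.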
Associated publications
- S. Gharib, K. Drossos, E. Çakir, D. Serdyuk, and T. Virtanen, “Unsupervised adversarial domain adaptation for acoustic scene classification,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE), Nov. 19–20, Surrey, U.K., 2018.