X. Favory, K. Drossos, T. Virtanen, and X. Serra, "Learning Contextual Tag Embeddings for Cross-Modal Alignment of Audio and Tags," in proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Jun. 6-11, Toronto, Canada, 2021
Self-supervised audio representation learning offers an attractive alternative for obtaining generic audio embeddings that can be employed in various downstream tasks. Published approaches that consider both audio and words/tags associated with audio do not employ text processing models that are capable of generalizing to tags unknown during training. In this work we propose a method for learning audio representations using an audio autoencoder (AAE), a general word embeddings model (WEM), and a multi-head self-attention (MHA) mechanism. MHA attends to the output of the WEM, providing a contextualized representation of the tags associated with the audio, and we align the output of MHA with the output of the encoder of AAE using a contrastive loss. We jointly optimize AAE and MHA and we evaluate the audio representations (i.e. the output of the encoder of AAE) by utilizing them in three different downstream tasks, namely sound, music genre, and music instrument classification. Our results show that employing multi-head self-attention with multiple heads in the tag-based network can induce better learned audio representations.
A. Tran, K. Drossos, and T. Virtanen, "WaveTransformer: An Architecture for Audio Captioning Based on Learning Temporal and Time-Frequency Information," in proceedings of 29th European Signal Processing Conference (EUSIPCO), Aug. 23-27, Dublin, Ireland, 2021
Automated audio captioning (AAC) is a novel task, where a method takes as an input an audio sample and outputs a textual description (i.e. a caption) of its contents. Most AAC methods are adapted from image captioning or machine translation fields. In this work, we present a novel AAC method, explicitly focused on the exploitation of the temporal and time-frequency patterns in audio. We employ three learnable processes for audio encoding, two for extracting the local and temporal information, and one to merge the output of the previous two processes. To generate the caption, we employ the widely used Transformer decoder. We assess our method utilizing the freely available splits of the Clotho dataset. Our results increase previously reported highest SPIDEr to 17.3, from 16.2.
Konstantinos Drossos, Samuel Lipping, Tuomas Virtanen, “Clotho: An Audio Captioning Dataset,” in proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 04–08, Barcelona, Spain, 2020
Audio captioning is the novel task of general audio content description using free text. It is an intermodal translation task (not speech-to-text), where a system accepts as an input an audio signal and outputs the textual description (i.e. the caption) of that signal. In this paper we present Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds duration and 24 905 captions of eight to 20 words length, and a baseline method to provide initial results. Clotho is built with focus on audio content and caption diversity, and the splits of the data are not hampering the training or evaluation of methods. All sounds are from the Freesound platform, and captions are crowdsourced using Amazon Mechanical Turk and annotators from English speaking countries. Unique words, named entities, and speech transcription are removed with post-processing. Clotho is freely available online (https://zenodo.org/record/3490684).
Xavier Favory, Konstantinos Drossos, Tuomas Virtanen, and Xavier Serra, "COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations," in International Conference on Machine Learning (ICML), Workshop on Self-supervised learning in Audio and Speech, Jul. 17, virtually held, 2020
Audio representation learning based on deep neural networks (DNNs) emerged as an alternative approach to hand-crafted features. For achieving high performance, DNNs often need a large amount of annotated data which can be difficult and costly to obtain. In this paper, we propose a method for learning audio representations, aligning the learned latent representations of audio and associated tags. Aligning is done by maximizing the agreement of the latent representations of audio and tags, using a contrastive loss. The result is an audio embedding model which reflects acoustic and semantic characteristics of sounds. We evaluate the quality of our embedding model, measuring its performance as a feature extractor on three different tasks (namely, sound event recognition, and music genre and musical instrument classification), and investigate what type of characteristics the model captures. Our results are promising, sometimes on par with the state-of-the-art in the considered tasks, and the embeddings produced with our method are well correlated with some acoustic descriptors.
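The idea of maximizing agreement between paired audio and tag embeddings can be illustrated with a small sketch. The code below assumes one common form of contrastive loss (cross-entropy over temperature-scaled cosine similarities within a batch); it is not necessarily the exact loss used in the paper, and the function names, the temperature value, and the toy embeddings are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(audio_emb, tag_emb, temperature=0.1):
    """Mean cross-entropy of matching each audio embedding to its own tag embedding.

    audio_emb[i] and tag_emb[i] are assumed to come from the same sound;
    every other tag embedding in the batch acts as a negative example.
    """
    losses = []
    for i, audio in enumerate(audio_emb):
        logits = [cosine(audio, tags) / temperature for tags in tag_emb]
        # Log-sum-exp with max subtraction for numerical stability.
        largest = max(logits)
        log_denominator = largest + math.log(
            sum(math.exp(x - largest) for x in logits)
        )
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy batch: pairs that agree give a loss near zero, swapped pairs a large loss.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing such a loss pulls each audio embedding towards the embedding of its own tags and away from the tags of the other sounds in the batch.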
Pyry Pyykkönen, Stylianos I. Mimilakis, Konstantinos Drossos, and Tuomas Virtanen, "Depthwise Separable Convolutions Versus Recurrent Neural Networks for Monaural Singing Voice Separation," in proceedings of the 22nd IEEE International Workshop on Multimedia Signal Processing (MMSP), Sep. 21-24, Tampere, Finland, 2020
Recent approaches for music source separation are almost exclusively based on deep neural networks, mostly employing recurrent neural networks (RNNs). Although RNNs are in many cases superior to other types of deep neural networks for sequence processing, they are known to have specific difficulties in training and parallelization, especially for the typically long sequences encountered in music source separation. In this paper we present a use-case of replacing RNNs with depth-wise separable (DWS) convolutions, which are a lightweight and faster variant of the typical convolutions. We focus on singing voice separation, employing an RNN architecture, and we replace the RNNs with DWS convolutions (DWS-CNNs). We conduct an ablation study and examine the effect of the number of channels and layers of DWS-CNNs on the source separation performance, by utilizing the standard metrics of signal-to-artifacts, signal-to-interference, and signal-to-distortion ratio. Our results show that replacing RNNs with DWS-CNNs yields an improvement of 1.20, 0.06, 0.37 dB, respectively, while using only 20.57% of the amount of parameters of the RNN architecture.
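Why DWS convolutions are lightweight can be seen from a parameter count. The sketch below compares a standard 2D convolution with a depthwise separable one (a per-channel depthwise convolution followed by a 1x1 pointwise convolution), ignoring biases. The channel counts and kernel size are hypothetical and are not the configurations evaluated in the paper.

```python
def standard_conv_params(in_channels, out_channels, kernel_size):
    """Weights of a standard 2D convolution with a square kernel (no bias)."""
    return kernel_size * kernel_size * in_channels * out_channels


def dws_conv_params(in_channels, out_channels, kernel_size):
    """Weights of a depthwise separable 2D convolution (no bias)."""
    depthwise = kernel_size * kernel_size * in_channels  # one kernel per input channel
    pointwise = in_channels * out_channels               # 1x1 convolution mixing channels
    return depthwise + pointwise


# Hypothetical layer: 64 input channels, 128 output channels, 3x3 kernel.
standard = standard_conv_params(64, 128, 3)   # 73728 weights
separable = dws_conv_params(64, 128, 3)       # 576 + 8192 = 8768 weights
ratio = separable / standard                  # roughly 0.12
```

For this hypothetical layer the separable variant needs about 12% of the weights of the standard convolution; the exact saving depends on the kernel size and the number of output channels.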
Niccoló Nicodemo, Gaurav Naithani, Konstantinos Drossos, Tuomas Virtanen, Roberto Saletti, "Memory Requirement Reduction of Deep Neural Networks Using Low-bit Quantization of Parameters," in proceedings of the 28th European Signal Processing Conference (EUSIPCO), Aug. 24 - 28, Amsterdam, Netherlands, 2020
Effective employment of deep neural networks (DNNs) in mobile devices and embedded systems is hampered by requirements for memory and computational power. This paper presents a non-uniform quantization approach which allows for dynamic quantization of DNN parameters for different layers and within the same layer. A virtual bit shift (VBS) scheme is also proposed to improve the accuracy of the proposed scheme. Our method reduces the memory requirements, preserving the performance of the network. The performance of our method is validated in a speech enhancement application, where a fully connected DNN is used to predict the clean speech spectrum from the input noisy speech spectrum. A DNN is optimized and its memory footprint and performance are evaluated using the short-time objective intelligibility, STOI, metric. The application of the low-bit quantization allows a 50% reduction of the DNN memory footprint while the STOI performance drops only by 2.7%.
Emre Çakır, Konstantinos Drossos, Tuomas Virtanen, "Multi-task Regularization Based on Infrequent Classes for Audio Captioning," in proceedings of the International Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), Nov. 2-3, Tokyo, Japan (full virtual), 2020
Audio captioning is a multi-modal task, focusing on using natural language for describing the contents of general audio. Most audio captioning methods are based on deep neural networks, employing an encoder-decoder scheme and a dataset with audio clips and corresponding natural language descriptions (i.e. captions). A significant challenge for audio captioning is the distribution of words in the captions: some words are very frequent but acoustically non-informative, i.e. the function words (e.g. "a", "the"), and other words are infrequent but informative, i.e. the content words (e.g. adjectives, nouns). In this paper we propose two methods to mitigate this class imbalance problem. First, in an autoencoder setting for audio captioning, we weigh each word's contribution to the training loss inversely proportional to its number of occurrences in the whole dataset. Secondly, in addition to the multi-class, word-level audio captioning task, we define a multi-label side task based on clip-level content word detection by training a separate decoder. We use the loss from the second task to regularize the jointly trained encoder for the audio captioning task. We evaluate our method using Clotho, a recently published, wide-scale audio captioning dataset, and our results show a 37% relative improvement in the SPIDEr metric over the baseline method.
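The first of the two methods, weighting each word inversely to its number of occurrences, can be sketched as follows. This is a minimal illustration under simplifying assumptions: whitespace tokenization, weights of exactly 1/count with no further normalization, and per-word losses supplied as plain numbers. The function names and the toy captions are hypothetical and do not come from the paper.

```python
from collections import Counter


def inverse_frequency_weights(captions):
    """Map each word to 1 / (number of occurrences in all captions)."""
    counts = Counter(word for caption in captions for word in caption.split())
    return {word: 1.0 / count for word, count in counts.items()}


def weighted_caption_loss(words, per_word_losses, weights):
    """Sum per-word losses, each scaled by the weight of its target word."""
    return sum(weights[w] * loss for w, loss in zip(words, per_word_losses))


# Toy dataset: the function word "a" occurs three times, content words once.
captions = ["a dog barks", "a car passes", "a bird sings"]
weights = inverse_frequency_weights(captions)

# With equal raw losses, "dog" contributes three times as much as "a".
loss = weighted_caption_loss(["a", "dog"], [1.0, 1.0], weights)
```

In this toy example the frequent function word contributes a third of what a rare content word contributes, which is the effect the weighting is meant to achieve.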
Antonio J. Muñoz-Montoro, Julio J. Carabias-Orti, Archontis Politis, and Konstantinos Drossos, "Multichannel Singing Voice Separation by Deep Neural Network Informed DOA Constrained CNMF," in proceedings of the 22nd IEEE International Workshop on Multimedia Signal Processing (MMSP), Sep. 21-24, Tampere, Finland, 2020
This work addresses the problem of multichannel source separation combining two powerful approaches, multichannel spectral factorization with recent monophonic deep-learning (DL) based spectrum inference. Individual source spectra at different channels are estimated with a Masker-Denoiser Twin Network (MaD TwinNet), able to model long-term temporal patterns of a musical piece. The monophonic source spectrograms are used within a spatial covariance mixing model based on Complex Non-Negative Matrix Factorization (CNMF) that predicts the spatial characteristics of each source. The proposed framework is evaluated on the task of singing voice separation with a large multichannel dataset. Experimental results show that our joint DL+CNMF method outperforms both the individual monophonic DL-based separation and the multichannel CNMF baseline methods.
Yanxiong Li, Mingle Liu, Konstantinos Drossos, Tuomas Virtanen, “Sound event detection via dilated convolutional recurrent neural networks,” in proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 04–08, Barcelona, Spain, 2020
Convolutional recurrent neural networks (CRNNs) have achieved state-of-the-art performance for sound event detection (SED). In this paper, we propose to use a dilated CRNN, namely a CRNN with a dilated convolutional kernel, as the classifier for the task of SED. We investigate the effectiveness of dilation operations which provide a CRNN with expanded receptive fields to capture long temporal context without increasing the amount of CRNN's parameters. Compared to the classifier of the baseline CRNN, the classifier of the dilated CRNN obtains a maximum increase of 1.9%, 6.3% and 2.5% at F1 score and a maximum decrease of 1.7%, 4.1% and 3.9% at error rate (ER), on the publicly available audio corpora of the TUT-SED Synthetic 2016, the TUT Sound Event 2016 and the TUT Sound Event 2017, respectively.
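The claim that dilation expands the receptive field without adding parameters can be checked with a short calculation. The sketch below assumes a stack of 1D convolutions with stride 1 and the same kernel size in every layer; the kernel size and dilation rates are hypothetical and are not taken from the paper.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in time steps) of stacked stride-1 convolutions."""
    field = 1
    for dilation in dilations:
        # Each layer widens the field by (kernel_size - 1) * dilation steps.
        field += (kernel_size - 1) * dilation
    return field


def weights_per_layer(kernel_size, in_channels, out_channels):
    """Weight count of a 1D convolution; it does not depend on the dilation."""
    return kernel_size * in_channels * out_channels


plain = receptive_field(3, [1, 1, 1])     # 7 time steps
dilated = receptive_field(3, [1, 2, 4])   # 15 time steps
```

Both stacks use three layers with the same number of weights per layer, yet the dilated stack covers about twice as many time steps.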
Konstantinos Drossos, Stylianos I. Mimilakis, Shayan Gharib, Yanxiong Li, Tuomas Virtanen “Sound Event Detection with Depthwise Separable and Dilated Convolutions,” in proceedings of the IEEE World Congress on Computational Intelligence/International Joint Conference on Neural Networks (WCCI/IJCNN), Jul. 19–24, Glasgow, Scotland, 2020
State-of-the-art sound event detection (SED) methods usually employ a series of convolutional neural networks (CNNs) to extract useful features from the input audio signal, and then recurrent neural networks (RNNs) to model longer temporal context in the extracted features. The number of the channels of the CNNs and size of the weight matrices of the RNNs have a direct effect on the total amount of parameters of the SED method, which amounts to a couple of million. Additionally, the usually long sequences that are used as an input to an SED method, along with the employment of an RNN, introduce implications like increased training time, difficulty at gradient flow, and impeding the parallelization of the SED method. To tackle all these problems, we propose the replacement of the CNNs with depthwise separable convolutions and the replacement of the RNNs with dilated convolutions. We compare the proposed method to a baseline convolutional neural network on a SED task, and achieve a reduction of the amount of parameters by 85% and average training time per epoch by 78%, and an increase of the average frame-wise F1 score and reduction of the average error rate by 4.6% and 3.8%, respectively.
Khoa Nguyen, Konstantinos Drossos, Tuomas Virtanen, "Temporal Sub-sampling of Audio Feature Sequences for Automated Audio Captioning," in proceedings of the International Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), Nov. 2-3, Tokyo, Japan (full virtual), 2020
Audio captioning is the task of automatically creating a textual description for the contents of a general audio signal. Typical audio captioning methods rely on deep neural networks (DNNs), where the target of the DNN is to map the input audio sequence to an output sequence of words, i.e. the caption. However, the length of the textual description is considerably less than the length of the audio signal, for example 10 words versus some thousands of audio feature vectors. This clearly indicates that an output word corresponds to multiple input feature vectors. In this work we present an approach that focuses on explicitly taking advantage of this difference of lengths between sequences, by applying a temporal sub-sampling to the audio input sequence. We employ a sequence-to-sequence method, which uses a fixed-length vector as an output from the encoder, and we apply temporal sub-sampling between the RNNs of the encoder. We evaluate the benefit of our approach by employing the freely available dataset Clotho and we evaluate the impact of different factors of temporal sub-sampling. Our results show an improvement to all considered metrics.
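Temporal sub-sampling itself is a simple operation, illustrated below. The sketch assumes that sub-sampling with factor n means keeping every n-th vector of a sequence, and it omits the RNN layers between which the paper applies the operation. The function name, the factor, and the toy sequence are hypothetical.

```python
def subsample(sequence, factor):
    """Keep every `factor`-th element of a sequence of feature vectors."""
    return sequence[::factor]


# Toy input: 12 time steps, each a 4-dimensional feature vector.
frames = [[float(t)] * 4 for t in range(12)]

# Sub-sampling by 2 after each of two (omitted) encoder layers.
after_first = subsample(frames, 2)        # 6 time steps remain
after_second = subsample(after_first, 2)  # 3 time steps remain
```

Each application shortens the sequence by the chosen factor, so later layers process fewer time steps and the audio sequence length moves closer to the caption length.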
Stylianos I. Mimilakis, Konstantinos Drossos, Gerald Schuller, "Unsupervised Interpretable Representation Learning for Singing Voice Separation," in proceedings of the 28th European Signal Processing Conference (EUSIPCO), Jan. 18 - 22 (2021), Amsterdam, Netherlands, 2020
In this work, we present a method for learning interpretable music signal representations directly from waveform signals. Our method can be trained using unsupervised objectives and relies on the denoising auto-encoder model that uses a simple sinusoidal model as decoding functions to reconstruct the singing voice. To demonstrate the benefits of our method, we apply the obtained representations to the task of informed singing voice separation via binary masking, and measure the obtained separation quality by means of scale-invariant signal to distortion ratio. Our findings suggest that our method is capable of learning meaningful representations for singing voice separation, while preserving conveniences of the short-time Fourier transform like non-negativity, smoothness, and reconstruction subject to time-frequency masking, that are desired in audio and music source separation.
Samuel Lipping, Konstantinos Drossos, and Tuomas Virtanen, “Crowdsourcing a Dataset of Audio Captions,” in proceedings of the International Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), Oct. 26–27, New York, NY, U.S.A., 2019
Audio captioning is a novel field of multi-modal translation and it is the task of creating a textual description of the content of an audio signal (e.g. "people talking in a big room"). The creation of a dataset for this task requires a considerable amount of work, rendering crowdsourcing a very attractive option. In this paper we present a three-step framework for crowdsourcing an audio captioning dataset, based on concepts and practises followed for the creation of widely used image captioning and machine translation datasets. During the first step initial captions are gathered. A grammatically corrected and/or rephrased version of each initial caption is obtained in the second step. Finally, the initial and edited captions are rated, keeping the top ones for the produced dataset. We objectively evaluate the impact of our framework during the process of creating an audio captioning dataset, in terms of diversity and amount of typographical errors in the obtained captions. The obtained results show that the resulting dataset has fewer typographical errors than the initial captions, and on average each sound in the produced dataset has captions with a Jaccard similarity of 0.24, roughly equivalent to two ten-word captions having in common four words with the same root, indicating that the captions are dissimilar while they still contain some of the same information.
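The "two ten-word captions with four words in common" reading of the Jaccard similarity can be verified directly: 4 shared words out of 10 + 10 - 4 = 16 distinct words gives 0.25, close to the reported 0.24. The sketch below computes the similarity on whitespace-separated words and, unlike the analysis in the abstract, does not reduce words to their roots. The function name and the placeholder tokens are hypothetical.

```python
def jaccard(caption_a, caption_b):
    """Jaccard similarity of the word sets of two non-empty captions."""
    words_a = set(caption_a.split())
    words_b = set(caption_b.split())
    return len(words_a & words_b) / len(words_a | words_b)


# Two ten-word captions built from placeholder tokens, sharing w7..w10.
caption_a = " ".join(f"w{i}" for i in range(1, 11))   # w1 ... w10
caption_b = " ".join(f"w{i}" for i in range(7, 17))   # w7 ... w16

similarity = jaccard(caption_a, caption_b)  # 4 / 16 = 0.25
```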
Konstantinos Drossos, Shayan Gharib, Paul Magron, and Tuomas Virtanen, “Language Modelling for Sound Event Detection with Teacher Forcing and Scheduled Sampling,” in proceedings of the International Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), Oct. 26–27, New York, NY, U.S.A., 2019
A sound event detection (SED) method typically takes as an input a sequence of audio frames and predicts the activities of sound events in each frame. In real-life recordings, the sound events exhibit some temporal structure: for instance, a "car horn" will likely be followed by a "car passing by". While this temporal structure is widely exploited in sequence prediction tasks (e.g., in machine translation), where language models (LM) are exploited, it is not satisfactorily modeled in SED. In this work we propose a method which allows a recurrent neural network (RNN) to learn an LM for the SED task. The method conditions the input of the RNN with the activities of classes at the previous time step. We evaluate our method using F1 score and error rate (ER) over three different and publicly available datasets; the TUT-SED Synthetic 2016 and the TUT Sound Events 2016 and 2017 datasets. The obtained results show an increase of 6% and 3% at the F1 (higher is better) and a decrease of 3% and 2% at ER (lower is better) for the TUT Sound Events 2016 and 2017 datasets, respectively, when using our method. On the contrary, with our method there is a decrease of 10% at F1 score and an increase of 11% at ER for the TUT-SED Synthetic 2016 dataset.
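The conditioning of the RNN input on the class activities of the previous time step can be sketched as follows. The code assumes the usual formulation of the two techniques named in the title: with teacher forcing the previous ground-truth activities are used, and with scheduled sampling the model's own previous prediction is used with some probability. How that probability evolves during training is not reproduced here, and all names and values are hypothetical.

```python
import random


def choose_conditioning(ground_truth_prev, predicted_prev, p_use_prediction, rng):
    """Pick the class activities the RNN is conditioned on at this step.

    p_use_prediction = 0.0 is pure teacher forcing; 1.0 always feeds back
    the model's own previous prediction.
    """
    if rng.random() < p_use_prediction:
        return predicted_prev
    return ground_truth_prev


def build_rnn_input(audio_features, previous_activity):
    """Concatenate the current audio features with the chosen activities."""
    return list(audio_features) + list(previous_activity)


rng = random.Random(0)
truth_prev = [1, 0]       # hypothetical activities: "car horn" active
predicted_prev = [0, 1]   # hypothetical (wrong) model prediction

teacher_forced = choose_conditioning(truth_prev, predicted_prev, 0.0, rng)
self_fed = choose_conditioning(truth_prev, predicted_prev, 1.0, rng)
step_input = build_rnn_input([0.5, 0.2], teacher_forced)
```

Feeding back the model's own predictions during training exposes it to the kind of inputs it will see at test time, when no ground truth is available.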
Konstantinos Drossos, Paul Magron, and Tuomas Virtanen, “Unsupervised Adversarial Domain Adaptation Based On The Wasserstein Distance For Acoustic Scene Classification,” accepted for publication at the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Oct. 20–23, New Paltz, NY, U.S.A., 2019
A challenging problem in the field of deep learning-based machine listening is the degradation of the performance when using data from unseen conditions. In this paper we focus on the acoustic scene classification (ASC) task and propose an adversarial deep learning method to allow adapting an acoustic scene classification system to deal with a new acoustic channel resulting from data captured with a different recording device. We build upon the theoretical model of HΔH-distance and a previous adversarial discriminative deep learning method for ASC unsupervised domain adaptation, and we present an adversarial training based method using the Wasserstein distance. We improve the state-of-the-art mean accuracy on the data from the unseen conditions from 32% to 45%, using the TUT Acoustic Scenes dataset.
Stylianos Ioannis Mimilakis, Estefanía Cano, Derry Fitzgerald, Konstantinos Drossos, and Gerald Schuller, “Examining the Perceptual Effect of Alternative Objective Functions for Deep Learning Based Music Source Separation,” in proceedings of the IEEE Asilomar Conference on Signals, Systems, and Computers, Oct. 28–31, Pacific Grove, CA, U.S.A., 2018
In this study, we examine the effect of various objective functions used to optimize the recently proposed deep learning architecture for singing voice separation MaD - Masker and Denoiser. The parameters of the MaD architecture are optimized using an objective function that contains a reconstruction criterion between predicted and true magnitude spectra of the singing voice, and a regularization term. We examine various reconstruction criteria such as the generalized Kullback-Leibler, mean squared error, and noise to mask ratio. We also explore regularization terms recently proposed for optimizing MaD, such as sparsity and TwinNetwork regularization. Results from both objective assessment and listening tests suggest that the TwinNetwork regularization results in improved singing voice separation quality.
Konstantinos Drossos, Paul Magron, Stylianos Ioannis Mimilakis, and Tuomas Virtanen, “Harmonic-Percussive Source Separation with Deep Neural Networks and Phase Recovery,” in proceedings of the 16th International Workshop on Acoustic Signal Enhancement (IWAENC), Sep. 17–20, Tokyo, Japan, 2018
Harmonic/percussive source separation (HPSS) consists in separating the pitched instruments from the percussive parts in a music mixture. In this paper, we propose to apply the recently introduced Masker-Denoiser with twin networks (MaD TwinNet) system to this task. MaD TwinNet is a deep learning architecture that has reached state-of-the-art results in monaural singing voice separation. Herein, we propose to apply it to HPSS by using it to estimate the magnitude spectrogram of the percussive source. Then, we retrieve the complex-valued short-time Fourier transform of the sources by means of a phase recovery algorithm, which minimizes the reconstruction error and enforces the phase of the harmonic part to follow a sinusoidal phase model. Experiments conducted on realistic music mixtures show that this novel separation system outperforms the previous state-of-the-art kernel additive model approach.
Konstantinos Drossos, Stylianos Ioannis Mimilakis, Dmitry Serdyuk, Gerald Schuller, Tuomas Virtanen, and Yoshua Bengio, “MaD TwinNet: Masker-Denoiser Architecture with Twin Networks for Monaural Sound Source Separation,” in proceedings of the IEEE World Congress on Computational Intelligence/International Joint Conference on Neural Networks (WCCI/IJCNN), Jul. 8–13, Rio de Janeiro, Brazil, 2018
The monaural singing voice separation task focuses on the prediction of the singing voice from a single channel music mixture signal. Current state of the art (SOTA) results in monaural singing voice separation are obtained with deep learning based methods. In this work we present a novel recurrent neural approach that learns long-term temporal patterns and structures of a musical piece. We build upon the recently proposed Masker-Denoiser (MaD) architecture and we enhance it with the Twin Networks, a technique to regularize a recurrent generative network using a backward running copy of the network. We evaluate our method using the Demixing Secret Dataset and we obtain an increment to signal-to-distortion ratio (SDR) of 0.37 dB and to signal-to-interference ratio (SIR) of 0.23 dB, compared to previous SOTA results.
Stylianos Ioannis Mimilakis, Konstantinos Drossos, João F. Santos, Gerald Schuller, Tuomas Virtanen, and Yoshua Bengio, “Monaural Singing Voice Separation with Skip-Filtering Connections and Recurrent Inference of Time-Frequency Mask,” in proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 15–20, Calgary, Canada, 2018
Singing voice separation based on deep learning relies on the usage of time-frequency masking. In many cases the masking process is not a learnable function or is not encapsulated into the deep learning optimization. Consequently, most of the existing methods rely on a post processing step using the generalized Wiener filtering. This work proposes a method that learns and optimizes (during training) a source-dependent mask and does not need the aforementioned post processing step. We introduce a recurrent inference algorithm, a sparse transformation step to improve the mask generation process, and a learned denoising filter. Obtained results show an increase of 0.49 dB for the signal to distortion ratio and 0.30 dB for the signal to interference ratio, compared to previous state-of-the-art approaches for monaural singing voice separation.
Paul Magron, Konstantinos Drossos, Stylianos Ioannis Mimilakis, and Tuomas Virtanen, “Reducing Interference with Phase Recovery in DNN-based Monaural Singing Voice Separation,” in proceedings of the INTERSPEECH 2018, Sep. 2–6, Hyderabad, India, 2018
State-of-the-art methods for monaural singing voice separation consist in estimating the magnitude spectrum of the voice in the short-time Fourier transform (STFT) domain by means of deep neural networks (DNNs). The resulting magnitude estimate is then combined with the mixture's phase to retrieve the complex-valued STFT of the voice, which is further synthesized into a time-domain signal. However, when the sources overlap in time and frequency, the STFT phase of the voice differs from the mixture's phase, which results in interference and artifacts in the estimated signals. In this paper, we investigate recent phase recovery algorithms that tackle this issue and can further enhance the separation quality. These algorithms exploit phase constraints that originate from a sinusoidal model or from consistency, a property that is a direct consequence of the STFT redundancy. Experiments conducted on real music songs show that those algorithms are efficient for reducing interference in the estimated voice compared to the baseline approach.
Shayan Gharib, Konstantinos Drossos, Emre Çakir, Dmitry Serdyuk, and Tuomas Virtanen, “Unsupervised adversarial domain adaptation for acoustic scene classification,” in proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE), Nov. 19–20, Surrey, U.K., 2018
A general problem in the acoustic scene classification task is the mismatched conditions between training and testing data, which significantly reduces the classification accuracy of the developed methods. As a countermeasure, we present the first method of unsupervised adversarial domain adaptation for acoustic scene classification. We employ a model pre-trained on data from one set of conditions and, by using data from another set of conditions, we adapt the model so that its output cannot be used for classifying the set of conditions that the input data belong to. We use a freely available dataset from the DCASE 2018 challenge Task 1, subtask B, that contains data from mismatched recording devices. We consider the scenario where the annotations are available for the data recorded from one device, but not for the rest. Our results show that with our model agnostic method we can achieve a ∼10% increase in accuracy on an unseen and unlabeled dataset, while keeping almost the same performance on the labeled dataset.
Stylianos Ioannis Mimilakis, Konstantinos Drossos, Tuomas Virtanen, and Gerald Schuller, “A Recurrent Encoder-Decoder Approach with Skip-Filtering Connections for Monaural Singing Voice Separation,” in proceedings of the 27th IEEE International Workshop on Machine Learning for Signal Processing (MLSP), Sep. 25–28, Tokyo, Japan, 2017
The objective of deep learning methods based on encoder-decoder architectures for music source separation is to approximate either ideal time-frequency masks or spectral representations of the target music source(s). The spectral representations are then used to derive time-frequency masks. In this work we introduce a method to directly learn time-frequency masks from an observed mixture magnitude spectrum. We employ recurrent neural networks and train them using prior knowledge only for the magnitude spectrum of the target source. To assess the performance of the proposed method, we focus on the task of singing voice separation. The results from an objective evaluation show that our proposed method provides comparable results to deep learning based methods which operate over complicated signal representations. Compared to previous methods that approximate time-frequency masks, our method has increased performance of signal to distortion ratio by an average of 3.8 dB.
Konstantinos Drossos, Sharath Adavanne, and Tuomas Virtanen, “Automated Audio Captioning with Recurrent Neural Networks,” in proceedings of the 11th IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Oct. 15–18, New Paltz, N.Y. U.S.A., 2017.
We present the first approach to automated audio captioning. We employ an encoder-decoder scheme with an alignment model in between. The input to the encoder is a sequence of log mel-band energies calculated from an audio file, while the output is a sequence of words, i.e. a caption. The encoder is a multi-layered, bi-directional gated recurrent unit (GRU) and the decoder a multi-layered GRU with a classification layer connected to the last GRU of the decoder. The classification layer and the alignment model are fully connected layers with shared weights between timesteps. The proposed method is evaluated using data drawn from a commercial sound effects library, ProSound Effects. The resulting captions were rated through metrics utilized in machine translation and image captioning fields. Results from metrics show that the proposed method can predict words appearing in the original caption, but not always correctly ordered.
Konstantinos Drossos, Stylianos Ioannis Mimilakis, Andreas Floros, Tuomas Virtanen, and Gerald Schuller, “Close Miking Empirical Practice Verification: A Source Separation Approach”, in proceedings of the 142nd Audio Engineering Society (AES) Convention, May 20–23, Berlin, Germany, 2017
Close miking represents a widely employed practice of placing a microphone very near to the sound source in order to capture more direct sound and minimize any pickup of ambient sound, including other, concurrently active sources. It has been used by the audio engineering community for decades for audio recording, based on a number of empirical rules that evolved during the recording practice itself. But can this empirical knowledge and close miking practice be systematically verified? In this work we aim to address this question based on an analytic methodology that employs techniques and metrics originating from the sound source separation evaluation field. In particular, we apply a quantitative analysis of the source separation capabilities of the close miking technique. The analysis is applied on a recording dataset obtained at multiple positions of a typical musical hall, multiple distances between the microphone and the sound source, multiple microphone types, and multiple level differences between the sound source and the ambient acoustic component. For all the above cases we calculate the Source to Interference Ratio (SIR) metric. The results obtained clearly demonstrate an optimum close-miking performance that matches the current empirical knowledge of professional audio recording.
Emre Çakir, Sharath Adavanne, Giambattista Parascandolo, Konstantinos Drossos, and Tuomas Virtanen, “Convolutional Recurrent Neural Networks for Bird Audio Detection,” in proceedings of the 25th European Signal Processing Conference (EUSIPCO), Aug. 28–Sep. 2, Kos, Greece, 2017
Bird sounds possess distinctive spectral structure which may exhibit small shifts in spectrum depending on the bird species and environmental conditions. In this paper, we propose using convolutional recurrent neural networks on the task of automated bird audio detection in real-life environments. In the proposed method, convolutional layers extract high dimensional, local frequency shift invariant features, while recurrent layers capture longer term dependencies between the features extracted from short time frames. This method achieves 88.5% Area Under ROC Curve (AUC) score on the unseen evaluation data and obtains the second place in the Bird Audio Detection challenge.
Sharath Adavanne, Konstantinos Drossos, Emre Çakir, and Tuomas Virtanen, “Stacked Convolutional and Recurrent Neural Networks for Bird Audio Detection,” in proceedings of the 25th European Signal Processing Conference (EUSIPCO), Aug. 28–Sep. 2, Kos, Greece, 2017
This paper studies the detection of bird calls in audio segments using stacked convolutional and recurrent neural networks. Data augmentation by blocks mixing and domain adaptation using a novel method of test mixing are proposed and evaluated in regard to making the method robust to unseen data. The contributions of two kinds of acoustic features (dominant frequency and log mel-band energy) and their combinations are studied in the context of bird audio detection. Our best achieved AUC measure on five cross-validations of the development data is 95.5% and 88.1% on the unseen evaluation data.
Miroslav Malik, Sharath Adavanne, Konstantinos Drossos, Tuomas Virtanen, Dasa Ticha, and Roman Jarina, “Stacked Convolutional and Recurrent Neural Networks for Music Emotion Recognition”, in proceedings of the 14th Sound and Music Computing (SMC) conference, Jul. 5–8, Helsinki, Finland, 2017
This paper studies emotion recognition from musical tracks in the 2-dimensional valence-arousal (V-A) emotional space. We propose a method based on convolutional (CNN) and recurrent neural networks (RNN), having significantly fewer parameters compared with the state-of-the-art method for the same task. We utilize one CNN layer followed by two branches of RNNs trained separately for arousal and valence. The method was evaluated using the “MediaEval2015 emotion in music” dataset. We achieved an RMSE of 0.202 for arousal and 0.268 for valence, which is the best result reported on this dataset.
Stylianos Ioannis Mimilakis, Konstantinos Drossos, Tuomas Virtanen, and Gerald Schuller, “Deep Neural Networks for Dynamic Range Compression in Mastering Applications”, in proceedings of the 140th Audio Engineering Society (AES) Convention, Jul. 4–7, Paris, France, 2016
The process of audio mastering often, if not always, includes various audio signal processing techniques such as frequency equalization and dynamic range compression. With respect to the genre and style of the audio content, the parameters of these techniques are controlled by a mastering engineer, in order to process the original audio material. This operation relies on musical and perceptually pleasing facets of the perceived acoustic characteristics, transmitted from the audio material under the mastering process. Modeling such dynamic operations, which involve adaptation regarding the audio content, becomes vital in automated applications since it significantly affects the overall performance. In this work we present a system capable of modelling such behavior, focusing on automatic dynamic range compression. It predicts frequency coefficients that allow the dynamic range compression, via a trained deep neural network, and applies them to the unmastered audio signal provided as input. Both dynamic range compression and the prediction of the corresponding frequency coefficients take place inside the time-frequency domain, using magnitude spectra acquired from a critical band filter bank, similar to humans’ peripheral auditory system. Results from conducted listening tests, incorporating professional music producers and audio mastering engineers, demonstrate on average an equivalent performance compared to professionally mastered audio content. Improvements were also observed when compared to relevant and commercial software.
Konstantinos Drossos, Nikolaos Zormpas, George Giannakopoulos, and Andreas Floros, “Accessible Games for Blind Children, Empowered by Binaural Sound,” in proceedings of the 8th Pervasive Technologies Related to Assistive Environments (PETRA) Conference, Jul. 1–3, Corfu, Greece, 2015
Accessible games have been researched and developed for many years; however, blind people still have very limited access to and knowledge of them. This can pose a serious limitation, especially for blind children, since in recent years electronic games have become one of the most common and widespread means of entertainment and socialization. For our implementation we use binaural technology, which allows the player to hear and navigate the game space by adding localization information to the game sounds. With our implementation and user studies we provide insight on what constitutes an accessible game for blind people as well as a functional game engine for such games. The game engine developed allows the quick development of games for the visually impaired. Our work provides a good starting point for future developments in the field and, as the user studies show, was very well received by the visually impaired children that tried it.
Konstantinos Drossos, Andreas Floros, and Nikolaos Kanellopoulos, “A Loudness-based Adaptive Equalization Technique for Subjectively Improved Sound Reproduction,” in proceedings of the Audio Engineering Society (AES) 136th convention, Apr. 26–29, Berlin, Germany, 2014.
Sound equalization is a common approach for objectively or subjectively defining the reproduction level at specific frequency bands. It is also well-known that the human auditory system demonstrates an inner process of sound-weighting. Due to this, the perceived loudness changes with the frequency and the user-defined sound reproduction gain, resulting in a deviation between the intended and the perceived equalization scheme as the sound level changes. In this work we introduce a novel equalization approach that takes into account the above perceptual loudness effect in order to achieve subjectively constant equalization. A series of listening tests shows that the proposed equalization technique is an efficient and listener-preferred alternative for both professional and home audio reproduction applications.
Konstantinos Drossos, Andreas Floros, Stelios Potirakis, Nikolas-Alexander Tatlas, and Gurkan Tuna, “A socially-intelligent multirobot service team for in-home monitoring,” in proceedings of the 5th IEEE International Conference on Information, Intelligence, Systems and Applications (IISA), Jul. 7–9, Chania, Greece, 2014.
The objective of this study is to develop a socially-intelligent service team comprised of multiple robots with sophisticated sonic interaction capabilities that aims to transparently collaborate towards efficient and robust monitoring by close interaction. In the distributed scenario proposed in this study, the robots share any acoustic data extracted from the environment and act in-sync with the events occurring in their living environment in order to provide potential means for efficient monitoring and decision-making within a typical home enclosure. Although each robot acts as an individual recognizer using a novel emotionally-enriched word recognition system, the final decision is social in nature and is followed by all. Moreover, the social decision stage triggers actions that are algorithmically distributed among the robots' population and enhances the overall approach with the potential advantages of the team work within specific communities through collaboration.
Konstantinos Drossos, Andreas Floros, and Andreas Giannakoulopoulos, “BEADS: A Dataset of Binaural Emotionally Annotated Digital Sounds,” in proceedings of the 5th IEEE International Conference on Information, Intelligence, Systems and Applications (IISA), Jul. 7–9, Chania, Greece, 2014.
Emotion recognition from generalized sounds is an interdisciplinary and emerging field of research. A vital requirement for this kind of investigation is the availability of ground truth datasets. Currently, there are 2 freely available datasets of emotionally annotated sounds, which, however, do not include sound events (SEs) with manifestation of the spatial location of the source. The latter is an inherent natural component of SEs, since all sound sources in real-world conditions are physically located and perceived somewhere in the listener's surrounding space. In this work we present a novel emotionally annotated sounds dataset consisting of 32 SEs that are spatially rendered using appropriate binaural processing. All SEs in the dataset are available in 5 spatial positions corresponding to source/receiver angles equal to 0, 45, 90, 135 and 180 degrees. We have used the IADS dataset as the initial collection of SEs prior to binaural processing. The annotation measures obtained for the novel binaural dataset demonstrate a significant accordance with the existing IADS dataset, while small ratings declinations illustrate a perceptual adaptation imposed by the more realistic SEs spatial representation.
Maximos Kaliakatsos–Papakostas, Andreas Floros, Konstantinos Drossos, Konstantinos Koukoudis, Manolis Kuzalas, and Achileas Kalantzis, “Swarm Lake: A Game of Swarm Intelligence, Human Interaction and Collaborative Music Composition,” in proceedings of the Joint Conference ICMC/SMC 2014, Sep. 14–20, Athens, Greece, 2014.
In this work we aim to combine a game platform with the concept of collaborative music synthesis. We use bio-inspired intelligence for developing a world - the Lake - where multiple tribes of artificial, autonomous agents live within, having survival as their ultimate goal. The tribes exhibit primitive social swarm-based behavior and intelligence, which is used for taking actions that will potentially allow to dominate the game world. Tribes’ populations also demonstrate a number of physical properties that restrict their ability to act illimitably. Multiuser intervention is employed in parallel, affecting the automated decisions and the physical parameters of the tribes, thus infusing the gaming orientation of the application context. Finally, sound synthesis is achieved through a complex mapping scheme established between the events occurring in the Lake and the rhythmic, harmonic and dynamic-range parameters of an advanced, collaborative sound composition engine. This complex mapping scheme allows the production of interesting and complicated sonic patterns that follow the performance evolution in both objective and conceptual levels. The overall synthesis process is controlled by the conductor, a virtual entity that determines the synthesis evolution in a way that is very similar to directing an ensemble performance in real world.
Stylianos Ioannis Mimilakis, Konstantinos Drossos, Andreas Floros, and Dionisios Katerelos, “Automated Tonal Balance Enhancement for Audio Mastering Applications”, in proceedings of the 134th Audio Engineering Society Convention, May 4–7, Rome, Italy, 2013
Modern audio mastering procedures involve the selective enhancement or attenuation of specific frequency bands. The main reason is the tonal enhancement of the original / unmastered audio material. The aforementioned process is mostly based on the musical information and the mode of the audio material. This information can be retrieved from a listening procedure of the original stimuli, or the corresponding musical key notes. The current work presents an adaptive and automated equalization system that performs the aforementioned mastering procedure, based on a novel method of fundamental frequency tracking. In addition, the overall system is evaluated with objective PEAQ analysis and subjective listening tests in real mastering audio conditions.
Konstantinos Drossos, Konstantinos Koukoudis, and Andreas Floros, “Gestural User Interface for Audio Multitrack Real-time Stereo Mixing,” in proceedings of the 8th Conference on Interaction with Sound - Audio Mostly 2013, Sep. 18–20, Piteå, Sweden, 2013
Sound mixing is a well-established task applied (directly or indirectly) in many fields of music and sound production. For example, in the case of classical music orchestras, their conductors perform sound mixing by specifying the reproduction gain of specific groups of musical instruments or of the entire orchestra. Moreover, modern sound artists and performers also employ sound mixing when they compose music or improvise in real-time. In this work a system is presented that incorporates a gestural interface for real-time multitrack sound mixing. The proposed gestural sound mixing control scheme is implemented on an open hardware micro-controller board, using common sensor modules. The gestures employed are as close as possible to the ones used by orchestra conductors. The overall system performance is also evaluated in terms of the achieved user experience through subjective tests.
Konstantinos Drossos, Rigas Kotsakis, Panos Pappas, George Kalliris, and Andreas Floros, “Investigating Auditory Human-Machine Interaction: Analysis and Classification of Sounds Commonly Used by Consumer Devices”, in proceedings of the 134th Audio Engineering Society Convention, May 4–7, Rome, Italy, 2013
Many common consumer devices use a short sound indication for declaring various modes of their functionality, such as the start and the end of their operation. This is likely to result in an intuitive auditory human-machine interaction, imputing a semantic content to the sounds used. In this paper we investigate sound patterns mapped to "Start" and "End" of operation manifestations and explore whether the perception of such semantics is based on users’ prior auditory training or on sound patterns that naturally convey appropriate information. To this end, listening and machine learning tests were conducted. The obtained results indicate a strong relation between acoustic cues and semantics, with no need of prior knowledge for message conveyance.
Konstantinos Drossos, Rigas Kotsakis, George Kalliris, and Andreas Floros, “Sound Events and Emotions: Investigating the Relation of Rhythmic Characteristics and Arousal”, in proceedings of the 4th IEEE International Conference on Information, Intelligence, Systems and Applications (IISA 2013), Jul. 10–12, Piraeus, Greece, 2013
A variety of recent research in Audio Emotion Recognition (AER) reports high performance and retrieval accuracy results. However, in most works music is considered as the original sound content that conveys the identified emotions. Among the music characteristics, rhythm-related acoustic cues are found to represent a fundamental means for conveying emotions. Although music is an important aspect of everyday life, there are numerous non-linguistic and nonmusical sounds surrounding humans, generally defined as sound events (SEs). Despite this enormous impact of SEs on humans, a scarcity of investigations regarding AER from SEs is observed. There are only a few recent investigations concerned with SEs and AER, presenting a semantic connection between the former and the listener's triggered emotion. In this work we analytically investigate the connection of rhythm-related characteristics of a wide range of common SEs with the arousal of the listener using sound events with semantic content. To this end, several feature evaluation and classification tasks are conducted using different ranking and classification algorithms. High accuracy results are obtained, demonstrating a significant relation of SEs rhythmic characteristics to the elicited arousal.
Konstantinos Drossos, Andreas Floros, and Nikolaos Kanellopoulos, “Affective Acoustic Ecology: Towards Emotionally Enhanced Sound Events”, in proceedings of the 7th Conference on Interaction with Sound - Audio Mostly 2012, Sep. 26 – 28, Corfu, Greece, 2012
Sound events can carry multiple kinds of information, related to the sound source and to the ambient environment. However, it is well-known that sound evokes emotions, a fact that is verified by works in the disciplines of Music Emotion Recognition and Music Information Retrieval that focused on the impact of music on emotions. In this work we introduce the concept of affective acoustic ecology that extends the above relation to the general concept of sound events. Towards this aim, we define the sound event as a novel audio structure with multiple components. We further investigate the application of existing emotion models employed for music affective analysis to sonic, non-musical, content. The obtained results indicate that although such application is feasible, no significant trends and classification outcomes are observed that would allow the definition of an analytic relation between the technical characteristics of a sound event waveform and raised emotions.
Konstantinos Drossos, Andreas Floros, Kyriakos Agavanakis, Nikolas-Alexander Tatlas and Nikolaos Kanellopoulos, “Emergency Voice/Stress - level Combined Recognition for Intelligent House Applications”, in proceedings of the 132nd Audio Engineering Society Convention, Apr. 26–29, Budapest, Hungary, 2012
Legacy technologies for word recognition can benefit from emerging affective voice retrieval, potentially leading to intelligent applications for smart houses enhanced with new features. In this work we introduce the implementation of a system capable of reacting to common spoken words, taking into account the estimated vocal stress level, thus allowing the realization of a prioritized, affective aural interaction path. Upon successful word recognition and the corresponding stress level estimation, the system triggers particular affective-prioritized actions, defined within the application scope of an intelligent home environment. Application results show that the established affective interaction path significantly improves the ambient intelligence provided by an affective vocal sensor that can be easily integrated with any sensor-based home monitoring system.
Dionisios Katerelos, Konstantinos Drossos, Anastasios Kokkinos, and Stylianos Ioannis Mimilakis, “iReflectors - Intelligent Reflectors from Composite Materials”, in proceedings of the 6th Greek National Conference Acoustics 2012, Oct. 8–10, Corfu, Greece, 2012
The use of reflectors for optimal sound diffusion is a major issue in Room Acoustics. Up to now, the applied reflectors have been static, with a fixed shape and made of conventional materials. The present work studies the possibility of replacing the conventional reflectors with new ones manufactured from composite materials. The aim is to design flexible “intelligent” reflectors that will adapt their shape depending on the specific acoustical needs of a room. This change is planned to be actuated using embedded shape memory alloy (SMA) wires. The adaptation process will be controlled automatically by an electronic system. In order to control damage initiation and growth within the composite panel, an optical fibre network will be applied.
Elias Kokkinis, Konstantinos Drossos, Nikolas-Alexander Tatlas, Andreas Floros, Alexandros Tsilfidis and Kyriakos Agavanakis, “Smart microphone sensor system platform”, in proceedings of the 132nd Audio Engineering Society Convention, Apr. 26–29, Budapest, Hungary, 2012
A platform for a flexible, smart microphone system using available hardware components is presented. Three subsystems are employed, specifically: (a) a set of digital MEMS microphones, with a one-bit serial output; (b) a preprocessing/digital-to-digital converter; and (c) a CPU/DSP-based embedded system with I2S connectivity. Basic preprocessing functions, such as noise gating and filtering, can be performed in the preprocessing stage, while application-specific algorithms such as word spotting, beam-forming, and reverberation suppression can be handled by the embedded system. Widely used high-level operating systems are supported, including drivers for a number of peripheral devices. Finally, an employment scenario for a wireless home automation speech activated front-end sensor system using the platform is analyzed.
Konstantinos Drossos, Stylianos Ioannis Mimilakis, Andreas Floros, and Nikolaos Kanellopoulos, “Stereo Goes Mobile: Spatial Enhancement for Short-distance Loudspeaker Setups”, in proceedings of the 8th International Conference on Intelligent Information Hiding and Multimedia Signal Processing (IIHMSP), Jul. 18–20, Piraeus, Greece, 2012
Modern mobile, hand-held devices offer enhanced capabilities for video and sound reproduction. Nevertheless, major restrictions imposed by their limited size render them inconvenient for headset-free stereo sound reproduction, since the corresponding short-distance loudspeaker placement physically narrows the perceived stereo sound localization potential. In this work, we aim at evaluating a spatial enhancement technique for small-size mobile devices. This technique extracts the original panning information from an original stereo recording and spatially extends it using appropriate binaural rendering. A sequence of subjective tests performed shows that the derived spatial perceptual impression is significantly improved in all test cases considered, rendering the proposed technique an attractive approach towards headset-free mobile audio reproduction.
Nikolas Grigoriou, Andreas Floros, and Konstantinos Drossos, “Binaural Mixing Using Gestural Control Interaction”, in proceedings of the 5th Conference on Interaction with Sound - Audio Mostly 2010, Sep. 15–17, Piteå, Sweden, 2010
In this work a novel audio binaural mixing platform is presented which employs advanced gestural-based interaction techniques for controlling the mixing parameters. State-of-the-art binaural technology algorithms are used for producing the final two-channel binaural signal. These algorithms are optimized for realtime operation, able to manipulate high-quality audio (typically 24bit / 96kHz) for an arbitrary number of fixed-position or moving sound sources in closed acoustic enclosures. Simple gestural rules are employed, which aim to provide the complete functionality required for the mixing process, using low cost equipment. It is shown that the proposed platform can be efficiently used for general audio mixing / mastering purposes, providing an attractive alternative to legacy hardware control designs and software-based mixing user interfaces.
Panagiotis Vlamos, Andreas Floros, Michail Giannakos, Konstantinos Drossos, “Towards an Interactive e-Learning System Based on Emotion and Affective Cognition”, in proceedings of the International Conference on Information Communication Technologies and Education (ICICTE), Jul. 8–10, Corfu, Greece, 2012, pp 367–376
In order to promote a more dynamic and flexible communication between the learner and the system, we present a structure of a new innovative and interactive e-learning system which implements emotion and level of cognition recognition. The system has as inputs the emotional and cognitive state of the user and re-organises the content and adjusts the flow of the course. Our concept aims to increase the learning efficiency of intelligent tutoring systems by using a combination of characteristics, such as content customization and user’s emotion recognition, and adapting all these features into a learner-centered educational system.
Timothy Mellow, Olga Umnova, Konstantinos Drossos, Keith Holland, Andrew Flewitt and Leo Kärkkäinen, “On The Adsorption - Desorption Relaxation Time Of Carbon In Very Narrow Ducts”, in proceedings of the Acoustics 08 conference, Jun. 29–Jul. 04, Paris, France, 2008
Loudspeakers generally have boxes to prevent rear wave cancellation at low frequencies. However, the stiffness of the air in a small box reduces the diaphragm’s excursion at low frequencies. Hence the box size is generally a compromise between low frequency performance and practicality. Activated carbon has been found to increase the apparent size of a given box through adsorption of the air molecules when the pressure increases and likewise desorption when it decreases. However, the exact viscous effects in the granular structure are difficult to model. Thus it is impossible to determine the high frequency limit due to the natural adsorption/desorption relaxation time in the absence of viscous losses. In this study, a tube model is presented which takes into account viscous and thermal losses with boundary slip together with adsorption. Impedance measurements are performed on an array of 12 million holes, each 2 micrometers in diameter, etched in a 0.25 mm thick silicon wafer so that the viscous and thermal losses can be verified against the model without adsorption. Impedance measurements are then performed on an array of holes coated with graphite in order to create an activated carbon-like structure, thus enabling the adsorption/desorption relaxation time to be evaluated.