H. Xie, O. Räsänen, K. Drossos, and T. Virtanen, "Unsupervised Audio-Caption Aligning Learns Correspondences between Individual Sound Events and Textual Phrases," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 22-27, Singapore, Singapore, 2022
We investigate unsupervised learning of correspondences between sound events and textual phrases through aligning audio clips with textual captions describing the content of a whole audio clip. We align originally unaligned and unannotated audio clips and their captions by scoring the similarities between audio frames and words, as encoded by modality-specific encoders and using a ranking-loss criterion to optimize the model. After training, we obtain clip-caption similarity by averaging frame-word similarities and estimate event-phrase correspondences by calculating frame-phrase similarities. We evaluate the method with two cross-modal tasks: audio-caption retrieval, and phrase-based sound event detection (SED). Experimental results show that the proposed method can globally associate audio clips with captions as well as locally learn correspondences between individual sound events and textual phrases in an unsupervised manner.
E. Rovithis, N. Moustakas, K. Vogklis, K. Drossos, and A. Floros, "Design Recommendations for a Collaborative Game of Bird Call Recognition Based on Internet of Sound Practices," in Journal of the Audio Engineering Society, vo. 69 (12), pp. 956-966, 2021, doi:
Citizen Science aims to engage people in research activities on important issues related to their well-being. Smart Cities aim to provide them with services that improve the quality of their life. Both concepts have seen significant growth in the last years and can be further enhanced by combining their purposes with Internet of Things technologies that allow for dynamic and large-scale communication and interaction. However, exciting and retaining the interest of participants is a key factor for such initiatives. In this paper we suggest that engagement in Citizen Science projects applied on Smart Cities infrastructure can be enhanced through contextual and structural game elements realized through augmented audio interactive mechanisms. Our interdisciplinary framework is described through the paradigm of a collaborative bird call recognition game, in which users collect and submit audio data that are then classified and used for augmenting physical space. We discuss the Playful Learning, Internet of Audio Things, and Bird Monitoring principles that shaped the design of our paradigm and analyze the design issues of its potential technical implementation.
A. Ferraro, X. Favory, K. Drossos, Y. Kim and D. Bogdanov, "Enriched Music Representations with Multiple Cross-modal Contrastive Learning," in IEEE Signal Processing Letters, doi: 10.1109/LSP.2021.3071082.
Modeling various aspects that make a music piece unique is a challenging task, requiring the combination of multiple sources of information. Deep learning is commonly used to obtain representations using various sources of information, such as the audio, interactions between users and songs, or associated genre metadata. Recently, contrastive learning has led to representations that generalize better compared to traditional supervised methods. In this paper, we present a novel approach that combines multiple types of information related to music using cross-modal contrastive learning, allowing us to learn an audio feature from heterogeneous data simultaneously. We align the latent representations obtained from playlists-track interactions, genre metadata, and the tracks' audio, by maximizing the agreement between these modality representations using a contrastive loss. We evaluate our approach in three tasks, namely, genre classification, playlist continuation and automatic tagging. We compare the performances with a baseline audio-based CNN trained to predict these modalities. We also study the importance of including multiple sources of information when training our embedding model. The results suggest that the proposed method outperforms the baseline in all the three downstream tasks and achieves comparable performance to the state-of-the-art.
P. Sudarsanam, A. Politis, and K. Drossos, "Assessment of Self-Attention on Learned Features For Sound Event Localization and Detection," in Proceedings of the 6th Detection and Classification of Acoustic Scenes and Events (DCASE) Workshop, pp. 100-104, Barcelona, Spain, 2021
Joint sound event localization and detection (SELD) is an emerging audio signal processing task adding spatial dimensions to acoustic scene analysis and sound event detection. A popular approach to modeling SELD jointly is using convolutional recurrent neural network (CRNN) models, where CNNs learn high-level features from multi-channel audio input and the RNNs learn temporal relationships from these high-level features. However, RNNs have some drawbacks, such as a limited capability to model long temporal dependencies and slow training and inference times due to their sequential processing nature. Recently, a few SELD studies used multi-head self-attention (MHSA), among other innovations in their models. MHSA and the related transformer networks have shown state-of-the-art performance in various domains. While they can model long temporal dependencies, they can also be parallelized efficiently. In this paper, we study in detail the effect of MHSA on the SELD task. Specifically, we examined the effects of replacing the RNN blocks with self-attention layers. We studied the influence of stacking multiple self-attention blocks, using multiple attention heads in each self-attention block, and the effect of position embeddings and layer normalization. Evaluation on the DCASE 2021 SELD (task 3) development data set shows a significant improvement in all employed metrics compared to the baseline CRNN accompanying the task.
J. Berg and K. Drossos, "Continual Learning for Automated Audio Captioning Using The Learning Without Forgetting Approach," in Proceedings of the 6th Detection and Classification of Acoustic Scenes and Events (DCASE) Workshop, pp. 140-144, Barcelona, Spain, 2021
Automated audio captioning (AAC) is the task of automatically creating textual descriptions (i.e. captions) for the contents of a general audio signal. Most AAC methods are using existing datasets to optimize and/or evaluate upon. Given the limited information held by the AAC datasets, it is very likely that AAC methods learn only the information contained in the utilized datasets. In this paper we present a first approach for continuously adapting an AAC method to new information, using a continual learning method. In our scenario, a pre-optimized AAC method is used for some unseen general audio signals and can update its parameters in order to adapt to the new information, given a new reference caption. We evaluate our method using a freely available, pre-optimized AAC method and two freely available AAC datasets. We compare our proposed method with three scenarios, two of training on one of the datasets and evaluating on the other and a third of training on one dataset and fine-tuning on the other. Obtained results show that our method achieves a good balance between distilling new knowledge and not forgetting the previous one.
B. Weck, X. Favory, K. Drossos, and X. Serra, "Evaluating Off-the-Shelf Machine Listening and Natural Language Models for Automated Audio Captioning," in Proceedings of the 6th Detection and Classification of Acoustic Scenes and Events (DCASE) Workshop, pp. 60-64, Barcelona, Spain, 2021
Automated audio captioning (AAC) is the task of automatically generating textual descriptions for general audio signals. A captioning system has to identify various information from the input signal and express it with natural language. Existing works mainly focus on investigating new methods and try to improve their performance measured on existing datasets. Having attracted attention only recently, very few works on AAC study the performance of existing pre-trained audio and natural language processing resources. In this paper, we evaluate the performance of off-the-shelf models with a Transformer-based captioning approach. We utilize the freely available Clotho dataset to compare four different pre-trained machine listening models, four word embedding models, and their combinations in many different settings. Our evaluation suggests that YAMNet combined with BERT embeddings produces the best captions. Moreover, in general, fine-tuning pre-trained word embeddings can lead to better performance. Finally, we show that sequences of audio embeddings can be processed using a Transformer encoder to produce higher-quality captions.
A. Triantafyllopoulos, M. Milling, K. Drossos, and B. - W. Schuller, "Fairness and underspecification in acoustic scene classification: The case for disaggregated evaluations," in Proceedings of the 6th Detection and Classification of Acoustic Scenes and Events (DCASE) Workshop, pp. 70-74, Barcelona, Spain, 2021
Underspecification and fairness in machine learning (ML) applications have recently become two prominent issues in the ML community. Acoustic scene classification (ASC) applications have so far remained unaffected by this discussion, but are now becoming increasingly used in real-world systems where fairness and reliability are critical aspects. In this work, we argue for the need of a more holistic evaluation process for ASC models through disaggregated evaluations. This entails taking into account performance differences across several factors, such as city, location, and recording device. Although these factors play a well-understood role in the performance of ASC models, most works report single evaluation metrics taking into account all different strata of a particular dataset. We argue that metrics computed on specific sub-populations of the underlying data contain valuable information about the expected real-world behaviour of proposed systems, and their reporting could improve the transparency and trustability of such systems. We demonstrate the effectiveness of the proposed evaluation process in uncovering underspecification and fairness problems exhibited by several standard ML architectures when trained on two widely-used ASC datasets. Our evaluation shows that all examined architectures exhibit large biases across all factors taken into consideration, and in particular with respect to the recording location. Additionally, different architectures exhibit different biases even though they are trained with the same experimental configurations.
X. Favory, K. Drossos, T. Virtanen, and X. Serra, "Learning Contextual Tag Embeddings for Cross-Modal Alignment of Audio and Tags," in proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Jun. 6-11, Torono, Canada, 2021
Self-supervised audio representation learning offers an attractive alternative for obtaining generic audio embeddings, capable to be employed into various downstream tasks. Published approaches that consider both audio and words/tags associated with audio do not employ text processing models that are capable to generalize to tags unknown during training. In this work we propose a method for learning audio representations using an audio autoencoder (AAE), a general word embeddings model (WEM), and a multi-head self-attention (MHA) mechanism. MHA attends on the output of the WEM, providing a contextualized representation of the tags associated with the audio, and we align the output of MHA with the output of the encoder of AAE using a contrastive loss. We jointly optimize AAE and MHA and we evaluate the audio representations (i.e. the output of the encoder of AAE) by utilizing them in three different downstream tasks, namely sound, music genre, and music instrument classification. Our results show that employing multi-head self-attention with multiple heads in the tag-based network can induce better learned audio representations.
B. - W. Schuller, T. Virtanen, M. Riveiro, G. Rizos, J. Han, A. Mesaros, and K. Drossos, "Towards Sonification in Multimodal and User-friendly Explainable Artificial Intelligence," in Proceedings of the 23rd ACM International Conference on Multimodal Interaction, Oct 18-22, Montreal, Canada, 2021
We are largely used to hearing explanations. For example, if someone thinks you are sad today, they might reply to your “why?” with “because you were so Hmmmmm-mmm-mmm”. Today’s Artificial Intelligence (AI), however, is – if at all – largely providing explanations of decisions in a visual or textual manner. While such approaches are good for communication via visual media such as in research papers or screens of intelligent devices, they may not always be the best way to explain; especially when the end user is not an expert. In particular, when the AI’s task is about Audio Intelligence, visual explanations appear less intuitive than audible, sonified ones. Sonification has also great potential for explainable AI (XAI) in systems that deal with non-audio data – for example, because it does not require visual contact or active attention of a user. Hence, sonified explanations of AI decisions face a challenging, yet highly promising and pioneering task. That involves incorporating innovative XAI algorithms to allow pointing back at the learning data responsible for decisions made by an AI, and to include decomposition of the data to identify salient aspects. It further aims to identify the components of the preprocessing, feature representation, and learnt attention patterns that are responsible for the decisions. Finally, it targets decision-making at the model-level, to provide a holistic explanation of the chain of
processing in typical pattern recognition problems from end-to-end. Sonified AI explanations will need to unite methods for sonification of the identified aspects that benefit decisions, decomposition and recomposition of audio to sonify which parts in the audio were responsible for the decision, and rendering attention patterns and salient feature representations audible. Benchmarking sonified XAI is challenging, as it will require a comparison against a backdrop of existing, state-of-the-art visual and textual alternatives, as well as synergistic complementation of all modalities in user evaluations. Sonified AI explanations will need to target different user groups to allow personalisation of the sonification experience for different user needs, to lead to a major breakthrough in comprehensibility of AI via hearing how decisions are made, hence supporting tomorrow’s humane AI’s trustability. Here, we introduce and motivate the general idea, and provide accompanying considerations including milestones of realisation of sonifed XAI and foreseeable risks.
A. Tran, K. Drossos, and T. Virtanen, "WaveTransformer: An Architecture for Audio Captioning Based on Learning Temporal and Time-Frequency Information," in proceedings of 29th European Signal Processing Conference (EUSIPCO), Aug. 23-27, Dublin, Ireland, 2021
Automated audio captioning (AAC) is a novel task, where a method takes as an input an audio sample and outputs a textual description (i.e. a caption) of its contents. Most AAC methods are adapted from image captioning or machine translation fields. In this work, we present a novel AAC method, explicitly focused on the exploitation of the temporal and time-frequency patterns in audio. We employ three learnable processes for audio encoding, two for extracting the local and temporal information, and one to merge the output of the previous two processes. To generate the caption, we employ the widely used Transformer decoder. We assess our method utilizing the freely available splits of the Clotho dataset. Our results increase previously reported highest SPIDEr to 17.3, from 16.2.
Konstantinos Drossos, Samuel Lipping, Tuomas Virtanen, “Clotho: An Audio Captioning Dataset,” in proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 04–08, Barcelona, Spain, 2020
Audio captioning is the novel task of general audio content description using free text. It is an intermodal translation task (not speech-to-text), where a system accepts as an input an audio signal and outputs the textual description (i.e. the caption) of that signal. In this paper we present Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds duration and 24 905 captions of eight to 20 words length, and a baseline method to provide initial results. Clotho is built with focus on audio content and caption diversity, and the splits of the data are not hampering the training or evaluation of methods. All sounds are from the Freesound platform, and captions are crowdsourced using Amazon Mechanical Turk and annotators from English speaking countries. Unique words, named entities, and speech transcription are removed with post-processing. Clotho is freely available online (https://zenodo.org/record/3490684).
Xavier Favory, Konstantinos Drossos, Tuomas Virtanen, and Xavier Serra, "COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations," in International Conference on Machine Learning (ICML), Workshop on Self-supervised learning in Audio and Speech, Jul. 17, virtually held, 2020
Audio representation learning based on deep neural networks (DNNs) emerged as an alternative approach to hand-crafted features. For achieving high performance, DNNs often need a large amount of annotated data which can be difficult and costly to obtain. In this paper, we propose a method for learning audio representations, aligning the learned latent representations of audio and associated tags. Aligning is done by maximizing the agreement of the latent representations of audio and tags, using a contrastive loss. The result is an audio embedding model which reflects acoustic and semantic characteristics of sounds. We evaluate the quality of our embedding model, measuring its performance as a feature extractor on three different tasks (namely, sound event recognition, and music genre and musical instrument classification), and investigate what type of characteristics the model captures. Our results are promising, sometimes in par with the state-of-the-art in the considered tasks and the embeddings produced with our method are well correlated with some acoustic descriptors.
Pyry Pyykkönen, Styliannos I. Mimilakis, Konstantinos Drossos, and Tuomas Virtanen, "Depthwise Separable Convolutions Versus Recurrent Neural Networks for Monaural Singing Voice Separation," in proceedings of the 22nd IEEE International Workshop on Multimedia Signal Processing (MMSP), Sep. 21-24, Tampere, Finland, 2020
Recent approaches for music source separation are almost exclusively based on deep neural networks, mostly employing recurrent neural networks (RNNs). Although RNNs are in many cases superior than other types of deep neural networks for sequence processing, they are known to have specific difficulties in training and parallelization, especially for the typically long sequences encountered in music source separation. In this paper we present a use-case of replacing RNNs with depth-wise separable (DWS) convolutions, which are a lightweight and faster variant of the typical convolutions. We focus on singing voice separation, employing an RNN architecture, and we replace the RNNs with DWS convolutions (DWS-CNNs). We conduct an ablation study and examine the effect of the number of channels and layers of DWS-CNNs on the source separation performance, by utilizing the standard metrics of signal-to-artifacts, signal-to-interference, and signal-to-distortion ratio. Our results show that by replacing RNNs with DWS-CNNs yields an improvement of 1.20, 0.06, 0.37 dB, respectively, while using only 20.57% of the amount of parameters of the RNN architecture.
Niccoló Nicodemo, Gaurav Naithani, Konstantinos Drossos, Tuomas Virtanen, Roberto Saletti, "Memory Requirement Reduction of Deep Neural Networks Using Low-bit Quantization of Parameters," in proceedings of the 28th European Signal Processing Conference (EUSIPCO), Aug. 24 - 28, Amsterdam, Netherlands, 2020
Effective employment of deep neural networks (DNNs) in mobile devices and embedded systems is hampered by requirements for memory and computational power. This paper presents a non-uniform quantization approach which allows for dynamic quantization of DNN parameters for different layers and within the same layer. A virtual bit shift (VBS) scheme is also proposed to improve the accuracy of the proposed scheme. Our method reduces the memory requirements, preserving the performance of the network. The performance of our method is validated in a speech enhancement application, where a fully connected DNN is used to predict the clean speech spectrum from the input noisy speech spectrum. A DNN is optimized and its memory footprint and performance are evaluated using the short-time objective intelligibility, STOI, metric. The application of the low-bit quantization allows a 50% reduction of the DNN memory footprint while the STOI performance drops only by 2.7%.
Emre Çakır, Konstantinos Drossos, Tuomas Virtanen, "Multi-task Regularization Based on Infrequent Classes for Audio Captioning," in proceedings of the International Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), Nov. 2-3, Tokyo, Japan (full virtual), 2020
Audio captioning is a multi-modal task, focusing on using natural language for describing the contents of general audio. Most audio captioning methods are based on deep neural networks, employing an encoder-decoder scheme and a dataset with audio clips and corresponding natural language descriptions (i.e. captions). A significant challenge for audio captioning is the distribution of words in the captions: some words are very frequent but acoustically non-informative, i.e. the function words (e.g. "a", "the"), and other words are infrequent but informative, i.e. the content words (e.g. adjectives, nouns). In this paper we propose two methods to mitigate this class imbalance problem. First, in an autoencoder setting for audio captioning, we weigh each word's contribution to the training loss inversely proportional to its number of occurrences in the whole dataset. Secondly, in addition to multi-class, word-level audio captioning task, we define a multi-label side task based on clip-level content word detection by training a separate decoder. We use the loss from the second task to regularize the jointly trained encoder for the audio captioning task. We evaluate our method using Clotho, a recently published, wide-scale audio captioning dataset, and our results show an increase of 37% relative improvement with SPIDEr metric over the baseline method.
Antonio J. Muñoz-Montoro, Julio J. Carabias-Orti, Archontis Politis, and Konstantinos Drossos, "Multichannel Singing Voice Separation by Deep Neural Network Informed DOA Constrained CNMF," in proceedings of the 22nd IEEE International Workshop on Multimedia Signal Processing (MMSP), Sep. 21-24, Tampere, Finland, 2020
This work addresses the problem of multichannel source separation combining two powerful approaches, multichannel spectral factorization with recent monophonic deep-learning (DL) based spectrum inference. Individual source spectra at different channels are estimated with a Masker-Denoiser Twin Network (MaD TwinNet), able to model long-term temporal patterns of a musical piece. The monophonic source spectrograms are used within a spatial covariance mixing model based on Complex Non-Negative Matrix Factorization (CNMF) that predicts the spatial characteristics of each source. The proposed framework is evaluated on the task of singing voice separation with a large multichannel dataset. Experimental results show that our joint DL+CNMF method outperforms both the individual monophonic DL-based separation and the multichannel CNMF baseline methods.
Yanxiong Li, Mingle Liu, Konstantinos Drossos, Tuomas Virtanen, “Sound event detection via dilated convolutional recurrent neural networks,” in proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 04–08, Barcelona, Spain, 2020
Convolutional recurrent neural networks (CRNNs) have achieved state-of-the-art performance for sound event detection (SED). In this paper, we propose to use a dilated CRNN, namely a CRNN with a dilated convolutional kernel, as the classifier for the task of SED. We investigate the effectiveness of dilation operations which provide a CRNN with expanded receptive fields to capture long temporal context without increasing the amount of CRNN's parameters. Compared to the classifier of the baseline CRNN, the classifier of the dilated CRNN obtains a maximum increase of 1.9%, 6.3% and 2.5% at F1 score and a maximum decrease of 1.7%, 4.1% and 3.9% at error rate (ER), on the publicly available audio corpora of the TUT-SED Synthetic 2016, the TUT Sound Event 2016 and the TUT Sound Event 2017, respectively.
Konstantinos Drossos, Stylianos I. Mimilakis, Shayan Gharib, Yanxiong Li, Tuomas Virtanen “Sound Event Detection with Depthwise Separable and Dilated Convolutions,” in proceedings of the IEEE World Congress on Computational Intelligence/International Joint Conference on Neural Networks (WCCI/IJCNN), Jul. 19–24, Glasgow, Scotland, 2020
State-of-the-art sound event detection (SED) methods usually employ a series of convolutional neural networks (CNNs) to extract useful features from the input audio signal, and then recurrent neural networks (RNNs) to model longer temporal context in the extracted features. The number of the channels of the CNNs and size of the weight matrices of the RNNs have a direct effect on the total amount of parameters of the SED method, which is to a couple of millions. Additionally, the usually long sequences that are used as an input to an SED method along with the employment of an RNN, introduce implications like increased training time, difficulty at gradient flow, and impeding the parallelization of the SED method. To tackle all these problems, we propose the replacement of the CNNs with depthwise separable convolutions and the replacement of the RNNs with dilated convolutions. We compare the proposed method to a baseline convolutional neural network on a SED task, and achieve a reduction of the amount of parameters by 85% and average training time per epoch by 78%, and an increase the average frame-wise F1 score and reduction of the average error rate by 4.6% and 3.8%, respectively.
Khoa Nguyen, Konstantinos Drossos, Tuomas Virtanen, "Temporal Sub-sampling of Audio Feature Sequences for Automated Audio Captioning," in proceedings of the International Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), Nov. 2-3, Tokyo, Japan (full virtual), 2020
Audio captioning is the task of automatically creating a textual description for the contents of a general audio signal. Typical audio captioning methods rely on deep neural networks (DNNs), where the target of the DNN is to map the input audio sequence to an output sequence of words, i.e. the caption. Though, the length of the textual description is considerably less than the length of the audio signal, for example 10 words versus some thousands of audio feature vectors. This clearly indicates that an output word corresponds to multiple input feature vectors. In this work we present an approach that focuses on explicitly taking advantage of this difference of lengths between sequences, by applying a temporal sub-sampling to the audio input sequence. We employ a sequence-to-sequence method, which uses a fixed-length vector as an output from the encoder, and we apply temporal sub-sampling between the RNNs of the encoder. We evaluate the benefit of our approach by employing the freely available dataset Clotho and we evaluate the impact of different factors of temporal sub-sampling. Our results show an improvement to all considered metrics.
Stylianos I. Mimilakis, Konstantinos Drossos, Gerald Schuller, "Unsupervised Interpretable Representation Learning for Singing Voice Separation," in proceedings of the 28th European Signal Processing Conference (EUSIPCO), Jan. 18 - 22 (2021), Amsterdam, Netherlands, 2020
In this work, we present a method for learning interpretable music signal representations directly from waveform signals. Our method can be trained using unsupervised objectives and relies on the denoising auto-encoder model that uses a simple sinusoidal model as decoding functions to reconstruct the singing voice. To demonstrate the benefits of our method, we employ the obtained representations to the task of informed singing voice separation via binary masking, and measure the obtained separation quality by means of scale-invariant signal to distortion ratio. Our findings suggest that our method is capable of learning meaningful representations for singing voice separation, while preserving conveniences of the the short-time Fourier transform like non-negativity, smoothness, and reconstruction subject to time-frequency masking, that are desired in audio and music source separation.
Stylianos Ioannis Mimilakis, Konstantinos Drossos, Estefanía Cano, and Geralrd Schuller, “Examining the Mapping Functions of Denoising Autoencoders in Music Source Separation,” in IEEE/ACM Transaction on Audio, Speech, and Language Processing (TASLP), vol. 28, pp 262-278, 2019.
The goal of this work is to investigate what singing voice separation approaches based on neural networks learn from the data. We examine the mapping functions of neural networks based on the denoising autoencoder (DAE) model that are conditioned on the mixture magnitude spectra. To approximate the mapping functions, we propose an algorithm inspired by the knowledge distillation, denoted the neural couplings algorithm (NCA). The NCA yields a matrix that expresses the mapping of the mixture to the target source magnitude information. Using the NCA, we examine the mapping functions of three fundamental DAE-based models in music source separation; one with single-layer encoder and decoder, one with multi-layer encoder and single-layer decoder, and one using skip-filtering connections (SF) with a single-layer encoding and decoding. We first train these models with realistic data to estimate the singing voice magnitude spectra from the corresponding mixture. We then use the optimized models and test spectral data as input to the NCA. Our experimental findings show that approaches based on the DAE model learn scalar filtering operators, exhibiting a predominant diagonal structure in their corresponding mapping functions, limiting the exploitation of inter-frequency structure of music data. In contrast, skip-filtering connections are shown to assist the DAE model in learning filtering operators that exploit richer inter-frequency structures.
Samuel Lipping, Konstantinos Drossos, and Tuomas Virtanen, “Crowdsourcing a Dataset of Audio Captions,” in proceedings of the International Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), Oct. 26–27, New York, NY, U.S.A., 2019
Audio captioning is a novel field of multi-modal translation and it is the task of creating a textual description of the content of an audio signal (e.g. "people talking in a big room"). The creation of a dataset for this task requires a considerable amount of work, rendering the crowdsourcing a very attractive option. In this paper we present a three steps based framework for crowdsourcing an audio captioning dataset, based on concepts and practises followed for the creation of widely used image captioning and machine translations datasets. During the first step initial captions are gathered. A grammatically corrected and/or rephrased version of each initial caption is obtained in second step. Finally, the initial and edited captions are rated, keeping the top ones for the produced dataset. We objectively evaluate the impact of our framework during the process of creating an audio captioning dataset, in terms of diversity and amount of typographical errors in the obtained captions. The obtained results show that the resulting dataset has less typographical errors than the initial captions, and on average each sound in the produced dataset has captions with a Jaccard similarity of 0.24, roughly equivalent to two ten-word captions having in common four words with the same root, indicating that the captions are dissimilar while they still contain some of the same information.
Konstantinos Drossos, Shayan Gharib, Paul Magron, and Tuomas Virtanen, “Language Modelling for Sound Event Detection with Teacher Forcing and Scheduled Sampling,” in proceedings of the International Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), Oct. 26–27, New York, NY, U.S.A., 2019
A sound event detection (SED) method typically takes as an input a sequence of audio frames and predicts the activities of sound events in each frame. In real-life recordings, the sound events exhibit some temporal structure: for instance, a "car horn" will likely be followed by a "car passing by". While this temporal structure is widely exploited in sequence prediction tasks (e.g., in machine translation), where language models (LM) are exploited, it is not satisfactorily modeled in SED. In this work we propose a method which allows a recurrent neural network (RNN) to learn an LM for the SED task. The method conditions the input of the RNN with the activities of classes at the previous time step. We evaluate our method using F1 score and error rate (ER) over three different and publicly available datasets; the TUT-SED Synthetic 2016 and the TUT Sound Events 2016 and 2017 datasets. The obtained results show an increase of 6% and 3% at the F1 (higher is better) and a decrease of 3% and 2% at ER (lower is better) for the TUT Sound Events 2016 and 2017 datasets, respectively, when using our method. On the contrary, with our method there is a decrease of 10% at F1 score and an increase of 11% at ER for the TUT-SED Synthetic 2016 dataset.
Konstantinos Drossos, Paul Magron, and Tuomas Virtanen, “Unsupervised Adversarial Domain Adaptation Based On The Wasserstein Distance For Acoustic Scene Classification,” accepted for publication at the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Oct. 20–23, N. Paltz, NY, U.S.A., 2019
A challenging problem in deep learning-based machine listening field is the degradation of the performance when using data from unseen conditions. In this paper we focus on the acoustic scene classification (ASC) task and propose an adversarial deep learning method to allow adapting an acoustic scene classification system to deal with a new acoustic channel resulting from data captured with a different recording device. We build upon the theoretical model of HΔH-distance and previous adversarial discriminative deep learning method for ASC unsupervised domain adaptation, and we present an adversarial training based method using the Wasserstein distance. We improve the state-of-the-art mean accuracy on the data from the unseen conditions from 32% to 45%, using the TUT Acoustic Scenes dataset.
Stylianos Ioannis Mimilakis, Estafanía Cano, Derry Fitzgerald, Konstantinos Drossos, and Gerald Schuller, “Examining the Perceptual Effect of Alternative Objective Functions for Deep Learning Based Music Source Separation,” in proceedings of the IEEE Asilomar Conference on Signals, Systems, and Computers, Oct. 28–31, Pacific Grove, CA, U.S.A., 2018
In this study, we examine the effect of various objective functions used to optimize the recently proposed deep learning architecture for singing voice separation MaD - Masker and Denoiser. The parameters of the MaD architecture are optimized using an objective function that contains a reconstruction criterion between predicted and true magnitude spectra of the singing voice, and a regularization term. We examine various reconstruction criteria such as the generalized Kullback-Leibler, mean squared error, and noise to mask ratio. We also explore recently proposed, for optimizing MaD, regularization terms such as sparsity and TwinNetwork regularization. Results from both objective assessment and listening tests suggest that the TwinNetwork regularization results in improved singing voice separation quality.
Konstantinos Drossos, Paul Magron, Stylianos Ioannis Mimilakis, and Tuomas Virtanen, “Harmonic-Percussive Source Separation with Deep Neural Networks and Phase Recovery,” in proceedings of the 16th International Workshop on Acoustic Signal Enhancement (IWAENC), Sep. 17–20, Tokyo, Japan, 2018
Harmonic/percussive source separation (HPSS) consists in separating the pitched instruments from the percussive parts in a music mixture. In this paper, we propose to apply the recently introduced Masker-Denoiser with twin networks (MaD TwinNet) system to this task. MaD TwinNet is a deep learning architecture that has reached state-of-the-art results in monaural singing voice separation. Herein, we propose to apply it to HPSS by using it to estimate the magnitude spectrogram of the percussive source. Then, we retrieve the complex-valued short-time Fourier transform of the sources by means of a phase recovery algorithm, which minimizes the reconstruction error and enforces the phase of the harmonic part to follow a sinusoidal phase model. Experiments conducted on realistic music mixtures show that this novel separation system outperforms the previous state-of-the art kernel additive model approach.
Konstantinos Drossos, Stylianos Ioannis Mimilakis, Dmitry Serdyuk, Gerald Schuller, Tuomas Virtanen, and Yoshua Bengio, “MaD TwinNet: Masker-Denoiser Architecture with Twin Networks for Monaural Sound Source Separation,” in proceedings of the IEEE World Congress on Computational Intelligence/International Joint Conference on Neural Networks (WCCI/IJCNN), Jul. 8–13, Rio de Janeiro, Brazil, 2018
Monaural singing voice separation task focuses on the prediction of the singing voice from a single channel music mixture signal. Current state of the art (SOTA) results in monaural singing voice separation are obtained with deep learning based methods. In this work we present a novel recurrent neural approach that learns long-term temporal patterns and structures of a musical piece. We build upon the recently proposed Masker-Denoiser (MaD) architecture and we enhance it with the Twin Networks, a technique to regularize a recurrent generative network using a backward running copy of the network. We evaluate our method using the Demixing Secret Dataset and we obtain an increment to signal-to-distortion ratio (SDR) of 0.37 dB and to signal-to-interference ratio (SIR) of 0.23 dB, compared to previous SOTA results.
Stylianos Ioannis Mimilakis, Konstantinos Drossos, João F. Santos, Gerald Schuller, Tuomas Virtanen, and Yoshua Bengio, “Monaural Singing Voice Separation with Skip-Filtering Connections and Recurrent Inference of Time- Frequency Mask,” in proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 15–20, Calgary, Canada, 2018
Singing voice separation based on deep learning relies on the usage of time-frequency masking. In many cases the masking process is not a learnable function or is not encapsulated into the deep learning optimization. Consequently, most of the existing methods rely on a post processing step using the generalized Wiener filtering. This work proposes a method that learns and optimizes (during training) a source-dependent mask and does not need the aforementioned post processing step. We introduce a recurrent inference algorithm, a sparse transformation step to improve the mask generation process, and a learned denoising filter. Obtained results show an increase of 0.49 dB for the signal to distortion ratio and 0.30 dB for the signal to interference ratio, compared to previous state-of-the-art approaches for monaural singing voice separation.
Paul Magron, Konstantinos Drossos, Stylianos Ioannis Mimilakis, and Tuomas Virtanen, “Reducing Interference with Phase Recovery in DNN-based Monaural Singing Voice Separation,” in proceedings of the INTERSPEECH 2018, Sep. 2–6, Hyderabad, India, 2018
State-of-the-art methods for monaural singing voice separation consist in estimating the magnitude spectrum of the voice in the short-time Fourier transform (STFT) domain by means of deep neural networks (DNNs). The resulting magnitude estimate is then combined with the mixture's phase to retrieve the complex-valued STFT of the voice, which is further synthesized into a time-domain signal. However, when the sources overlap in time and frequency, the STFT phase of the voice differs from the mixture's phase, which results in interference and artifacts in the estimated signals. In this paper, we investigate on recent phase recovery algorithms that tackle this issue and can further enhance the separation quality. These algorithms exploit phase constraints that originate from a sinusoidal model or from consistency, a property that is a direct consequence of the STFT redundancy. Experiments conducted on real music songs show that those algorithms are efficient for reducing interference in the estimated voice compared to the baseline approach.
Shayan Gharib, Konstantinos Drossos, Emre Çakir, Dmitry Serdyuk, and Tuomas Virtanen, “Unsupervised adversarial domain adaptation for acoustic scene classification,” in proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE), Non. 19–20, Surrey, U.K., 2018
A general problem in acoustic scene classification task is the mismatched conditions between training and testing data, which significantly reduces the performance of the developed methods on classification accuracy. As a countermeasure, we present the first method of unsupervised adversarial domain adaptation for acoustic scene classification. We employ a model pre-trained on data from one set of conditions and by using data from other set of conditions, we adapt the model in order that its output cannot be used for classifying the set of conditions that input data belong to. We use a freely available dataset from the DCASE 2018 challenge Task 1, subtask B, that contains data from mismatched recording devices. We consider the scenario where the annotations are available for the data recorded from one device, but not for the rest. Our results show that with our model agnostic method we can achieve ∼10% increase at the accuracy on an unseen and unlabeled dataset, while keeping almost the same performance on the labeled dataset.
Stylianos Ioannis Mimilakis, Konstantinos Drossos, Tuomas Virtanen, and Gerald Schuller, “A Recurrent Encoder-Decoder Approach with Skip-Filtering Connections for Monaural Singing Voice Separation, ” in proceedings of the 27th IEEE International Workshop on Machine Learning for Signal Processing (MLSP), Sep. 25–28, Tokyo, Japan, 2017
The objective of deep learning methods based on encoder-decoder architectures for music source separation is to approximate either ideal time-frequency masks or spectral representations of the target music source(s). The spectral representations are then used to derive time-frequency masks. In this work we introduce a method to directly learn time-frequency masks from an observed mixture magnitude spectrum. We employ recurrent neural networks and train them using prior knowledge only for the magnitude spectrum of the target source. To assess the performance of the proposed method, we focus on the task of singing voice separation. The results from an objective evaluation show that our proposed method provides comparable results to deep learning based methods which operate over complicated signal representations. Compared to previous methods that approximate time-frequency masks, our method has increased performance of signal to distortion ratio by an average of 3.8 dB.
Konstantinos Drossos, Sharath Adavanne, and Tuomas Virtanen, “Automated Audio Captioning with Recurrent Neural Networks,” in proceedings of the 11th IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Oct. 15–18, New Paltz, N.Y. U.S.A., 2017.
We present the first approach to automated audio captioning. We employ an encoder-decoder scheme with an alignment model in between. The input to the encoder is a sequence of log mel-band energies calculated from an audio file, while the output is a sequence of words, i.e. a caption. The encoder is a multi-layered, bi-directional gated recurrent unit (GRU) and the decoder a multi-layered GRU with a classification layer connected to the last GRU of the decoder. The classification layer and the alignment model are fully connected layers with shared weights between timesteps. The proposed method is evaluated using data drawn from a commercial sound effects library, ProSound Effects. The resulting captions were rated through metrics utilized in machine translation and image captioning fields. Results from metrics show that the proposed method can predict words appearing in the original caption, but not always correctly ordered.
Konstantinos Drossos, Stylianos Ioannis Mimilakis, Andreas Floros, Tuomas Virtanen, and Gerald Schuller, “Close Miking Empirical Practice Verification: A Source Separation Approach”, in proceedings of the 142nd Audio Engineering Society (AES) Convention, May 20–23, Berlin, Germany, 2017
Close miking represents a widely employed practice of placing a microphone very near to the sound source in order to capture more direct sound and minimize any pickup of ambient sound, including other, concurrently active sources. It is used by the audio engineering community for decades for audio recording, based on a number of empirical rules that were evolved during the recording practice itself. But can this empirical knowledge and close miking practice be systematically verified? In this work we aim to address this question based on an analytic methodology that employs techniques and metrics originating from the sound source separation evaluation field. In particular, we apply a quantitative analysis of the source separation capabilities of the close miking technique. The analysis is applied on a recording dataset obtained at multiple positions of a typical musical hall, multiple distances between the microphone and the sound source multiple microphone types and multiple level differences between the sound source and the ambient acoustic component. For all the above cases we calculate the Source to Interference Ratio (SIR) metric. The results obtained clearly demonstrate an optimum close-miking performance that matches the current empirical knowledge of professional audio recording.
Emre Çakir, Sharath Adavanne, Giambattista Parascandolo, Konstantinos Drossos, and Tuomas Virtanen, “Convolutional Recurrent Neural Networks for Bird Audio Detection,” in proceedings of the 25th European Signal Processing Conference (EUSIPCO), Aug. 28–Sep. 2, Kos, Greece, 2017
Bird sounds possess distinctive spectral structure which may exhibit small shifts in spectrum depending on the bird species and environmental conditions. In this paper, we propose using convolutional recurrent neural networks on the task of automated bird audio detection in real-life environments. In the proposed method, convolutional layers extract high dimensional, local frequency shift invariant features, while recurrent layers capture longer term dependencies between the features extracted from short time frames. This method achieves 88.5% Area Under ROC Curve (AUC) score on the unseen evaluation data and obtains the second place in the Bird Audio Detection challenge.
Sharath Adavanne, Konstantinos Drossos, Emre Çakir, and Tuomas Virtanen, “Stacked Convolutional and Recurrent Neural Networks for Bird Audio Detection,” in proceedings of the 25th European Signal Processing Conference (EUSIPCO), Aug. 28–Sep. 2, Kos, Greece, 2017
This paper studies the detection of bird calls in audio segments using stacked convolutional and recurrent neural networks. Data augmentation by blocks mixing and domain adaptation using a novel method of test mixing are proposed and evaluated in regard to making the method robust to unseen data. The contributions of two kinds of acoustic features (dominant frequency and log mel-band energy) and their combinations are studied in the context of bird audio detection. Our best achieved AUC measure on five cross-validations of the development data is 95.5% and 88.1% on the unseen evaluation data.
Miroslav Malik, Sharath Adavanne, Konstantinos Drossos, Tuomas Virtanen, Dasa Ticha, and Roman Jarina, “Stacked Convolutional and Recurrent Neural Networks for Music Emotion Recognition”, in proceedings of the 14th Sound and Music Computing (SMC) conference, Jul. 5–8, Helsinki, Finland, 2017
This paper studies the emotion recognition from musical tracks in the 2-dimensional valence-arousal (V-A) emotional space. We propose a method based on convolutional (CNN) and recurrent neural networks (RNN), having significantly fewer parameters compared with state-of-the-art method for the same task. We utilize one CNN layer followed by two branches of RNNs trained separately for arousal and valence. The method was evaluated using the “MediaEval2015 emotion in music” dataset. We achieved an RMSE of 0.202 for arousal and 0.268 for valence, which is the best result reported on this dataset.
Konstantinos Drossos, Maximos Kaliakatsos-Papakostas, Andreas Floros, and Tuomas Virtanen, “On the Impact of the Semantic Content of Sound Events in Emotion Elicitation,” Journal of Audio Engineering Society, Vol. 64, No. 7/8, pp. 525–532, 2016
Sound events are proven to have an impact on the emotions of the listener. Recent works on the field of emotion recognition from sound events show, on one hand, the possibility of automatic emotional information retrieval from sound events and, on the other hand, the need for deeper understanding of the significance of the sound events’ semantic content on listener’s affective state. In this work we present a first, to the best of authors’ knowledge, investigation of the relation between the semantic similarity of the sound events and the elicited emotion. For that cause we use two emotionally annotated sound datasets and the Wu-Palmer semantic similarity measure according to WordNet. Results indicate that the semantic content seems to have a limited role in the conformation of the listener’s affective states. On the contrary, when the semantic content is matched to specific areas in the Arousal-Valence space or also the source’s spatial position is taken into account, it is exhibited that the importance of the semantic content effect is higher, especially for the cases with medium to low valence and medium to high arousal or when the sound source is at the lateral positions of the listener’s head, respectively.
Stylianos Ioannis Mimilakis, Konstantinos Drossos, Tuomas Virtanen, and Gerald Schuller, “Deep Neural Networks for Dynamic Range Compression in Mastering Applications”, in proceedings of the 140th Audio Engineering Society (AES) Convention, Jul. 4–7, Paris, France, 2016
The process of audio mastering often, if not always, includes various audio signal processing techniques such as frequency equalization and dynamic range compression. With respect to the genre and style of the audio content, the parameters of these techniques are controlled by a mastering engineer, in order to process the original audio material. This operation relies on musical and perceptually pleasing facets of the perceived acoustic characteristics, transmitted from the audio material under the mastering process. Modeling such dynamic operations, which involve adaptation regarding the audio content, becomes vital in automated applications since it significantly affects the overall performance. In this work we present a system capable of modelling such behavior focusing on the automatic dynamic range compression. It predicts frequency coefficients that allow the dynamic range compression, via a trained deep neural network, and applies them to unmastered audio signal served as input. Both dynamic range compression and the prediction of the corresponding frequency coefficients take place inside the time-frequency domain, using magnitude spectra acquired from a critical band filter bank, similar to humans’ peripheral auditory system. Results from conducted listening tests, incorporating professional music producers and audio mastering engineers, demonstrate on average an equivalent performance compared to professionally mastered audio content. Improvements were also observed when compared to relevant and commercial software.
Konstantinos Drossos, Maximos Kaliakatsos-Papakostas, and Andreas Floros, “Affective Audio Synthesis for Sound Experience Enhancement”, Experimental Multimedia Systems for Interactivity and Strategic Inovation, I. Deliyannis, P. Kostagiolas (Eds), IGI-Global, 2016
With the advances of technology, multimedia tend to be a recurring and prominent component in almost all forms of communication. Although their content spans in various categories, there are two protuberant channels that are used for information conveyance, i.e. audio and visual. The former can transfer numerous content, ranging from low-level characteristics (e.g. spatial location of source and type of sound producing mechanism) to high and contextual (e.g. emotion). Additionally, recent results of published works depict the possibility for automated synthesis of sounds, e.g. music and sound events. Based on the above, in this chapter the authors propose the integration of emotion recognition from sound with automated synthesis techniques. Such a task will enhance, on one hand, the process of computer driven creation of sound content by adding an anthropocentric factor (i.e. emotion) and, on the other, the experience of the multimedia user by offering an extra constituent that will intensify the immersion and the overall user experience level.
Konstantinos Drossos, Andreas Floros, and Katia Kermanidis, “Evaluating the Impact of Sound Events’ Rhythm Characteristics to Listener’s Valence”, Journal of Audio Engineering Society, Vol. 63, No. 3, pp. 139–153, 2015
While modern sound researchers generally focus on speech and music, mammalian hearing arose from the need to sense those events in the environment that produced sound waves. Such unorganized sound stimuli, referred to as Sound Events (SEs), can also produce an affective and emotional response. In this research, the investigators explore valence recognition of SEs utilizing rhythm-related acoustics cues. A well-known data set with emotionally annotated SEs was employed; various rhythm-related attributes were then extracted and several machine-learning experiments were conducted. The results portray that the rhythm of a SE can affect the listener’s valence up to an extent and, combined with previous works on SEs, could lead to a comprehensive recognition of the rhythm’s effect on the emotional state of the listener.
Konstantinos Drossos, Andreas Floros, Andreas Giannakoulopoulos, and Nikolaos Kanellopoulos, “Investigating the Impact of Sound Angular Position on the Listener Affective State”, IEEE Transactions on Affective Computing, Vol. 6, No. 1, pp. 27–42, 2015
Emotion recognition from sound signals represents an emerging field of recent research. Although many existingworks focus on emotion recognition from music, there seems to be a relative scarcity of research on emotion recognition fromgeneral sounds. One of the key characteristics of sound events is the sound source spatial position, i.e. the location of the sourcerelatively to the acoustic receiver. Existing studies that aim to investigate the relation of the latter source placement and theelicited emotions are limited to distance, front and back spatial localization and/or specific emotional categories. In this paper we analytically investigate the effect of the source angular position on the listener’s emotional state, modeled in the well–established valence/arousal affective space. Towards this aim, we have developed an annotated sound events dataset using binaural processed versions of the available International Affective Digitized Sound (IADS) sound events library. All subjective affective annotations were obtained using the Self Assessment Manikin (SAM) approach. Preliminary results obtained by processing these annotation scores are likely to indicate a systematic change in the listener affective state as the sound source angular position changes. This trend is more obvious when the sound source is located outside of the visible field of the listener.
Konstantinos Drossos, Nikolaos Zormpas, George Giannakopoulos, and Andreas Floros, “Accessible Games for Blind Children, Empowered by Binaural Sound," in proceedings of the 8th Pervasive Technologies Related to Assistive Environments (PETRA) Conference, Jul. 1–3, Corfu, Greece, 2015
Accessible games have been researched and developed for many years, however, blind people still have very limited access and knowledge of them. This can pose a serious limitation, especially for blind children, since in recent years electronic games have become one of the most common and wide spread means of entertainment and socialization. For our implementation we use binaural technology which allows the player to hear and navigate the game space by adding localization information to the game sounds. With our implementation and user studies we provide insight on what constitutes an accessible game for blind people as well as a functional game engine for such games. The game engine developed allows the quick development of games for the visually impaired. Our work provides a good starting point for future developments on the field and, as the user studies show, was very well perceived by the visually impaired children that tried it.
Konstantinos Drossos, Andreas Floros, and Nikolaos Kanellopoulos, “A Loudness-based Adaptive Equalization Technique for Subjectively Improved Sound Reproduction," in proceedings of the Audio Engineering Society (AES) 136th convention, Apr. 26–29, Berlin, Germany, 2014.
Sound equalization is a common approach for objectively or subjectively defining the reproduction level at specific frequency bands. It is also well-known that the human auditory system demonstrates an inner process of sound-weighting. Due to this, the perceived loudness changes with the frequency and the user-defined sound reproduction gain, resulting into a deviation of the intended and the perceived equalization scheme as the sound level changes. In this work we introduce a novel equalization approach that takes into account the above perceptual loudness effect in order to achieve subjectively constant equalization. A series of listening tests shows that the proposed equalization technique is an efficient and listener-preferred alternative for both professional and home audio reproduction applications.
Konstantinos Drossos, Andreas Floros, Stelios Potirakis, Nikolas-Alexander Tatlas, and Gurkan Tuna, “A socially-intelligent multirobot service team for in-home monitoring," in proceedings of the 5th IEEE International Conference on Information, Intelligence, Systems and Applications (IISA), Jul. 9–7, Chania, Greece, 2014.
The objective of this study is to develop a socially-intelligent service team comprised of multiple robots with sophisticated sonic interaction capabilities that aims to transparently collaborate towards efficient and robust monitoring by close interaction. In the distributed scenario proposed in this study, the robots share any acoustic data extracted from the environment and act in-sync with the events occurring in their living environment in order to provide potential means for efficient monitoring and decision-making within a typical home enclosure. Although each robot acts as an individual recognizer using a novel emotionally-enriched word recognition system, the final decision is social in nature and is followed by all. Moreover, the social decision stage triggers actions that are algorithmically distributed among the robots' population and enhances the overall approach with the potential advantages of the team work within specific communities through collaboration.
Konstantinos Drossos, Andreas Floros, and Andreas Giannakoulopoulos, “BEADS: A Dataset of Binaural Emotionally Annotated Digital Sounds," in proceedings of the 5th IEEE International Conference on Information, Intelligence, Systems and Applications (IISA), Jul. 9–7, Chania, Greece, 2014.
Emotion recognition from generalized sounds is an interdisciplinary and emerging field of research. A vital requirement for this kind of investigations is the availability of ground truth datasets. Currently, there are 2 freely available datasets of emotionally annotated sounds, which, however, do not include sound evenets (SEs) with manifestation of the spatial location of the source. The latter is an inherent natural component of SEs, since all sound sources in real-world conditions are physically located and perceived somewhere in the listener's surrounding space. In this work we present a novel emotionally annotated sounds dataset consisting of 32 SEs that are spatially rendered using appropriate binaural processing. All SEs in the dataset are available in 5 spatial positions corresponding to source/receiver angles equal to 0, 45, 90, 135 and 180 degrees. We have used the IADS dataset as the initial collection of SEs prior to binaural processing. The annotation measures obtained for the novel binaural dataset demonstrate a significant accordance with the existing IADS dataset, while small ratings declinations illustrate a perceptual adaptation imposed by the more realistic SEs spatial representation.
Maximos Kaliakatsos–Papakostas, Andreas Floros, Konstantinos Drossos, Konstantinos Koukoudis, Manolis Kuzalas, and Achileas Kalantzis, “Swarm Lake: A Game of Swarm Intelligence, Human Interaction and Collaborative Music Composition," in proceedings of the Joint Conference ICMC/SMC 2014, Sep. 14–20, Athens, Greece, 2014.
In this work we aim to combine a game platform with the concept of collaborative music synthesis. We use bio-inspired intelligence for developing a world - the Lake - where multiple tribes of artiﬁcial, autonomous agents live within, having survival as their ultimate goal. The tribes exhibit primitive social swarm-based behavior and intelligence, which is used for taking actions that will potentially allow to dominate the game world. Tribes’ populations also demonstrate a number of physical properties that re-strict their ability to act illimitably. Multiuser interventionis employed in parallel, affecting the automated decisions and the physical parameters of the tribes, thus infusing the gaming orientation of the application context. Finally,sound synthesis is achieved through a complex mapping scheme established between the events occurring in the Lake and the rhythmic, harmonic and dynamic-range parameters of an advanced, collaborative sound composition engine. This complex mapping scheme allows the production of interesting and complicated sonic patterns that fol-low the performance evolution in both objective and conceptual levels. The overall synthesis process is controlled by the conductor, a virtual entity that determines the synthesis evolution in a way that is very similar to directing an ensemble performance in real world.
Stylianos Ioannis Mimilakis, Konstantinos Drossos, Andreas Floros, and Dionisios Katerelos, “Automated Tonal Balance Enhancement for Audio Mastering Applications”, in proceedings of the 134th Audio Engineering Society Convention, May 4–7, Rome, Italy, 2013
Modern audio mastering procedures are involved with the selective enhancement or attenuation of specific frequency bands. The main reason is the tonal enhancement of the original / unmastered audio material. The aforementioned process is mostly based on the musical information and the mode of the audio material. This information can be retrieved from a listening procedure of the original stimuli, or the correspondent musical key notes. The current work presents an adaptive and automated equalization system that performs the aforementioned mastering procedure, based on a novel method of fundamental frequency tracking. In addition to this, the overall system is being evaluated with objective PEAQ analysis and subjective listening tests in real mastering audio conditions.
Konstantinos Drossos, Konstantinos Koukoudis, and Andreas Floros, “Gestural User Interface for Audio Multitrack Real-time Stereo Mixing," in proceedings of the 8th Conference on Interaction with Sound - Audio Mostly 2013, Sep. 18–20, Piteå, Sweden, 2013
Sound mixing is a well-established task applied (directly or indirectly) in many fields of music and sound production. For example, in the case of classical music orchestras, their conductors perform sound mixing by specifying the reproduction gain of specific groups of musical instruments or of the entire orchestra. Moreover, modern sound artists and performers also employ sound mixing when they compose music or improvise in real-time. In this work a system is presented that incorporates a gestural interface for real-time multitrack sound mixing. The proposed gestural sound mixing control scheme is implemented on an open hardware micro-controller board, using common sensor modules. The gestures employed are as close as possible to the ones particularly used by the orchestra conductors. The system overall performance is also evaluated in terms of the achieved user experience through subjective tests.
Konstantinos Drossos, Rigas Kotsakis, Panos Pappas, George Kalliris, and Andreas Floros, “Investigating Auditory Human-Machine Interaction: Analysis and Classification of Sounds Commonly Used by Consumer Devices”, in proceedings of the 134th Audio Engineering Society Convention, May 4–7, Rome, Italy, 2013
Many common consumer devices use a short sound indication for declaring various modes of their functionality, such as the start and the end of their operation. This is likely to result in an intuitive auditory human-machine interaction, imputing a semantic content to the sounds used. In this paper we investigate sound patterns mapped to "Start" and "End" of operation manifestations and explore the possibility such semantics’ perception to be based either on users’ prior auditory training or on sound patterns that naturally convey appropriate information. To this aim, listening and machine learning tests were conducted. The obtained results indicate a strong relation between acoustic cues and semantics along with no need of prior knowledge for message conveyance.
Konstantinos Drossos, Rigas Kotsakis, George Kalliris, and Andreas Floros, “Sound Events and Emotions: Investigating the Relation of Rhythmic Characteristics and Arousal”, in proceedings of the 4th IEEE International Conference on Information, Intelligence, Systems and Applications (IISA 2013), Jul. 10–12, Piraeus, Greece, 2013
A variety of recent researches in Audio Emotion Recognition (AER) outlines high performance and retrieval accuracy results. However, in most works music is considered as the original sound content that conveys the identified emotions. One of the music characteristics that is found to represent a fundamental means for conveying emotions are the rhythm-related acoustic cues. Although music is an important aspect of everyday life, there are numerous non-linguistic and nonmusical sounds surrounding humans, generally defined as sound events (SEs). Despite this enormous impact of SEs to humans, a scarcity of investigations regarding AER from SEs is observed. There are only a few recent investigations concerned with SEs and AER, presenting a semantic connection between the former and the listener's triggered emotion. In this work we analytically investigate the connection of rhythm-related characteristics of a wide range of common SEs with the arousal of the listener using sound events with semantic content. To this aim, several feature evaluation and classification tasks are conducted using different ranking and classification algorithms. High accuracy results are obtained, demonstrating a significant relation of SEs rhythmic characteristics to the elicited arousal.
Konstantinos Drossos, Andreas Floros, and Nikolaos Kanellopoulos, “Affective Acoustic Ecology: Towards Emotionally Enhanced Sound Events”, in proceedings of the 7th Conference on Interaction with Sound - Audio Mostly 2012, Sep. 26 – 28, Corfu, Greece, 2012
Sound events can carry multiple information, related to the sound source and to ambient environment. However, it is well-known that sound evokes emotions, a fact that is verified by works in the disciplines of Music Emotion Recognition and Music Information Retrieval that focused on the impact of music to emotions. In this work we introduce the concept of affective acoustic ecology that extends the above relation to the general concept of sound events. Towards this aim, we define sound event as a novel audio structure with multiple components. We further investigate the application of existing emotion models employed for music affective analysis to sonic, non-musical, content. The obtained results indicate that although such application is feasible, no significant trends and classification outcomes are observed that would allow the definition of an analytic relation between the technical characteristics of a sound event waveform and raised emotions.
Konstantinos Drossos, Andreas Floros, Kyriakos Agavanakis, Nikolas-Alexander Tatlas and Nikolaos Kanellopoulos, “Emergency Voice/Stress - level Combined Recognition for Intelligent House Applications”, in proceedings of the 132nd Audio Engineering Convention, Apr. 26–29, Budapest, Hungary, 2012
Legacy technologies for word recognition can benefit from emerging affective voice retrieval, potentially leading to intelligent applications for smart houses enhanced with new features. In this work we introduce the implementation of a system, capable to react to common spoken words, taking into account the estimated vocal stress level, thus allowing the realization of a prioritized, affective aural interaction path. Upon the successful word recognition and the corresponding stress level estimation, the system triggers particular affective-prioritized actions, defined within the application scope of an intelligent home environment. Application results show that the established affective interaction path significantly improves the ambient intelligence provided by an affective vocal sensor that can be easily integrated with any sensor-based home monitoring system.
Dionisios Katerelos, Konstantinos Drossos, Anastasios Kokkinos, and Stylianos Ioannis Mimilakis, “iReflectors - Intelligent Reflectors from Composite Materials”, in proceedings of the 6th Greek National Conference Acoustics 2012, Oct. 8–10, Corfu, Greece, 2012
The use of reflectors for the optimal sound diffusion is a major issue in Room Acoustics. Up to now, the applied reflectors are stable, with certain shape and made by conventional materials. In the present is studied the possibility to replace the conventional reflectors by new, manufactured by composite materials. The aim is to design flexible “intelligent” reflectors that will adapt their shape depending on the certain acoustical needs of a room. This change is planned to be actuated using embedded shape memory alloy (SMA) wires. The adaptation process will be controlled automatically by an electronic system. In order to control damage initiation and growth within the composite panel, an optical fibres network will be applied.
Elias Kokkinis, Konstantinos Drossos, Nikolas-Alexander Tatlas, Andreas Floros, Alexandros Tsilfidis and Kyriakos Agavanakis, “Smart microphone sensor system platform”, in proceedings of the 132nd Audio Engineering Society Convention, Apr. 26–29, Budapest, Hungary, 2012
A platform for a flexible, smart microphone system using available hardware components is presented. Three subsystems are employed, specifically: (a) a set of digital MEMs microphones, with a one-bit serial output; (b) a preprocessing/digital-to-digital converter; and (c) a CPU/DSP-based embedded system with I2S connectivity. Basic preprocessing functions, such as noise gating and filtering can be performed in the preprocessing stage, while application-specific algorithms such as word spotting, beam-forming, and reverberation suppression can be handled by the embedded system. Widely used high-level operating systems are supported including drivers for a number of peripheral devices. Finally, an employment scenario for a wireless home automation speech activated front-end sensor system using the platform is analyzed.
Konstantinos Drossos, Stylianos Ioannis Mimilakis, Andreas Floros, and Nikolaos Kanellopoulos, “Stereo Goes Mobile: Spatial Enhancement for Short-distance Loudspeaker Setups”, in proceedings of the 8th International Conference on Intelligent Information Hiding and Multimedia Signal Processing (IIHMSP), Jul. 18–20, Piraeus, Greece, 2012
Modern mobile, hand-held devices offer enhanced capabilities for video and sound reproduction. Nevertheless, major restrictions imposed by their limited size render them inconvenient for headset-free stereo sound reproduction, since the corresponding short-distant loudspeakers placement physically narrows the perceived stereo sound localization potential. In this work, we aim at evaluating a spatial enhancement technique for small-size mobile devices. This technique extracts the original panning information from an original stereo recording and spatially extends it using appropriate binaural rendering. A sequence of subjective tests performed shows that the derived spatial perceptual impression is significantly improved in all test cases considered, rendering the proposed technique an attractive approach towards headset-free mobile audio reproduction.
Vassilis Psarras, Andreas Floros, Konstantinos Drossos, and Marianne Strapatsakis, “Emotional Control and Visual Representation Using Advanced Audiovisual Interaction”, International Journal of Arts and Technology, Vol. 4 (4), 2011, pp. 480-498
Modern interactive means combined with new digital media processing and representation technologies can provide a robust framework for enhancing user experience in multimedia entertainment systems and audiovisual artistic installations with non-traditional interaction/feedback paths based on user affective state. In this work, the ‘Elevator’ interactive audiovisual platform prototype is presented, which aims to provide a framework for signalling and expressing human behaviour related to emotions (such as anger) and finally produce a visual outcome of this behaviour, defined here as the emotional ‘thumbnail’ of the user. Optimised, real-time audio signal processing techniques are employed for monitoring the achieved anger-like behaviour, while the emotional elevation is attempted using appropriately selected combined audio/visual content reproduced using state-of-the-art audiovisual playback technologies that allow the creation of a realistic immersive audiovisual environment. The demonstration of the proposed prototype has shown that affective interaction is possible, allowing the further development of relative artistic and technological applications.
Nikolas Grigoriou, Andreas Floros, and Konstantinos Drossos, “Binaural Mixing Using Gestural Control Interaction”, in proceedings of the 5th Conference on Interaction with Sound - Audio Mostly 2010, Sep. 15–17, Piteå, Sweden, 2010
In this work a novel audio binaural mixing platform is presented which employs advanced gestural-based interaction techniques for controlling the mixing parameters. State-of-the-art binaural technology algorithms are used for producing the final two-channel binaural signal. These algorithms are optimized for realtime operation, able to manipulate high-quality audio (typically 24bit / 96kHz) for an arbitrary number of fixed-position or moving sound sources in closed acoustic enclosures. Simple gestural rules are employed, which aim to provide the complete functionality required for the mixing process, using low cost equipment. It is shown that the proposed platform can be efficiently used for general audio mixing / mastering purposes, providing an attractive alternative to legacy hardware control designs and software-based mixing user interfaces.
Panagiotis Vlamos, Andreas Floros, Michail Giannakos, Konstantinos Drossos, “Towards an Interactive e-Learning System Based on Emotion and Affective Cognition”, in proceedings of the International Conference on Information Communication Technologies and Education (ICICTE), Jul. 8–10, Corfu, Greece, 2012, pp 367–376
In order to promote a more dynamic and flexible communication between the learner and the system, we present a structure of a new innovative and interactive e-learning system which implements emotion and level of cognition recognition. The system has as inputs the emotional and cognitive state of the user and re-organises the content and adjusts the flow of the course. Our concept aims to increase the learning efficiency of intelligent tutoring systems by using a combination of characteristics, such as content customization and user’s emotion recognition, and adapting all these features into a learner-centered educational system.
Timothy Mellow, Olga Umnova, Konstantinos Drossos, Keith Holland, Andrew Flewitt and Leo Kärkkäinen, “On The Adsorption - Desorption Relaxation Time Of Carbon In Very Narrow Ducts”, in proceedings of the Acoustics 08 conference, Jun. 29–Jul. 04, Paris, France, 2008
Loudspeakers generally have boxes to prevent rear wave cancellation at low frequencies. However, the stiffness of the air in a small box reduces the diaphragm’s excursion at low frequencies. Hence the box size is generally a compromise between low frequency performance and practicality. Activated carbon has been found to increase the apparent size of a given box through adsorption of the air molecules when the pressure increases and likewise desorption when it decreases. However, the exact viscous effects in the granular structure are difficult to model. Thus it is impossible determine the high frequency limit due to the natural adsorption/desorption relaxation time in the absence of viscous losses. In this study, a tube model is presented which takes into account viscous and thermal losses with boundary slip together with adsorption. Impedance measurements are performed on an array of 12 million holes, each 2 micrometers in diameter, etched in a 0.25 mm thick silicon wafer so that the viscous and thermal losses can be verified against the model without adsorption. Impedance measurements are then performed on an array of holes coated with graphite in order to create an activated carbon- like structure, thus enabling the adsorption/desorption relaxation time to be evaluated.