Konstantinos Drossos, Samuel Lipping, Tuomas Virtanen, “Clotho: An Audio Captioning Dataset,” in proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 04–08, Barcelona, Spain, 2020
Audio captioning is the novel task of general audio content description using free text. It is an intermodal translation task (not speech-to-text), where a system accepts as an input an audio signal and outputs the textual description (i.e. the caption) of that signal. In this paper we present Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds duration and 24 905 captions of eight to 20 words length, and a baseline method to provide initial results. Clotho is built with focus on audio content and caption diversity, and the splits of the data are not hampering the training or evaluation of methods. All sounds are from the Freesound platform, and captions are crowdsourced using Amazon Mechanical Turk and annotators from English speaking countries. Unique words, named entities, and speech transcription are removed with post-processing. Clotho is freely available online (https://zenodo.org/record/3490684).
Yanxiong Li, Mingle Liu, Konstantinos Drossos, Tuomas Virtanen, “Sound event detection via dilated convolutional recurrent neural networks,” in proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 04–08, Barcelona, Spain, 2020
Convolutional recurrent neural networks (CRNNs) have achieved state-of-the-art performance for sound event detection (SED). In this paper, we propose to use a dilated CRNN, namely a CRNN with a dilated convolutional kernel, as the classifier for the task of SED. We investigate the effectiveness of dilation operations which provide a CRNN with expanded receptive fields to capture long temporal context without increasing the amount of CRNN's parameters. Compared to the classifier of the baseline CRNN, the classifier of the dilated CRNN obtains a maximum increase of 1.9%, 6.3% and 2.5% at F1 score and a maximum decrease of 1.7%, 4.1% and 3.9% at error rate (ER), on the publicly available audio corpora of the TUT-SED Synthetic 2016, the TUT Sound Event 2016 and the TUT Sound Event 2017, respectively.
Konstantinos Drossos, Stylianos I. Mimilakis, Shayan Gharib, Yanxiong Li, Tuomas Virtanen “Sound Event Detection with Depthwise Separable and Dilated Convolutions,” in proceedings of the IEEE World Congress on Computational Intelligence/International Joint Conference on Neural Networks (WCCI/IJCNN), Jul. 19–24, Glasgow, Scotland, 2020
State-of-the-art sound event detection (SED) methods usually employ a series of convolutional neural networks (CNNs) to extract useful features from the input audio signal, and then recurrent neural networks (RNNs) to model longer temporal context in the extracted features. The number of the channels of the CNNs and size of the weight matrices of the RNNs have a direct effect on the total amount of parameters of the SED method, which is to a couple of millions. Additionally, the usually long sequences that are used as an input to an SED method along with the employment of an RNN, introduce implications like increased training time, difficulty at gradient flow, and impeding the parallelization of the SED method. To tackle all these problems, we propose the replacement of the CNNs with depthwise separable convolutions and the replacement of the RNNs with dilated convolutions. We compare the proposed method to a baseline convolutional neural network on a SED task, and achieve a reduction of the amount of parameters by 85% and average training time per epoch by 78%, and an increase the average frame-wise F1 score and reduction of the average error rate by 4.6% and 3.8%, respectively.
S. I. Mimilakis, K. Drossos, E. Cano, and G. Schuller, “Examining the Mapping Functions of Denoising Autoencoders in Music Source Separation,” in IEEE/ACM Transaction on Audio, Speech, and Language Processing (TASLP), vol. 28, pp 262-278, 2019.
The goal of this work is to investigate what singing voice separation approaches based on neural networks learn from the data. We examine the mapping functions of neural networks based on the denoising autoencoder (DAE) model that are conditioned on the mixture magnitude spectra. To approximate the mapping functions, we propose an algorithm inspired by the knowledge distillation, denoted the neural couplings algorithm (NCA). The NCA yields a matrix that expresses the mapping of the mixture to the target source magnitude information. Using the NCA, we examine the mapping functions of three fundamental DAE-based models in music source separation; one with single-layer encoder and decoder, one with multi-layer encoder and single-layer decoder, and one using skip-filtering connections (SF) with a single-layer encoding and decoding. We first train these models with realistic data to estimate the singing voice magnitude spectra from the corresponding mixture. We then use the optimized models and test spectral data as input to the NCA. Our experimental findings show that approaches based on the DAE model learn scalar filtering operators, exhibiting a predominant diagonal structure in their corresponding mapping functions, limiting the exploitation of inter-frequency structure of music data. In contrast, skip-filtering connections are shown to assist the DAE model in learning filtering operators that exploit richer inter-frequency structures.
Samuel Lipping, Konstantinos Drossos, and Tuomas Virtanen, “Crowdsourcing a Dataset of Audio Captions,” in proceedings of the International Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), Oct. 26–27, New York, NY, U.S.A., 2019
Audio captioning is a novel field of multi-modal translation and it is the task of creating a textual description of the content of an audio signal (e.g. "people talking in a big room"). The creation of a dataset for this task requires a considerable amount of work, rendering the crowdsourcing a very attractive option. In this paper we present a three steps based framework for crowdsourcing an audio captioning dataset, based on concepts and practises followed for the creation of widely used image captioning and machine translations datasets. During the first step initial captions are gathered. A grammatically corrected and/or rephrased version of each initial caption is obtained in second step. Finally, the initial and edited captions are rated, keeping the top ones for the produced dataset. We objectively evaluate the impact of our framework during the process of creating an audio captioning dataset, in terms of diversity and amount of typographical errors in the obtained captions. The obtained results show that the resulting dataset has less typographical errors than the initial captions, and on average each sound in the produced dataset has captions with a Jaccard similarity of 0.24, roughly equivalent to two ten-word captions having in common four words with the same root, indicating that the captions are dissimilar while they still contain some of the same information.
K. Drossos, S. Gharib, P. Magron, and T. Virtanen, “Language Modelling for Sound Event Detection with Teacher Forcing and Scheduled Sampling,” submitted at the Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), Oct. 26–27, New York, NY, U.S.A., 2019
A sound event detection (SED) method typically takes as an input a sequence of audio frames and predicts the activities of sound events in each frame. In real-life recordings, the sound events exhibit some temporal structure: for instance, a "car horn" will likely be followed by a "car passing by". While this temporal structure is widely exploited in sequence prediction tasks (e.g., in machine translation), where language models (LM) are exploited, it is not satisfactorily modeled in SED. In this work we propose a method which allows a recurrent neural network (RNN) to learn an LM for the SED task. The method conditions the input of the RNN with the activities of classes at the previous time step. We evaluate our method using F1 score and error rate (ER) over three different and publicly available datasets; the TUT-SED Synthetic 2016 and the TUT Sound Events 2016 and 2017 datasets. The obtained results show an increase of 6% and 3% at the F1 (higher is better) and a decrease of 3% and 2% at ER (lower is better) for the TUT Sound Events 2016 and 2017 datasets, respectively, when using our method. On the contrary, with our method there is a decrease of 10% at F1 score and an increase of 11% at ER for the TUT-SED Synthetic 2016 dataset.
K. Drossos, P. Magron, and T. Virtanen, “Unsupervised Adversarial Domain Adaptation Based On The Wasserstein Distance For Acoustic Scene Classification,” accepted for publication at the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Oct. 20–23, N. Paltz, NY, U.S.A., 2019
A challenging problem in deep learning-based machine listening field is the degradation of the performance when using data from unseen conditions. In this paper we focus on the acoustic scene classification (ASC) task and propose an adversarial deep learning method to allow adapting an acoustic scene classification system to deal with a new acoustic channel resulting from data captured with a different recording device. We build upon the theoretical model of HΔH-distance and previous adversarial discriminative deep learning method for ASC unsupervised domain adaptation, and we present an adversarial training based method using the Wasserstein distance. We improve the state-of-the-art mean accuracy on the data from the unseen conditions from 32% to 45%, using the TUT Acoustic Scenes dataset.
Stylianos Ioannis Mimilakis, Estafanía Cano, Derry Fitzgerald, Konstantinos Drossos, and Gerald Schuller, “Examining the Perceptual Effect of Alternative Objective Functions for Deep Learning Based Music Source Separation,” in proceedings of the IEEE Asilomar Conference on Signals, Systems, and Computers, Oct. 28–31, Pacific Grove, CA, U.S.A., 2018
In this study, we examine the effect of various objective functions used to optimize the recently proposed deep learning architecture for singing voice separation MaD - Masker and Denoiser. The parameters of the MaD architecture are optimized using an objective function that contains a reconstruction criterion between predicted and true magnitude spectra of the singing voice, and a regularization term. We examine various reconstruction criteria such as the generalized Kullback-Leibler, mean squared error, and noise to mask ratio. We also explore recently proposed, for optimizing MaD, regularization terms such as sparsity and TwinNetwork regularization. Results from both objective assessment and listening tests suggest that the TwinNetwork regularization results in improved singing voice separation quality.
Konstantinos Drossos, Paul Magron, Stylianos Ioannis Mimilakis, and Tuomas Virtanen, “Harmonic-Percussive Source Separation with Deep Neural Networks and Phase Recovery,” in proceedings of the 16th International Workshop on Acoustic Signal Enhancement (IWAENC), Sep. 17–20, Tokyo, Japan, 2018
Harmonic/percussive source separation (HPSS) consists in separating the pitched instruments from the percussive parts in a music mixture. In this paper, we propose to apply the recently introduced Masker-Denoiser with twin networks (MaD TwinNet) system to this task. MaD TwinNet is a deep learning architecture that has reached state-of-the-art results in monaural singing voice separation. Herein, we propose to apply it to HPSS by using it to estimate the magnitude spectrogram of the percussive source. Then, we retrieve the complex-valued short-time Fourier transform of the sources by means of a phase recovery algorithm, which minimizes the reconstruction error and enforces the phase of the harmonic part to follow a sinusoidal phase model. Experiments conducted on realistic music mixtures show that this novel separation system outperforms the previous state-of-the art kernel additive model approach.
Konstantinos Drossos, Stylianos Ioannis Mimilakis, Dmitry Serdyuk, Gerald Schuller, Tuomas Virtanen, and Yoshua Bengio, “MaD TwinNet: Masker-Denoiser Architecture with Twin Networks for Monaural Sound Source Separation,” in proceedings of the IEEE World Congress on Computational Intelligence/International Joint Conference on Neural Networks (WCCI/IJCNN), Jul. 8–13, Rio de Janeiro, Brazil, 2018
Monaural singing voice separation task focuses on the prediction of the singing voice from a single channel music mixture signal. Current state of the art (SOTA) results in monaural singing voice separation are obtained with deep learning based methods. In this work we present a novel recurrent neural approach that learns long-term temporal patterns and structures of a musical piece. We build upon the recently proposed Masker-Denoiser (MaD) architecture and we enhance it with the Twin Networks, a technique to regularize a recurrent generative network using a backward running copy of the network. We evaluate our method using the Demixing Secret Dataset and we obtain an increment to signal-to-distortion ratio (SDR) of 0.37 dB and to signal-to-interference ratio (SIR) of 0.23 dB, compared to previous SOTA results.
Stylianos Ioannis Mimilakis, Konstantinos Drossos, João F. Santos, Gerald Schuller, Tuomas Virtanen, and Yoshua Bengio, “Monaural Singing Voice Separation with Skip-Filtering Connections and Recurrent Inference of Time- Frequency Mask,” in proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 15–20, Calgary, Canada, 2018
Singing voice separation based on deep learning relies on the usage of time-frequency masking. In many cases the masking process is not a learnable function or is not encapsulated into the deep learning optimization. Consequently, most of the existing methods rely on a post processing step using the generalized Wiener filtering. This work proposes a method that learns and optimizes (during training) a source-dependent mask and does not need the aforementioned post processing step. We introduce a recurrent inference algorithm, a sparse transformation step to improve the mask generation process, and a learned denoising filter. Obtained results show an increase of 0.49 dB for the signal to distortion ratio and 0.30 dB for the signal to interference ratio, compared to previous state-of-the-art approaches for monaural singing voice separation.
Paul Magron, Konstantinos Drossos, Stylianos Ioannis Mimilakis, and Tuomas Virtanen, “Reducing Interference with Phase Recovery in DNN-based Monaural Singing Voice Separation,” in proceedings of the INTERSPEECH 2018, Sep. 2–6, Hyderabad, India, 2018
State-of-the-art methods for monaural singing voice separation consist in estimating the magnitude spectrum of the voice in the short-time Fourier transform (STFT) domain by means of deep neural networks (DNNs). The resulting magnitude estimate is then combined with the mixture's phase to retrieve the complex-valued STFT of the voice, which is further synthesized into a time-domain signal. However, when the sources overlap in time and frequency, the STFT phase of the voice differs from the mixture's phase, which results in interference and artifacts in the estimated signals. In this paper, we investigate on recent phase recovery algorithms that tackle this issue and can further enhance the separation quality. These algorithms exploit phase constraints that originate from a sinusoidal model or from consistency, a property that is a direct consequence of the STFT redundancy. Experiments conducted on real music songs show that those algorithms are efficient for reducing interference in the estimated voice compared to the baseline approach.
Shayan Gharib, Konstantinos Drossos, Emre Çakir, Dmitry Serdyuk, and Tuomas Virtanen, “Unsupervised adversarial domain adaptation for acoustic scene classification,” in proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE), Non. 19–20, Surrey, U.K., 2018
A general problem in acoustic scene classification task is the mismatched conditions between training and testing data, which significantly reduces the performance of the developed methods on classification accuracy. As a countermeasure, we present the first method of unsupervised adversarial domain adaptation for acoustic scene classification. We employ a model pre-trained on data from one set of conditions and by using data from other set of conditions, we adapt the model in order that its output cannot be used for classifying the set of conditions that input data belong to. We use a freely available dataset from the DCASE 2018 challenge Task 1, subtask B, that contains data from mismatched recording devices. We consider the scenario where the annotations are available for the data recorded from one device, but not for the rest. Our results show that with our model agnostic method we can achieve ∼10% increase at the accuracy on an unseen and unlabeled dataset, while keeping almost the same performance on the labeled dataset.
Stylianos Ioannis Mimilakis, Konstantinos Drossos, Tuomas Virtanen, and Gerald Schuller, “A Recurrent Encoder-Decoder Approach with Skip-Filtering Connections for Monaural Singing Voice Separation, ” in proceedings of the 27th IEEE International Workshop on Machine Learning for Signal Processing (MLSP), Sep. 25–28, Tokyo, Japan, 2017
The objective of deep learning methods based on encoder-decoder architectures for music source separation is to approximate either ideal time-frequency masks or spectral representations of the target music source(s). The spectral representations are then used to derive time-frequency masks. In this work we introduce a method to directly learn time-frequency masks from an observed mixture magnitude spectrum. We employ recurrent neural networks and train them using prior knowledge only for the magnitude spectrum of the target source. To assess the performance of the proposed method, we focus on the task of singing voice separation. The results from an objective evaluation show that our proposed method provides comparable results to deep learning based methods which operate over complicated signal representations. Compared to previous methods that approximate time-frequency masks, our method has increased performance of signal to distortion ratio by an average of 3.8 dB.
Konstantinos Drossos, Sharath Adavanne, and Tuomas Virtanen, “Automated Audio Captioning with Recurrent Neural Networks,” in proceedings of the 11th IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Oct. 15–18, New Paltz, N.Y. U.S.A., 2017.
We present the first approach to automated audio captioning. We employ an encoder-decoder scheme with an alignment model in between. The input to the encoder is a sequence of log mel-band energies calculated from an audio file, while the output is a sequence of words, i.e. a caption. The encoder is a multi-layered, bi-directional gated recurrent unit (GRU) and the decoder a multi-layered GRU with a classification layer connected to the last GRU of the decoder. The classification layer and the alignment model are fully connected layers with shared weights between timesteps. The proposed method is evaluated using data drawn from a commercial sound effects library, ProSound Effects. The resulting captions were rated through metrics utilized in machine translation and image captioning fields. Results from metrics show that the proposed method can predict words appearing in the original caption, but not always correctly ordered.
K. Drossos, S. Mimilakis, A. Floros, T. Virtanen, and G. Schuller, “Close Miking Empirical Practice Verification: A Source Separation Approach”, in proceedings of the 142nd Audio Engineering Society (AES) Convention, May 20–23, Berlin, Germany, 2017
Close miking represents a widely employed practice of placing a microphone very near to the sound source in order to capture more direct sound and minimize any pickup of ambient sound, including other, concurrently active sources. It is used by the audio engineering community for decades for audio recording, based on a number of empirical rules that were evolved during the recording practice itself. But can this empirical knowledge and close miking practice be systematically verified? In this work we aim to address this question based on an analytic methodology that employs techniques and metrics originating from the sound source separation evaluation field. In particular, we apply a quantitative analysis of the source separation capabilities of the close miking technique. The analysis is applied on a recording dataset obtained at multiple positions of a typical musical hall, multiple distances between the microphone and the sound source multiple microphone types and multiple level differences between the sound source and the ambient acoustic component. For all the above cases we calculate the Source to Interference Ratio (SIR) metric. The results obtained clearly demonstrate an optimum close-miking performance that matches the current empirical knowledge of professional audio recording.
Emre Çakir, Sharath Adavanne, Giambattista Parascandolo, Konstantinos Drossos, and Tuomas Virtanen, “Convolutional Recurrent Neural Networks for Bird Audio Detection,” in proceedings of the 25th European Signal Processing Conference (EUSIPCO), Aug. 28–Sep. 2, Kos, Greece, 2017
Bird sounds possess distinctive spectral structure which may exhibit small shifts in spectrum depending on the bird species and environmental conditions. In this paper, we propose using convolutional recurrent neural networks on the task of automated bird audio detection in real-life environments. In the proposed method, convolutional layers extract high dimensional, local frequency shift invariant features, while recurrent layers capture longer term dependencies between the features extracted from short time frames. This method achieves 88.5% Area Under ROC Curve (AUC) score on the unseen evaluation data and obtains the second place in the Bird Audio Detection challenge.
Sharath Adavanne, Konstantinos Drossos, Emre Çakir, and Tuomas Virtanen, “Stacked Convolutional and Recurrent Neural Networks for Bird Audio Detection,” in proceedings of the 25th European Signal Processing Conference (EUSIPCO), Aug. 28–Sep. 2, Kos, Greece, 2017
This paper studies the detection of bird calls in audio segments using stacked convolutional and recurrent neural networks. Data augmentation by blocks mixing and domain adaptation using a novel method of test mixing are proposed and evaluated in regard to making the method robust to unseen data. The contributions of two kinds of acoustic features (dominant frequency and log mel-band energy) and their combinations are studied in the context of bird audio detection. Our best achieved AUC measure on five cross-validations of the development data is 95.5% and 88.1% on the unseen evaluation data.
M. Malik, S. Adavanne, K. Drossos, T. Virtanen, D. Ticha, and R. Jarina, “Stacked Convolutional and Recurrent Neural Networks for Music Emotion Recognition”, in proceedings of the 14th Sound and Music Computing (SMC) conference, Jul. 5–8, Helsinki, Finland, 2017
This paper studies the emotion recognition from musical tracks in the 2-dimensional valence-arousal (V-A) emotional space. We propose a method based on convolutional (CNN) and recurrent neural networks (RNN), having significantly fewer parameters compared with state-of-the-art method for the same task. We utilize one CNN layer followed by two branches of RNNs trained separately for arousal and valence. The method was evaluated using the “MediaEval2015 emotion in music” dataset. We achieved an RMSE of 0.202 for arousal and 0.268 for valence, which is the best result reported on this dataset.
K. Drossos, M. Kaliakatsos-Papakostas, A. Floros, and T. Virtanen, “On the Impact of the Semantic Content of Sound Events in Emotion Elicitation,” Journal of Audio Engineering Society, Vol. 64, No. 7/8, pp. 525–532, 2016
Sound events are proven to have an impact on the emotions of the listener. Recent works on the field of emotion recognition from sound events show, on one hand, the possibility of automatic emotional information retrieval from sound events and, on the other hand, the need for deeper understanding of the significance of the sound events’ semantic content on listener’s affective state. In this work we present a first, to the best of authors’ knowledge, investigation of the relation between the semantic similarity of the sound events and the elicited emotion. For that cause we use two emotionally annotated sound datasets and the Wu-Palmer semantic similarity measure according to WordNet. Results indicate that the semantic content seems to have a limited role in the conformation of the listener’s affective states. On the contrary, when the semantic content is matched to specific areas in the Arousal-Valence space or also the source’s spatial position is taken into account, it is exhibited that the importance of the semantic content effect is higher, especially for the cases with medium to low valence and medium to high arousal or when the sound source is at the lateral positions of the listener’s head, respectively.
S. Mimilakis, K. Drossos, T. Virtanen, and G. Schuller, “Deep Neural Networks for Dynamic Range Compression in Mastering Applications”, in proceedings of the 140th Audio Engineering Society (AES) Convention, Jul. 4–7, Paris, France, 2016
The process of audio mastering often, if not always, includes various audio signal processing techniques such as frequency equalization and dynamic range compression. With respect to the genre and style of the audio content, the parameters of these techniques are controlled by a mastering engineer, in order to process the original audio material. This operation relies on musical and perceptually pleasing facets of the perceived acoustic characteristics, transmitted from the audio material under the mastering process. Modeling such dynamic operations, which involve adaptation regarding the audio content, becomes vital in automated applications since it significantly affects the overall performance. In this work we present a system capable of modelling such behavior focusing on the automatic dynamic range compression. It predicts frequency coefficients that allow the dynamic range compression, via a trained deep neural network, and applies them to unmastered audio signal served as input. Both dynamic range compression and the prediction of the corresponding frequency coefficients take place inside the time-frequency domain, using magnitude spectra acquired from a critical band filter bank, similar to humans’ peripheral auditory system. Results from conducted listening tests, incorporating professional music producers and audio mastering engineers, demonstrate on average an equivalent performance compared to professionally mastered audio content. Improvements were also observed when compared to relevant and commercial software.
K. Drossos, M. Kaliakatsos-Papakostas, and A. Floros, “Affective Audio Synthesis for Sound Experience Enhancement”, Experimental Multimedia Systems for Interactivity and Strategic Inovation, I. Deliyannis, P. Kostagiolas (Eds), IGI-Global, 2016
With the advances of technology, multimedia tend to be a recurring and prominent component in almost all forms of communication. Although their content spans in various categories, there are two protuberant channels that are used for information conveyance, i.e. audio and visual. The former can transfer numerous content, ranging from low-level characteristics (e.g. spatial location of source and type of sound producing mechanism) to high and contextual (e.g. emotion). Additionally, recent results of published works depict the possibility for automated synthesis of sounds, e.g. music and sound events. Based on the above, in this chapter the authors propose the integration of emotion recognition from sound with automated synthesis techniques. Such a task will enhance, on one hand, the process of computer driven creation of sound content by adding an anthropocentric factor (i.e. emotion) and, on the other, the experience of the multimedia user by offering an extra constituent that will intensify the immersion and the overall user experience level.
K. Drossos, A. Floros, and K. Kermanidis, “Evaluating the Impact of Sound Events’ Rhythm Characteristics to Listener’s Valence”, Journal of Audio Engineering Society, Vol. 63, No. 3, pp. 139–153, 2015
While modern sound researchers generally focus on speech and music, mammalian hearing arose from the need to sense those events in the environment that produced sound waves. Such unorganized sound stimuli, referred to as Sound Events (SEs), can also produce an affective and emotional response. In this research, the investigators explore valence recognition of SEs utilizing rhythm-related acoustics cues. A well-known data set with emotionally annotated SEs was employed; various rhythm-related attributes were then extracted and several machine-learning experiments were conducted. The results portray that the rhythm of a SE can affect the listener’s valence up to an extent and, combined with previous works on SEs, could lead to a comprehensive recognition of the rhythm’s effect on the emotional state of the listener.
K. Drossos, A. Floros, A. Giannakoulopoulos, and N. Kanellopoulos, “Investigating the Impact of Sound Angular Position on the Listener Affective State”, IEEE Transactions on Affective Computing, Vol. 6, No. 1, pp. 27–42, 2015
Emotion recognition from sound signals represents an emerging field of recent research. Although many existingworks focus on emotion recognition from music, there seems to be a relative scarcity of research on emotion recognition fromgeneral sounds. One of the key characteristics of sound events is the sound source spatial position, i.e. the location of the sourcerelatively to the acoustic receiver. Existing studies that aim to investigate the relation of the latter source placement and theelicited emotions are limited to distance, front and back spatial localization and/or specific emotional categories. In this paper we analytically investigate the effect of the source angular position on the listener’s emotional state, modeled in the well–established valence/arousal affective space. Towards this aim, we have developed an annotated sound events dataset using binaural processed versions of the available International Affective Digitized Sound (IADS) sound events library. All subjective affective annotations were obtained using the Self Assessment Manikin (SAM) approach. Preliminary results obtained by processing these annotation scores are likely to indicate a systematic change in the listener affective state as the sound source angular position changes. This trend is more obvious when the sound source is located outside of the visible field of the listener.
K. Drossos, N. Zormpas, G. Giannakopoulos, and A. Floros, “Accessible Games for Blind Children, Empowered by Binaural Sound," in proceedings of the 8th Pervasive Technologies Related to Assistive Environments (PETRA) Conference, Jul. 1–3, Corfu, Greece, 2015
Accessible games have been researched and developed for many years, however, blind people still have very limited access and knowledge of them. This can pose a serious limitation, especially for blind children, since in recent years electronic games have become one of the most common and wide spread means of entertainment and socialization. For our implementation we use binaural technology which allows the player to hear and navigate the game space by adding localization information to the game sounds. With our implementation and user studies we provide insight on what constitutes an accessible game for blind people as well as a functional game engine for such games. The game engine developed allows the quick development of games for the visually impaired. Our work provides a good starting point for future developments on the field and, as the user studies show, was very well perceived by the visually impaired children that tried it.
K. Drossos, A. Floros, and N. Kanellopoulos, “A Loudness-based Adaptive Equalization Technique for Subjectively Improved Sound Reproduction," in proceedings of the Audio Engineering Society (AES) 136th convention, Apr. 26–29, Berlin, Germany, 2014.
Sound equalization is a common approach for objectively or subjectively defining the reproduction level at specific frequency bands. It is also well-known that the human auditory system demonstrates an inner process of sound-weighting. Due to this, the perceived loudness changes with the frequency and the user-defined sound reproduction gain, resulting into a deviation of the intended and the perceived equalization scheme as the sound level changes. In this work we introduce a novel equalization approach that takes into account the above perceptual loudness effect in order to achieve subjectively constant equalization. A series of listening tests shows that the proposed equalization technique is an efficient and listener-preferred alternative for both professional and home audio reproduction applications.
K. Drossos, A. Floros, S. Potirakis, N. Tatlas, and G. Tuna, “A socially-intelligent multirobot service team for in-home monitoring," in proceedings of the 5th IEEE International Conference on Information, Intelligence, Systems and Applications (IISA), Jul. 9–7, Chania, Greece, 2014.
The objective of this study is to develop a socially-intelligent service team comprised of multiple robots with sophisticated sonic interaction capabilities that aims to transparently collaborate towards efficient and robust monitoring by close interaction. In the distributed scenario proposed in this study, the robots share any acoustic data extracted from the environment and act in-sync with the events occurring in their living environment in order to provide potential means for efficient monitoring and decision-making within a typical home enclosure. Although each robot acts as an individual recognizer using a novel emotionally-enriched word recognition system, the final decision is social in nature and is followed by all. Moreover, the social decision stage triggers actions that are algorithmically distributed among the robots' population and enhances the overall approach with the potential advantages of the team work within specific communities through collaboration.
K. Drossos, A. Floros, and A. Giannakoulopoulos, “BEADS: A Dataset of Binaural Emotionally Annotated Digital Sounds," in proceedings of the 5th IEEE International Conference on Information, Intelligence, Systems and Applications (IISA), Jul. 9–7, Chania, Greece, 2014.
Emotion recognition from generalized sounds is an interdisciplinary and emerging field of research. A vital requirement for this kind of investigations is the availability of ground truth datasets. Currently, there are 2 freely available datasets of emotionally annotated sounds, which, however, do not include sound evenets (SEs) with manifestation of the spatial location of the source. The latter is an inherent natural component of SEs, since all sound sources in real-world conditions are physically located and perceived somewhere in the listener's surrounding space. In this work we present a novel emotionally annotated sounds dataset consisting of 32 SEs that are spatially rendered using appropriate binaural processing. All SEs in the dataset are available in 5 spatial positions corresponding to source/receiver angles equal to 0, 45, 90, 135 and 180 degrees. We have used the IADS dataset as the initial collection of SEs prior to binaural processing. The annotation measures obtained for the novel binaural dataset demonstrate a significant accordance with the existing IADS dataset, while small ratings declinations illustrate a perceptual adaptation imposed by the more realistic SEs spatial representation.
M. Kaliakatsos–Papakostas, A. Floros, K. Drossos, K. Koukoudis, M. Kuzalas, and A. Kalantzis, “Swarm Lake: A Game of Swarm Intelligence, Human Interaction and Collaborative Music Composition," in proceedings of the Joint Conference ICMC/SMC 2014, Sep. 14–20, Athens, Greece, 2014.
In this work we aim to combine a game platform with the concept of collaborative music synthesis. We use bio-inspired intelligence for developing a world - the Lake - where multiple tribes of artiﬁcial, autonomous agents live within, having survival as their ultimate goal. The tribes exhibit primitive social swarm-based behavior and intelligence, which is used for taking actions that will potentially allow to dominate the game world. Tribes’ populations also demonstrate a number of physical properties that re-strict their ability to act illimitably. Multiuser interventionis employed in parallel, affecting the automated decisions and the physical parameters of the tribes, thus infusing the gaming orientation of the application context. Finally,sound synthesis is achieved through a complex mapping scheme established between the events occurring in the Lake and the rhythmic, harmonic and dynamic-range parameters of an advanced, collaborative sound composition engine. This complex mapping scheme allows the production of interesting and complicated sonic patterns that fol-low the performance evolution in both objective and conceptual levels. The overall synthesis process is controlled by the conductor, a virtual entity that determines the synthesis evolution in a way that is very similar to directing an ensemble performance in real world.
S. Mimilakis, K. Drossos, A. Floros, and D. Katerelos, “Automated Tonal Balance Enhancement for Audio Mastering Applications”, in proceedings of the 134th Audio Engineering Society Convention, May 4–7, Rome, Italy, 2013
Modern audio mastering procedures are involved with the selective enhancement or attenuation of specific frequency bands. The main reason is the tonal enhancement of the original / unmastered audio material. The aforementioned process is mostly based on the musical information and the mode of the audio material. This information can be retrieved from a listening procedure of the original stimuli, or the correspondent musical key notes. The current work presents an adaptive and automated equalization system that performs the aforementioned mastering procedure, based on a novel method of fundamental frequency tracking. In addition to this, the overall system is being evaluated with objective PEAQ analysis and subjective listening tests in real mastering audio conditions.
K. Drossos, K. Koukoudis, and A. Floros, “Gestural User Interface for Audio Multitrack Real-time Stereo Mixing," in proceedings of the 8th Conference on Interaction with Sound - Audio Mostly 2013, Sep. 18–20, Piteå, Sweden, 2013
Sound mixing is a well-established task applied (directly or indirectly) in many fields of music and sound production. For example, in the case of classical music orchestras, their conductors perform sound mixing by specifying the reproduction gain of specific groups of musical instruments or of the entire orchestra. Moreover, modern sound artists and performers also employ sound mixing when they compose music or improvise in real-time. In this work a system is presented that incorporates a gestural interface for real-time multitrack sound mixing. The proposed gestural sound mixing control scheme is implemented on an open hardware micro-controller board, using common sensor modules. The gestures employed are as close as possible to the ones particularly used by the orchestra conductors. The system overall performance is also evaluated in terms of the achieved user experience through subjective tests.
K. Drossos, R. Kotsakis, P. Pappas, G. Kalliris, and A. Floros, “Investigating Auditory Human-Machine Interaction: Analysis and Classification of Sounds Commonly Used by Consumer Devices”, in proceedings of the 134th Audio Engineering Society Convention, May 4–7, Rome, Italy, 2013
Many common consumer devices use a short sound indication for declaring various modes of their functionality, such as the start and the end of their operation. This is likely to result in an intuitive auditory human-machine interaction, imputing a semantic content to the sounds used. In this paper we investigate sound patterns mapped to "Start" and "End" of operation manifestations and explore the possibility such semantics’ perception to be based either on users’ prior auditory training or on sound patterns that naturally convey appropriate information. To this aim, listening and machine learning tests were conducted. The obtained results indicate a strong relation between acoustic cues and semantics along with no need of prior knowledge for message conveyance.
K. Drossos, R. Kotsakis, G. Kalliris, and A. Floros, “Sound Events and Emotions: Investigating the Relation of Rhythmic Characteristics and Arousal”, in proceedings of the 4th IEEE International Conference on Information, Intelligence, Systems and Applications (IISA 2013), Jul. 10–12, Piraeus, Greece, 2013
A variety of recent researches in Audio Emotion Recognition (AER) outlines high performance and retrieval accuracy results. However, in most works music is considered as the original sound content that conveys the identified emotions. One of the music characteristics that is found to represent a fundamental means for conveying emotions are the rhythm-related acoustic cues. Although music is an important aspect of everyday life, there are numerous non-linguistic and nonmusical sounds surrounding humans, generally defined as sound events (SEs). Despite this enormous impact of SEs to humans, a scarcity of investigations regarding AER from SEs is observed. There are only a few recent investigations concerned with SEs and AER, presenting a semantic connection between the former and the listener's triggered emotion. In this work we analytically investigate the connection of rhythm-related characteristics of a wide range of common SEs with the arousal of the listener using sound events with semantic content. To this aim, several feature evaluation and classification tasks are conducted using different ranking and classification algorithms. High accuracy results are obtained, demonstrating a significant relation of SEs rhythmic characteristics to the elicited arousal.
K. Drossos, A. Floros, and N. Kanellopoulos, “Affective Acoustic Ecology: Towards Emotionally Enhanced Sound Events”, in proceedings of the 7th Conference on Interaction with Sound - Audio Mostly 2012, Sep. 26 – 28, Corfu, Greece, 2012
Sound events can carry multiple information, related to the sound source and to ambient environment. However, it is well-known that sound evokes emotions, a fact that is verified by works in the disciplines of Music Emotion Recognition and Music Information Retrieval that focused on the impact of music to emotions. In this work we introduce the concept of affective acoustic ecology that extends the above relation to the general concept of sound events. Towards this aim, we define sound event as a novel audio structure with multiple components. We further investigate the application of existing emotion models employed for music affective analysis to sonic, non-musical, content. The obtained results indicate that although such application is feasible, no significant trends and classification outcomes are observed that would allow the definition of an analytic relation between the technical characteristics of a sound event waveform and raised emotions.
K. Drossos, A. Floros, K. Agavanakis, N. Tatlas and N. Kanellopoulos, “Emergency Voice/Stress - level Combined Recognition for Intelligent House Applications”, in proceedings of the 132nd Audio Engineering Convention, Apr. 26–29, Budapest, Hungary, 2012
Legacy technologies for word recognition can benefit from emerging affective voice retrieval, potentially leading to intelligent applications for smart houses enhanced with new features. In this work we introduce the implementation of a system, capable to react to common spoken words, taking into account the estimated vocal stress level, thus allowing the realization of a prioritized, affective aural interaction path. Upon the successful word recognition and the corresponding stress level estimation, the system triggers particular affective-prioritized actions, defined within the application scope of an intelligent home environment. Application results show that the established affective interaction path significantly improves the ambient intelligence provided by an affective vocal sensor that can be easily integrated with any sensor-based home monitoring system.
D. Katerelos, K. Drossos, A. Kokkinos, and S. Mimilakis, “iReflectors - Intelligent Reflectors from Composite Materials”, in proceedings of the 6th Greek National Conference Acoustics 2012, Oct. 8–10, Corfu, Greece, 2012
The use of reflectors for the optimal sound diffusion is a major issue in Room Acoustics. Up to now, the applied reflectors are stable, with certain shape and made by conventional materials. In the present is studied the possibility to replace the conventional reflectors by new, manufactured by composite materials. The aim is to design flexible “intelligent” reflectors that will adapt their shape depending on the certain acoustical needs of a room. This change is planned to be actuated using embedded shape memory alloy (SMA) wires. The adaptation process will be controlled automatically by an electronic system. In order to control damage initiation and growth within the composite panel, an optical fibres network will be applied.
E. Kokkinis, K. Drossos, N. Tatlas, A. Floros, A. Tsilfidis and K. Agavanakis, “Smart microphone sensor system platform”, in proceedings of the 132nd Audio Engineering Society Convention, Apr. 26–29, Budapest, Hungary, 2012
A platform for a flexible, smart microphone system using available hardware components is presented. Three subsystems are employed, specifically: (a) a set of digital MEMs microphones, with a one-bit serial output; (b) a preprocessing/digital-to-digital converter; and (c) a CPU/DSP-based embedded system with I2S connectivity. Basic preprocessing functions, such as noise gating and filtering can be performed in the preprocessing stage, while application-specific algorithms such as word spotting, beam-forming, and reverberation suppression can be handled by the embedded system. Widely used high-level operating systems are supported including drivers for a number of peripheral devices. Finally, an employment scenario for a wireless home automation speech activated front-end sensor system using the platform is analyzed.
K. Drossos, S. Mimilakis, A. Floros, and N. Kanellopoulos, “Stereo Goes Mobile: Spatial Enhancement for Short-distance Loudspeaker Setups”, in proceedings of the 8th International Conference on Intelligent Information Hiding and Multimedia Signal Processing (IIHMSP), Jul. 18–20, Piraeus, Greece, 2012
Modern mobile, hand-held devices offer enhanced capabilities for video and sound reproduction. Nevertheless, major restrictions imposed by their limited size render them inconvenient for headset-free stereo sound reproduction, since the corresponding short-distant loudspeakers placement physically narrows the perceived stereo sound localization potential. In this work, we aim at evaluating a spatial enhancement technique for small-size mobile devices. This technique extracts the original panning information from an original stereo recording and spatially extends it using appropriate binaural rendering. A sequence of subjective tests performed shows that the derived spatial perceptual impression is significantly improved in all test cases considered, rendering the proposed technique an attractive approach towards headset-free mobile audio reproduction.
V. Psarras, A. Floros, K. Drossos, and M. Strapatsakis, “Emotional Control and Visual Representation Using Advanced Audiovisual Interaction”, International Journal of Arts and Technology, Vol. 4 (4), 2011, pp. 480-498
Modern interactive means combined with new digital media processing and representation technologies can provide a robust framework for enhancing user experience in multimedia entertainment systems and audiovisual artistic installations with non-traditional interaction/feedback paths based on user affective state. In this work, the ‘Elevator’ interactive audiovisual platform prototype is presented, which aims to provide a framework for signalling and expressing human behaviour related to emotions (such as anger) and finally produce a visual outcome of this behaviour, defined here as the emotional ‘thumbnail’ of the user. Optimised, real-time audio signal processing techniques are employed for monitoring the achieved anger-like behaviour, while the emotional elevation is attempted using appropriately selected combined audio/visual content reproduced using state-of-the-art audiovisual playback technologies that allow the creation of a realistic immersive audiovisual environment. The demonstration of the proposed prototype has shown that affective interaction is possible, allowing the further development of relative artistic and technological applications.
N. Grigoriou, A. Floros, and K. Drossos, “Binaural Mixing Using Gestural Control Interaction”, in proceedings of the 5th Conference on Interaction with Sound - Audio Mostly 2010, Sep. 15–17, Piteå, Sweden, 2010
In this work a novel audio binaural mixing platform is presented which employs advanced gestural-based interaction techniques for controlling the mixing parameters. State-of-the-art binaural technology algorithms are used for producing the final two-channel binaural signal. These algorithms are optimized for realtime operation, able to manipulate high-quality audio (typically 24bit / 96kHz) for an arbitrary number of fixed-position or moving sound sources in closed acoustic enclosures. Simple gestural rules are employed, which aim to provide the complete functionality required for the mixing process, using low cost equipment. It is shown that the proposed platform can be efficiently used for general audio mixing / mastering purposes, providing an attractive alternative to legacy hardware control designs and software-based mixing user interfaces.
P. Vlamos, A. Floros, M. Giannakos, K. Drossos, “Towards an Interactive e-Learning System Based on Emotion and Affective Cognition”, in proceedings of the International Conference on Information Communication Technologies and Education (ICICTE), Jul. 8–10, Corfu, Greece, 2012, pp 367–376
In order to promote a more dynamic and flexible communication between the learner and the system, we present a structure of a new innovative and interactive e-learning system which implements emotion and level of cognition recognition. The system has as inputs the emotional and cognitive state of the user and re-organises the content and adjusts the flow of the course. Our concept aims to increase the learning efficiency of intelligent tutoring systems by using a combination of characteristics, such as content customization and user’s emotion recognition, and adapting all these features into a learner-centered educational system.
T. Mellow, O. Umnova, K. Drossos, K. Holland, A. Flewitt and L. Kärkkäinen, “On The Adsorption - Desorption Relaxation Time Of Carbon In Very Narrow Ducts”, in proceedings of the Acoustics 08 conference, Jun. 29–Jul. 04, Paris, France, 2008
Loudspeakers generally have boxes to prevent rear wave cancellation at low frequencies. However, the stiffness of the air in a small box reduces the diaphragm’s excursion at low frequencies. Hence the box size is generally a compromise between low frequency performance and practicality. Activated carbon has been found to increase the apparent size of a given box through adsorption of the air molecules when the pressure increases and likewise desorption when it decreases. However, the exact viscous effects in the granular structure are difficult to model. Thus it is impossible determine the high frequency limit due to the natural adsorption/desorption relaxation time in the absence of viscous losses. In this study, a tube model is presented which takes into account viscous and thermal losses with boundary slip together with adsorption. Impedance measurements are performed on an array of 12 million holes, each 2 micrometers in diameter, etched in a 0.25 mm thick silicon wafer so that the viscous and thermal losses can be verified against the model without adsorption. Impedance measurements are then performed on an array of holes coated with graphite in order to create an activated carbon- like structure, thus enabling the adsorption/desorption relaxation time to be evaluated.