E. Rovithis, N. Moustakas, K. Vogklis, K. Drossos, and A. Floros, "Design Recommendations for a Collaborative Game of Bird Call Recognition Based on Internet of Sound Practices," in Journal of the Audio Engineering Society, vol. 69 (12), pp. 956-966, 2021, doi:
Citizen Science aims to engage people in research activities on important issues related to their well-being. Smart Cities aim to provide them with services that improve their quality of life. Both concepts have seen significant growth in recent years and can be further enhanced by combining their purposes with Internet of Things technologies that allow for dynamic and large-scale communication and interaction. However, sparking and retaining the interest of participants is a key factor for such initiatives. In this paper we suggest that engagement in Citizen Science projects applied to Smart Cities infrastructure can be enhanced through contextual and structural game elements realized through augmented audio interactive mechanisms. Our interdisciplinary framework is described through the paradigm of a collaborative bird call recognition game, in which users collect and submit audio data that are then classified and used for augmenting physical space. We discuss the Playful Learning, Internet of Audio Things, and Bird Monitoring principles that shaped the design of our paradigm and analyze the design issues of its potential technical implementation.
A. Ferraro, X. Favory, K. Drossos, Y. Kim and D. Bogdanov, "Enriched Music Representations with Multiple Cross-modal Contrastive Learning," in IEEE Signal Processing Letters, doi: 10.1109/LSP.2021.3071082.
Modeling the various aspects that make a music piece unique is a challenging task, requiring the combination of multiple sources of information. Deep learning is commonly used to obtain representations using various sources of information, such as the audio, interactions between users and songs, or associated genre metadata. Recently, contrastive learning has led to representations that generalize better compared to traditional supervised methods. In this paper, we present a novel approach that combines multiple types of information related to music using cross-modal contrastive learning, allowing us to learn an audio feature from heterogeneous data simultaneously. We align the latent representations obtained from playlist-track interactions, genre metadata, and the tracks' audio by maximizing the agreement between these modality representations using a contrastive loss. We evaluate our approach in three tasks, namely genre classification, playlist continuation, and automatic tagging. We compare its performance with that of a baseline audio-based CNN trained to predict these modalities. We also study the importance of including multiple sources of information when training our embedding model. The results suggest that the proposed method outperforms the baseline in all three downstream tasks and achieves performance comparable to the state of the art.
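The alignment step described in the abstract above can be illustrated with a minimal sketch. It assumes a symmetric InfoNCE-style loss with a temperature parameter and a simple sum over the audio/playlist and audio/genre pairs; the function names `info_nce` and `multi_modal_loss`, the temperature value, and the way the pairs are combined are illustrative assumptions, not the loss used in the paper.

```python
import numpy as np


def info_nce(a, b, temperature=0.1):
    """Symmetric InfoNCE-style loss between two batches of embeddings.

    Assumption: row i of `a` and row i of `b` describe the same track,
    and every other row in the batch acts as a negative example.
    """
    # L2-normalise so the dot product is a cosine similarity.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # shape (N, N)

    def cross_entropy_on_diagonal(l):
        # Numerically stable log-softmax over each row.
        l = l - l.max(axis=1, keepdims=True)
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        # The matching pair sits on the diagonal.
        return -np.mean(np.diag(log_prob))

    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))


def multi_modal_loss(audio, playlist, genre, temperature=0.1):
    """Hypothetical combination: align audio with each other modality."""
    return (info_nce(audio, playlist, temperature)
            + info_nce(audio, genre, temperature))
```

Minimising such a loss pulls the embeddings of the same track together across modalities and pushes embeddings of different tracks apart, which is the "maximizing the agreement" idea in the abstract.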
Stylianos Ioannis Mimilakis, Konstantinos Drossos, Estefanía Cano, and Gerald Schuller, “Examining the Mapping Functions of Denoising Autoencoders in Music Source Separation,” in IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 28, pp. 262-278, 2019.
The goal of this work is to investigate what singing voice separation approaches based on neural networks learn from the data. We examine the mapping functions of neural networks based on the denoising autoencoder (DAE) model that are conditioned on the mixture magnitude spectra. To approximate the mapping functions, we propose an algorithm inspired by knowledge distillation, denoted the neural couplings algorithm (NCA). The NCA yields a matrix that expresses the mapping of the mixture to the target source magnitude information. Using the NCA, we examine the mapping functions of three fundamental DAE-based models in music source separation: one with a single-layer encoder and decoder, one with a multi-layer encoder and a single-layer decoder, and one using skip-filtering connections (SF) with a single-layer encoder and decoder. We first train these models with realistic data to estimate the singing voice magnitude spectra from the corresponding mixture. We then use the optimized models and test spectral data as input to the NCA. Our experimental findings show that approaches based on the DAE model learn scalar filtering operators, exhibiting a predominantly diagonal structure in their corresponding mapping functions, which limits the exploitation of the inter-frequency structure of music data. In contrast, skip-filtering connections are shown to assist the DAE model in learning filtering operators that exploit richer inter-frequency structures.
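To make the notion of a "predominantly diagonal" mapping concrete, the sketch below measures how much of a mapping matrix's energy lies on its main diagonal. This is not the NCA itself; `diagonal_energy_ratio` is a hypothetical diagnostic, and the example matrices are invented for illustration.

```python
import numpy as np


def diagonal_energy_ratio(mapping):
    """Fraction of the squared-magnitude energy on the main diagonal.

    A value near 1 means each output frequency bin depends almost only
    on the same input bin (a per-bin scalar filter); lower values mean
    the mapping mixes information across frequency bins.
    """
    m = np.asarray(mapping, dtype=float)
    return np.sum(np.diag(m) ** 2) / np.sum(m ** 2)


# A purely diagonal mapping scales each frequency bin independently.
scalar_filter = np.diag([0.9, 0.5, 0.1])
mixture_frame = np.array([1.0, 2.0, 3.0])
estimate = scalar_filter @ mixture_frame  # element-wise scaling per bin

# A dense mapping lets every output bin draw on every input bin.
dense_mapping = np.ones((3, 3))
```

Under these assumptions, `diagonal_energy_ratio(scalar_filter)` is 1, while the dense matrix of ones gives one third, since only three of its nine equal entries are on the diagonal.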
Konstantinos Drossos, Maximos Kaliakatsos-Papakostas, Andreas Floros, and Tuomas Virtanen, “On the Impact of the Semantic Content of Sound Events in Emotion Elicitation,” Journal of the Audio Engineering Society, Vol. 64, No. 7/8, pp. 525–532, 2016
Sound events are proven to have an impact on the emotions of the listener. Recent works in the field of emotion recognition from sound events show, on the one hand, the possibility of automatic emotional information retrieval from sound events and, on the other hand, the need for a deeper understanding of the significance of the sound events’ semantic content for the listener’s affective state. In this work we present a first, to the best of the authors’ knowledge, investigation of the relation between the semantic similarity of sound events and the elicited emotion. To that end, we use two emotionally annotated sound datasets and the Wu-Palmer semantic similarity measure according to WordNet. Results indicate that the semantic content seems to have a limited role in shaping the listener’s affective states. In contrast, when the semantic content is matched to specific areas in the Arousal-Valence space, or when the source’s spatial position is also taken into account, the effect of the semantic content is stronger, especially for cases with medium to low valence and medium to high arousal, or when the sound source is at the lateral positions of the listener’s head, respectively.
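For readers unfamiliar with the Wu-Palmer measure mentioned above, it scores two concepts by the depth of their deepest shared ancestor relative to their own depths: 2 * depth(lcs) / (depth(a) + depth(b)). The sketch below applies that formula to a tiny hand-made taxonomy. The `PARENT` table and its labels are invented for illustration and are not WordNet; depth is counted with the root at depth 1.

```python
# Toy single-parent taxonomy (an assumption; WordNet is far larger
# and allows multiple parents per concept).
PARENT = {
    "entity": None,
    "animal": "entity",
    "dog": "animal",
    "cat": "animal",
    "vehicle": "entity",
    "car": "vehicle",
}


def path_to_root(node):
    """Return the chain of nodes from `node` up to the root."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(node):
    """Depth of a node, counting the root as depth 1."""
    return len(path_to_root(node))


def wu_palmer(a, b):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_b = set(path_to_root(b))
    # Walking up from `a`, the first node also above `b` is the
    # deepest common ancestor (least common subsumer).
    lcs = next(n for n in path_to_root(a) if n in ancestors_b)
    return 2 * depth(lcs) / (depth(a) + depth(b))
```

In this toy taxonomy, "dog" and "cat" share the ancestor "animal" and score 2/3, whereas "dog" and "car" only share the root and score 1/3.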
Konstantinos Drossos, Andreas Floros, and Katia Kermanidis, “Evaluating the Impact of Sound Events’ Rhythm Characteristics to Listener’s Valence”, Journal of the Audio Engineering Society, Vol. 63, No. 3, pp. 139–153, 2015
While modern sound researchers generally focus on speech and music, mammalian hearing arose from the need to sense those events in the environment that produced sound waves. Such unorganized sound stimuli, referred to as Sound Events (SEs), can also produce an affective and emotional response. In this research, the investigators explore valence recognition of SEs utilizing rhythm-related acoustic cues. A well-known data set with emotionally annotated SEs was employed; various rhythm-related attributes were then extracted and several machine-learning experiments were conducted. The results indicate that the rhythm of an SE can affect the listener’s valence to some extent and, combined with previous works on SEs, could lead to a comprehensive recognition of the rhythm’s effect on the emotional state of the listener.
Konstantinos Drossos, Andreas Floros, Andreas Giannakoulopoulos, and Nikolaos Kanellopoulos, “Investigating the Impact of Sound Angular Position on the Listener Affective State”, IEEE Transactions on Affective Computing, Vol. 6, No. 1, pp. 27–42, 2015
Emotion recognition from sound signals represents an emerging field of recent research. Although many existing works focus on emotion recognition from music, there seems to be a relative scarcity of research on emotion recognition from general sounds. One of the key characteristics of sound events is the sound source spatial position, i.e. the location of the source relative to the acoustic receiver. Existing studies that aim to investigate the relation between source placement and the elicited emotions are limited to distance, front and back spatial localization and/or specific emotional categories. In this paper we analytically investigate the effect of the source angular position on the listener’s emotional state, modeled in the well-established valence/arousal affective space. Towards this aim, we have developed an annotated sound events dataset using binaurally processed versions of the available International Affective Digitized Sound (IADS) sound events library. All subjective affective annotations were obtained using the Self Assessment Manikin (SAM) approach. Preliminary results obtained by processing these annotation scores suggest a systematic change in the listener's affective state as the sound source angular position changes. This trend is more obvious when the sound source is located outside the listener's visual field.
Vassilis Psarras, Andreas Floros, Konstantinos Drossos, and Marianne Strapatsakis, “Emotional Control and Visual Representation Using Advanced Audiovisual Interaction”, International Journal of Arts and Technology, Vol. 4 (4), 2011, pp. 480-498
Modern interactive means combined with new digital media processing and representation technologies can provide a robust framework for enhancing user experience in multimedia entertainment systems and audiovisual artistic installations with non-traditional interaction/feedback paths based on the user's affective state. In this work, the ‘Elevator’ interactive audiovisual platform prototype is presented, which aims to provide a framework for signalling and expressing human behaviour related to emotions (such as anger) and finally produce a visual outcome of this behaviour, defined here as the emotional ‘thumbnail’ of the user. Optimised, real-time audio signal processing techniques are employed for monitoring the achieved anger-like behaviour, while the emotional elevation is attempted using appropriately selected combined audio/visual content reproduced using state-of-the-art audiovisual playback technologies that allow the creation of a realistic immersive audiovisual environment. The demonstration of the proposed prototype has shown that affective interaction is possible, allowing the further development of related artistic and technological applications.