E. Rovithis, N. Moustakas, K. Vogklis, K. Drossos, and A. Floros, "Design Recommendations for a Collaborative Game of Bird Call Recognition Based on Internet of Sound Practices," in Journal of the Audio Engineering Society, vol. 69 (12), pp. 956-966, 2021, doi:
Citizen Science aims to engage people in research activities on important issues related to their well-being. Smart Cities aim to provide them with services that improve their quality of life. Both concepts have seen significant growth in recent years and can be further enhanced by combining their purposes with Internet of Things technologies that allow for dynamic and large-scale communication and interaction. However, sparking and retaining the interest of participants is a key factor for such initiatives. In this paper we suggest that engagement in Citizen Science projects applied to Smart Cities infrastructure can be enhanced through contextual and structural game elements realized through augmented audio interactive mechanisms. Our interdisciplinary framework is described through the paradigm of a collaborative bird call recognition game, in which users collect and submit audio data that are then classified and used for augmenting physical space. We discuss the Playful Learning, Internet of Audio Things, and Bird Monitoring principles that shaped the design of our paradigm and analyze the design issues of its potential technical implementation.
A. Ferraro, X. Favory, K. Drossos, Y. Kim and D. Bogdanov, "Enriched Music Representations with Multiple Cross-modal Contrastive Learning," in IEEE Signal Processing Letters, doi: 10.1109/LSP.2021.3071082.
Modeling various aspects that make a music piece unique is a challenging task, requiring the combination of multiple sources of information. Deep learning is commonly used to obtain representations using various sources of information, such as the audio, interactions between users and songs, or associated genre metadata. Recently, contrastive learning has led to representations that generalize better compared to traditional supervised methods. In this paper, we present a novel approach that combines multiple types of information related to music using cross-modal contrastive learning, allowing us to learn an audio feature from heterogeneous data simultaneously. We align the latent representations obtained from playlist-track interactions, genre metadata, and the tracks' audio, by maximizing the agreement between these modality representations using a contrastive loss. We evaluate our approach in three tasks, namely, genre classification, playlist continuation, and automatic tagging. We compare the performance with a baseline audio-based CNN trained to predict these modalities. We also study the importance of including multiple sources of information when training our embedding model. The results suggest that the proposed method outperforms the baseline in all three downstream tasks and achieves comparable performance to the state-of-the-art.
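As an illustration of aligning two modality representations with a contrastive loss, the following is a minimal sketch and not the authors' implementation. It assumes an InfoNCE-style loss over cosine similarities; the function names, the temperature value, and the toy embeddings are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(audio_emb, other_emb, temperature=0.1):
    """InfoNCE-style loss (an assumed choice, not taken from the paper).

    Row i of audio_emb and row i of other_emb form a positive pair;
    every other row of other_emb acts as a negative for row i.
    """
    n = len(audio_emb)
    total = 0.0
    for i in range(n):
        logits = [cosine(audio_emb[i], other_emb[j]) / temperature
                  for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / n


# Toy data (made up): two tracks, two-dimensional embeddings.
audio = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]     # same order as audio
misaligned = [[0.0, 1.0], [1.0, 0.0]]  # positives swapped
loss_aligned = contrastive_loss(audio, aligned)
loss_misaligned = contrastive_loss(audio, misaligned)
```

With these toy inputs the aligned pairing yields a much smaller loss than the swapped pairing, which is the behavior that maximizing agreement between modalities relies on.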
P. Sudarsanam, A. Politis, and K. Drossos, "Assessment of Self-Attention on Learned Features For Sound Event Localization and Detection," in Proceedings of the 6th Detection and Classification of Acoustic Scenes and Events (DCASE) Workshop, pp. 100-104, Barcelona, Spain, 2021
Joint sound event localization and detection (SELD) is an emerging audio signal processing task adding spatial dimensions to acoustic scene analysis and sound event detection. A popular approach to modeling SELD jointly is using convolutional recurrent neural network (CRNN) models, where CNNs learn high-level features from multi-channel audio input and the RNNs learn temporal relationships from these high-level features. However, RNNs have some drawbacks, such as a limited capability to model long temporal dependencies and slow training and inference times due to their sequential processing nature. Recently, a few SELD studies used multi-head self-attention (MHSA), among other innovations in their models. MHSA and the related transformer networks have shown state-of-the-art performance in various domains. While they can model long temporal dependencies, they can also be parallelized efficiently. In this paper, we study in detail the effect of MHSA on the SELD task. Specifically, we examined the effects of replacing the RNN blocks with self-attention layers. We studied the influence of stacking multiple self-attention blocks, using multiple attention heads in each self-attention block, and the effect of position embeddings and layer normalization. Evaluation on the DCASE 2021 SELD (task 3) development data set shows a significant improvement in all employed metrics compared to the baseline CRNN accompanying the task.
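To make the self-attention operation concrete, here is a minimal single-head sketch of scaled dot-product self-attention over a sequence of feature frames. It is a simplification under stated assumptions: the query, key, and value projections are taken to be the identity (a real layer learns them), and multiple heads, position embeddings, and layer normalization are omitted. The function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Single-head scaled dot-product self-attention.

    Assumption: identity query/key/value projections. Every output frame
    is a weighted average of all input frames, so each time step can
    draw on the whole sequence in one step, unlike a recurrent layer.
    """
    dim = len(frames[0])
    outputs = []
    for query in frames:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in frames]
        weights = softmax(scores)
        outputs.append([
            sum(weights[t] * frames[t][c] for t in range(len(frames)))
            for c in range(dim)
        ])
    return outputs


# Toy sequence (made up): three frames with two features each.
attended = self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

Because the loop body for each query frame does not depend on the result for any other frame, the per-frame computations can run in parallel, which is the efficiency advantage over sequential RNN processing mentioned above.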
J. Berg and K. Drossos, "Continual Learning for Automated Audio Captioning Using The Learning Without Forgetting Approach," in Proceedings of the 6th Detection and Classification of Acoustic Scenes and Events (DCASE) Workshop, pp. 140-144, Barcelona, Spain, 2021
Automated audio captioning (AAC) is the task of automatically creating textual descriptions (i.e. captions) for the contents of a general audio signal. Most AAC methods use existing datasets for optimization and/or evaluation. Given the limited information held by the AAC datasets, it is very likely that AAC methods learn only the information contained in the utilized datasets. In this paper we present a first approach for continuously adapting an AAC method to new information, using a continual learning method. In our scenario, a pre-optimized AAC method is used for some unseen general audio signals and can update its parameters in order to adapt to the new information, given a new reference caption. We evaluate our method using a freely available, pre-optimized AAC method and two freely available AAC datasets. We compare our proposed method with three scenarios: two of training on one of the datasets and evaluating on the other, and a third of training on one dataset and fine-tuning on the other. Obtained results show that our method achieves a good balance between distilling new knowledge and not forgetting the previous one.
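The balance between adapting to new data and retaining previous knowledge can be sketched as a two-term loss in the spirit of learning without forgetting. This is an illustrative sketch only: the convex weighting with `lam`, the use of a KL divergence as the distillation term, and the function name are assumptions and are not taken from the paper.

```python
import math


def lwf_loss(new_probs, old_probs, target_index, lam=0.5, eps=1e-12):
    """Loss for one predicted word, mixing two terms.

    new_probs:    word distribution from the model being updated
    old_probs:    word distribution from the frozen, pre-optimized model
    target_index: index of the word in the new reference caption
    lam:          hypothetical weight; 0 means plain fine-tuning,
                  1 means only staying close to the old model
    """
    # Term 1: cross-entropy with the new reference caption.
    cross_entropy = -math.log(max(new_probs[target_index], eps))
    # Term 2: distillation, here KL(old || new), which penalizes
    # drifting away from what the old model predicted.
    distillation = sum(
        p * (math.log(p) - math.log(max(q, eps)))
        for p, q in zip(old_probs, new_probs)
        if p > 0.0
    )
    return (1.0 - lam) * cross_entropy + lam * distillation


# Toy distributions (made up) over a three-word vocabulary.
old = [0.7, 0.2, 0.1]
loss_unchanged = lwf_loss(old, old, target_index=0)
loss_drifted = lwf_loss([0.1, 0.2, 0.7], old, target_index=0)
```

When the updated model reproduces the old model's distribution, the distillation term vanishes and only the weighted cross-entropy remains; the further the new distribution drifts, the larger the penalty.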
B. Weck, X. Favory, K. Drossos, and X. Serra, "Evaluating Off-the-Shelf Machine Listening and Natural Language Models for Automated Audio Captioning," in Proceedings of the 6th Detection and Classification of Acoustic Scenes and Events (DCASE) Workshop, pp. 60-64, Barcelona, Spain, 2021
Automated audio captioning (AAC) is the task of automatically generating textual descriptions for general audio signals. A captioning system has to identify various information from the input signal and express it with natural language. Existing works mainly focus on investigating new methods and try to improve their performance measured on existing datasets. Having attracted attention only recently, very few works on AAC study the performance of existing pre-trained audio and natural language processing resources. In this paper, we evaluate the performance of off-the-shelf models with a Transformer-based captioning approach. We utilize the freely available Clotho dataset to compare four different pre-trained machine listening models, four word embedding models, and their combinations in many different settings. Our evaluation suggests that YAMNet combined with BERT embeddings produces the best captions. Moreover, in general, fine-tuning pre-trained word embeddings can lead to better performance. Finally, we show that sequences of audio embeddings can be processed using a Transformer encoder to produce higher-quality captions.
A. Triantafyllopoulos, M. Milling, K. Drossos, and B.-W. Schuller, "Fairness and underspecification in acoustic scene classification: The case for disaggregated evaluations," in Proceedings of the 6th Detection and Classification of Acoustic Scenes and Events (DCASE) Workshop, pp. 70-74, Barcelona, Spain, 2021
Underspecification and fairness in machine learning (ML) applications have recently become two prominent issues in the ML community. Acoustic scene classification (ASC) applications have so far remained unaffected by this discussion, but are now increasingly used in real-world systems where fairness and reliability are critical aspects. In this work, we argue for a more holistic evaluation process for ASC models through disaggregated evaluations. This entails taking into account performance differences across several factors, such as city, location, and recording device. Although these factors play a well-understood role in the performance of ASC models, most works report single evaluation metrics taking into account all different strata of a particular dataset. We argue that metrics computed on specific sub-populations of the underlying data contain valuable information about the expected real-world behaviour of proposed systems, and their reporting could improve the transparency and trustability of such systems. We demonstrate the effectiveness of the proposed evaluation process in uncovering underspecification and fairness problems exhibited by several standard ML architectures when trained on two widely-used ASC datasets. Our evaluation shows that all examined architectures exhibit large biases across all factors taken into consideration, and in particular with respect to the recording location. Additionally, different architectures exhibit different biases even though they are trained with the same experimental configurations.
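A disaggregated evaluation of the kind argued for above can be sketched as computing a metric per stratum instead of once over the whole test set. The record layout, field names, and toy data below are hypothetical and serve only to illustrate the idea.

```python
from collections import defaultdict


def disaggregated_accuracy(records, factor):
    """Accuracy computed separately for each value of `factor`.

    Each record is a dict with "label" and "prediction" keys plus one
    key per factor (for example "city" or "device").
    """
    hits = defaultdict(int)
    counts = defaultdict(int)
    for record in records:
        group = record[factor]
        counts[group] += 1
        hits[group] += int(record["label"] == record["prediction"])
    return {group: hits[group] / counts[group] for group in counts}


# Toy records (made up): four clips from two recording devices.
records = [
    {"device": "A", "label": "park", "prediction": "park"},
    {"device": "A", "label": "tram", "prediction": "tram"},
    {"device": "B", "label": "park", "prediction": "tram"},
    {"device": "B", "label": "tram", "prediction": "tram"},
]
overall = sum(r["label"] == r["prediction"] for r in records) / len(records)
per_device = disaggregated_accuracy(records, "device")
```

In this toy case the single aggregate accuracy is 0.75, while the per-device view shows 1.0 for device A and 0.5 for device B, a gap that the aggregate number hides.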
X. Favory, K. Drossos, T. Virtanen, and X. Serra, "Learning Contextual Tag Embeddings for Cross-Modal Alignment of Audio and Tags," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Jun. 6-11, Toronto, Canada, 2021
Self-supervised audio representation learning offers an attractive alternative for obtaining generic audio embeddings that can be employed in various downstream tasks. Published approaches that consider both audio and words/tags associated with audio do not employ text processing models that are capable of generalizing to tags unknown during training. In this work we propose a method for learning audio representations using an audio autoencoder (AAE), a general word embeddings model (WEM), and a multi-head self-attention (MHA) mechanism. MHA attends on the output of the WEM, providing a contextualized representation of the tags associated with the audio, and we align the output of MHA with the output of the encoder of AAE using a contrastive loss. We jointly optimize AAE and MHA and we evaluate the audio representations (i.e. the output of the encoder of AAE) by utilizing them in three different downstream tasks, namely sound, music genre, and music instrument classification. Our results show that employing multi-head self-attention with multiple heads in the tag-based network can induce better learned audio representations.
B.-W. Schuller, T. Virtanen, M. Riveiro, G. Rizos, J. Han, A. Mesaros, and K. Drossos, "Towards Sonification in Multimodal and User-friendly Explainable Artificial Intelligence," in Proceedings of the 23rd ACM International Conference on Multimodal Interaction, Oct 18-22, Montreal, Canada, 2021
We are largely used to hearing explanations. For example, if someone thinks you are sad today, they might reply to your “why?” with “because you were so Hmmmmm-mmm-mmm”. Today’s Artificial Intelligence (AI), however, is – if at all – largely providing explanations of decisions in a visual or textual manner. While such approaches are good for communication via visual media such as in research papers or screens of intelligent devices, they may not always be the best way to explain, especially when the end user is not an expert. In particular, when the AI’s task is about Audio Intelligence, visual explanations appear less intuitive than audible, sonified ones. Sonification also has great potential for explainable AI (XAI) in systems that deal with non-audio data – for example, because it does not require visual contact or active attention of a user. Hence, sonified explanations of AI decisions face a challenging, yet highly promising and pioneering task. That involves incorporating innovative XAI algorithms to allow pointing back at the learning data responsible for decisions made by an AI, and to include decomposition of the data to identify salient aspects. It further aims to identify the components of the preprocessing, feature representation, and learnt attention patterns that are responsible for the decisions. Finally, it targets decision-making at the model level, to provide a holistic explanation of the chain of processing in typical pattern recognition problems from end-to-end. Sonified AI explanations will need to unite methods for sonification of the identified aspects that benefit decisions, decomposition and recomposition of audio to sonify which parts in the audio were responsible for the decision, and rendering attention patterns and salient feature representations audible. Benchmarking sonified XAI is challenging, as it will require a comparison against a backdrop of existing, state-of-the-art visual and textual alternatives, as well as synergistic complementation of all modalities in user evaluations. Sonified AI explanations will need to target different user groups to allow personalisation of the sonification experience for different user needs, to lead to a major breakthrough in comprehensibility of AI via hearing how decisions are made, hence supporting tomorrow’s humane AI’s trustability. Here, we introduce and motivate the general idea, and provide accompanying considerations including milestones of realisation of sonified XAI and foreseeable risks.
A. Tran, K. Drossos, and T. Virtanen, "WaveTransformer: An Architecture for Audio Captioning Based on Learning Temporal and Time-Frequency Information," in Proceedings of the 29th European Signal Processing Conference (EUSIPCO), Aug. 23-27, Dublin, Ireland, 2021
Automated audio captioning (AAC) is a novel task, where a method takes as an input an audio sample and outputs a textual description (i.e. a caption) of its contents. Most AAC methods are adapted from image captioning or machine translation fields. In this work, we present a novel AAC method, explicitly focused on the exploitation of the temporal and time-frequency patterns in audio. We employ three learnable processes for audio encoding, two for extracting the local and temporal information, and one to merge the output of the previous two processes. To generate the caption, we employ the widely used Transformer decoder. We assess our method utilizing the freely available splits of the Clotho dataset. Our results increase the highest previously reported SPIDEr score from 16.2 to 17.3.