
Publications


2026
Beyond Omnidirectional: Neural Ambisonics Encoding for Arbitrary Microphone Directivity Patterns using Cross-Attention

M. Heikkinen, A. Politis, K. Drossos, and T. Virtanen, "Beyond Omnidirectional: Neural Ambisonics Encoding for Arbitrary Microphone Directivity Patterns using Cross-Attention," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2026.

We present a deep neural network approach for encoding microphone array signals into Ambisonics that generalizes to arbitrary microphone array configurations with fixed microphone count but varying locations and frequency-dependent directional characteristics. Unlike previous methods that rely only on array geometry as metadata, our approach uses directional array transfer functions, enabling accurate characterization of real-world arrays. The proposed architecture employs separate encoders for audio and directional responses, combining them through cross-attention mechanisms to generate array-independent spatial audio representations. We evaluate the method on simulated data in two settings: a mobile phone with complex body scattering, and a free-field condition, both with varying numbers of sound sources in reverberant environments. Evaluations demonstrate that our approach outperforms both conventional digital signal processing-based methods and existing deep neural network solutions. Furthermore, using array transfer functions instead of geometry as metadata input improves accuracy on realistic arrays.
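The core fusion idea described above — audio features attending over encoded directional array responses to produce array-independent representations — can be illustrated with a minimal, dependency-free sketch of standard scaled dot-product cross-attention. This is not the paper's architecture; all names and dimensions here are illustrative assumptions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(queries, keys, values):
    """Each query (e.g. an audio-encoder frame) attends over key/value
    pairs (e.g. tokens from a directional-response encoder); the output
    is a weighted sum of values, with the usual 1/sqrt(d) scaling."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [dot(q, k) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

# Toy example: 2 audio-frame queries, 3 array-response tokens, dim 2
audio_q = [[1.0, 0.0], [0.0, 1.0]]
atf_k = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
atf_v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
fused = cross_attention(audio_q, atf_k, atf_v)
```

Because the attention weights form a convex combination, each fused output stays within the range spanned by the value vectors; in a real system both streams would come from learned encoders and the output would feed the Ambisonics decoder head.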

Discriminating Real And Synthetic Super-Resolved Audio Samples Using Embedding-Based Classifiers

M. Silaev, K. Drossos, and T. Virtanen, "Discriminating Real And Synthetic Super-Resolved Audio Samples Using Embedding-Based Classifiers," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2026.

Generative adversarial networks (GANs) and diffusion models have recently achieved state-of-the-art performance in audio super-resolution (ADSR), producing perceptually convincing wideband audio from narrowband inputs. However, existing evaluations primarily rely on signal-level or perceptual metrics, leaving open the question of how closely the distributions of synthetic super-resolved and real wideband audio match. Here we address this problem by analyzing the separability of real and super-resolved audio in various embedding spaces. We consider both middle-band (4→16 kHz) and full-band (16→48 kHz) upsampling tasks for speech and music, training linear classifiers to distinguish real from synthetic samples based on multiple types of audio embeddings. Comparisons with objective metrics and subjective listening tests reveal that embedding-based classifiers achieve near-perfect separation, even when the generated audio attains high perceptual quality and state-of-the-art metric scores. This behavior is consistent across datasets and models, including recent diffusion-based approaches, highlighting a persistent gap between perceptual quality and true distributional fidelity in ADSR models.
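The evaluation protocol above — fit a linear probe on embeddings and check whether real and synthetic samples separate — can be sketched in a few lines. This is a toy illustration, not the paper's setup: the 2-D "embeddings" and cluster positions are invented, and a real study would use embeddings from pretrained audio models.

```python
import math
import random

def train_linear_probe(X, y, lr=0.1, epochs=200):
    """Logistic-regression probe trained by plain gradient descent:
    p = sigmoid(w.x + b), minimizing cross-entropy on labels y in {0, 1}."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - yi  # gradient of cross-entropy w.r.t. z
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0

# Toy "embeddings": real samples cluster near (1, 1), synthetic near (-1, -1)
random.seed(0)
real = [[1 + random.gauss(0, 0.2), 1 + random.gauss(0, 0.2)] for _ in range(50)]
fake = [[-1 + random.gauss(0, 0.2), -1 + random.gauss(0, 0.2)] for _ in range(50)]
X = real + fake
y = [1] * 50 + [0] * 50

w, b = train_linear_probe(X, y)
acc = sum(predict(w, b, x) == t for x, t in zip(X, y)) / len(X)
```

If a probe this simple reaches near-perfect accuracy on held-out embeddings, the two distributions are linearly separable in that space — which is exactly the distributional gap the paper reports even for perceptually convincing ADSR output.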
