
Publications


2026
Beyond Omnidirectional: Neural Ambisonics Encoding for Arbitrary Microphone Directivity Patterns using Cross-Attention

M. Heikkinen, A. Politis, K. Drossos, and T. Virtanen, "Beyond Omnidirectional: Neural Ambisonics Encoding for Arbitrary Microphone Directivity Patterns using Cross-Attention," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2026.

We present a deep neural network approach for encoding microphone array signals into Ambisonics that generalizes to arbitrary microphone array configurations with fixed microphone count but varying locations and frequency-dependent directional characteristics. Unlike previous methods that rely only on array geometry as metadata, our approach uses directional array transfer functions, enabling accurate characterization of real-world arrays. The proposed architecture employs separate encoders for audio and directional responses, combining them through cross-attention mechanisms to generate array-independent spatial audio representations. We evaluate the method on simulated data in two settings: a mobile phone with complex body scattering, and a free-field condition, both with varying numbers of sound sources in reverberant environments. Evaluations demonstrate that our approach outperforms both conventional digital signal processing-based methods and existing deep neural network solutions. Furthermore, using array transfer functions instead of geometry as metadata input improves accuracy on realistic arrays.
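The core fusion idea described above — audio features attending over encoded directional array responses to produce array-independent representations — can be illustrated with a minimal, dependency-free sketch of standard scaled dot-product cross-attention. This is not the paper's architecture; all names and dimensions here are illustrative assumptions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(queries, keys, values):
    """Each query (e.g. an audio-encoder frame) attends over key/value
    pairs (e.g. tokens from a directional-response encoder); the output
    is a weighted sum of values, with the usual 1/sqrt(d) scaling."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [dot(q, k) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

# Toy example: 2 audio-frame queries, 3 array-response tokens, dim 2
audio_q = [[1.0, 0.0], [0.0, 1.0]]
atf_k = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
atf_v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
fused = cross_attention(audio_q, atf_k, atf_v)
```

Because the attention weights form a convex combination, each fused output stays within the range spanned by the value vectors; in a real system both streams would come from learned encoders and the output would feed the Ambisonics decoder head.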

Discriminating Real And Synthetic Super-Resolved Audio Samples Using Embedding-Based Classifiers

M. Silaev, K. Drossos, and T. Virtanen, "Discriminating Real And Synthetic Super-Resolved Audio Samples Using Embedding-Based Classifiers," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2026.

Generative adversarial networks (GANs) and diffusion models have recently achieved state-of-the-art performance in audio super-resolution (ADSR), producing perceptually convincing wideband audio from narrowband inputs. However, existing evaluations primarily rely on signal-level or perceptual metrics, leaving open the question of how closely the distributions of synthetic super-resolved and real wideband audio match. Here we address this problem by analyzing the separability of real and super-resolved audio in various embedding spaces. We consider both middle-band (4→16 kHz) and full-band (16→48 kHz) upsampling tasks for speech and music, training linear classifiers to distinguish real from synthetic samples based on multiple types of audio embeddings. Comparisons with objective metrics and subjective listening tests reveal that embedding-based classifiers achieve near-perfect separation, even when the generated audio attains high perceptual quality and state-of-the-art metric scores. This behavior is consistent across datasets and models, including recent diffusion-based approaches, highlighting a persistent gap between perceptual quality and true distributional fidelity in ADSR models.
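The evaluation protocol above — fit a linear probe on embeddings and check whether real and synthetic samples separate — can be sketched in a few lines. This is a toy illustration, not the paper's setup: the 2-D "embeddings" and cluster positions are invented, and a real study would use embeddings from pretrained audio models.

```python
import math
import random

def train_linear_probe(X, y, lr=0.1, epochs=200):
    """Logistic-regression probe trained by plain gradient descent:
    p = sigmoid(w.x + b), minimizing cross-entropy on labels y in {0, 1}."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - yi  # gradient of cross-entropy w.r.t. z
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0

# Toy "embeddings": real samples cluster near (1, 1), synthetic near (-1, -1)
random.seed(0)
real = [[1 + random.gauss(0, 0.2), 1 + random.gauss(0, 0.2)] for _ in range(50)]
fake = [[-1 + random.gauss(0, 0.2), -1 + random.gauss(0, 0.2)] for _ in range(50)]
X = real + fake
y = [1] * 50 + [0] * 50

w, b = train_linear_probe(X, y)
acc = sum(predict(w, b, x) == t for x, t in zip(X, y)) / len(X)
```

If a probe this simple reaches near-perfect accuracy on held-out embeddings, the two distributions are linearly separable in that space — which is exactly the distributional gap the paper reports even for perceptually convincing ADSR output.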
