Publications

Beyond Omnidirectional: Neural Ambisonics Encoding for Arbitrary Microphone Directivity Patterns using Cross-Attention [Conference]

M. Heikkinen, A. Politis, K. Drossos, and T. Virtanen, "Beyond Omnidirectional: Neural Ambisonics Encoding for Arbitrary Microphone Directivity Patterns using Cross-Attention," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2026

We present a deep neural network approach for encoding microphone array signals into Ambisonics that generalizes to arbitrary microphone array configurations with fixed microphone count but varying locations and frequency-dependent directional characteristics. Unlike previous methods that rely only on array geometry as metadata, our approach uses directional array transfer functions, enabling accurate characterization of real-world arrays. The proposed architecture employs separate encoders for audio and directional responses, combining them through cross-attention mechanisms to generate array-independent spatial audio representations. We evaluate the method on simulated data in two settings: a mobile phone with complex body scattering, and a free-field condition, both with varying numbers of sound sources in reverberant environments. Evaluations demonstrate that our approach outperforms both conventional digital signal processing-based methods and existing deep neural network solutions. Furthermore, using array transfer functions instead of geometry as metadata input improves accuracy on realistic arrays.
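As a rough illustration of the cross-attention fusion mentioned in the abstract above, the following minimal NumPy sketch lets per-frame audio features attend over per-direction array-response features. It is not the architecture from the paper: the single head, the choice of audio features as queries, the feature sizes, and every name (`cross_attention`, `audio_feats`, `atf_feats`, `w_q`, `w_k`, `w_v`) are assumptions made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(audio_feats, atf_feats, w_q, w_k, w_v):
    """Single-head cross-attention (illustrative assumption, not the paper's model).

    audio_feats: (T, d_audio) features from a hypothetical audio encoder.
    atf_feats:   (D, d_atf) features from a hypothetical directional-response encoder,
                 one row per sampled direction.
    Returns the fused features (T, d_model) and the attention weights (T, D).
    """
    q = audio_feats @ w_q                      # queries come from the audio path
    k = atf_feats @ w_k                        # keys come from the directional responses
    v = atf_feats @ w_v                        # values come from the directional responses
    scores = q @ k.T / np.sqrt(q.shape[-1])    # scaled dot-product similarity
    weights = softmax(scores, axis=-1)         # each audio frame distributes weight over directions
    return weights @ v, weights


# Toy dimensions, chosen arbitrarily for the sketch.
rng = np.random.default_rng(0)
T, D, d_audio, d_atf, d_model = 10, 36, 16, 8, 32
audio_feats = rng.standard_normal((T, d_audio))
atf_feats = rng.standard_normal((D, d_atf))
w_q = rng.standard_normal((d_audio, d_model)) / np.sqrt(d_audio)
w_k = rng.standard_normal((d_atf, d_model)) / np.sqrt(d_atf)
w_v = rng.standard_normal((d_atf, d_model)) / np.sqrt(d_atf)

fused, weights = cross_attention(audio_feats, atf_feats, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

In this sketch the fused output has one row per audio frame regardless of how many directions describe the array, which is one way separately encoded audio and array metadata can be combined.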

https://arxiv.org/abs/2601.23196

Paper (.pdf)
Updated: 11-03-2026 08:25 - Size: 1.79 MB

BibTeX record (.bib)
Updated: 11-03-2026 08:25 - Size: 401 B

Method and apparatus for training and using a microphone geometry assisted encoder model to generate spatial audio signals [Patents]

M. O. Heikkinen, K. Drossos, A. Politis, and T. Virtanen, “Method and apparatus for training and using a microphone geometry assisted encoder model to generate spatial audio signals,” U.S. Patent US20260065918A1, filed Aug. 28, 2025; published Mar. 05, 2026

A system for training a microphone geometry assisted encoder model and then utilizing the trained model to generate spatial audio signals that have been captured by a plurality of microphones. In a method for generating spatial audio signals, the method includes receiving geometry data related to a plurality of microphones of an audio capturing device and audio signal data captured by the plurality of microphones. The method also includes generating a spatial audio signal based on an output of a trained microphone geometry assisted encoder model. The trained microphone geometry assisted encoder model includes a geometry encoder configured to encode the geometry data and a signal encoder configured to encode the audio signal data. The trained microphone geometry assisted encoder model further includes a signal decoder having a plurality of layers and configured to generate the output upon which the spatial audio signal is based.

https://patents.google.com/patent/US20260065918A1/en

Gen-A: Generalizing Ambisonics Neural Encoding to Unseen Microphone Arrays [Conference]

M. Heikkinen, A. Politis, K. Drossos, and T. Virtanen, "Gen-A: Generalizing Ambisonics Neural Encoding to Unseen Microphone Arrays," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 2025

Using deep neural networks (DNNs) for encoding of microphone array (MA) signals to the Ambisonics spatial audio format can surpass certain limitations of established conventional methods, but existing DNN-based methods need to be trained separately for each MA. This paper proposes a DNN-based method for Ambisonics encoding that can generalize to arbitrary MA geometries unseen during training. The method takes as inputs the MA geometry and MA signals and uses a multi-level encoder consisting of separate paths for geometry and signal data, where geometry features inform the signal encoder at each level. The method is validated in simulated anechoic and reverberant conditions with one and two sources. The results indicate improvement over conventional encoding across the whole frequency range for dry scenes, while for reverberant scenes the improvement is frequency-dependent.
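For readers unfamiliar with the target format, the sketch below shows what an ideal first-order Ambisonics signal looks like for a single plane-wave source, using the common ACN channel order (W, Y, Z, X) with SN3D normalization. It illustrates the output format only and is not the encoder proposed in the paper; the function names `foa_gains` and `encode_foa`, the test tone, and the sample rate are assumptions chosen for the example.

```python
import numpy as np


def foa_gains(azimuth, elevation):
    """First-order Ambisonics gains, ACN order (W, Y, Z, X), SN3D normalization.

    Angles are in radians; azimuth is measured counter-clockwise from the front.
    """
    return np.array([
        1.0,                                    # W: omnidirectional component
        np.sin(azimuth) * np.cos(elevation),    # Y: left-right
        np.sin(elevation),                      # Z: up-down
        np.cos(azimuth) * np.cos(elevation),    # X: front-back
    ])


def encode_foa(signal, azimuth, elevation):
    """Encode a mono signal as a plane wave from the given direction: (4, num_samples)."""
    return np.outer(foa_gains(azimuth, elevation), signal)


# Hypothetical test tone: 440 Hz sine, 480 samples at 48 kHz.
signal = np.sin(2 * np.pi * 440 * np.arange(480) / 48000)

# Source directly to the left (azimuth 90 degrees, elevation 0).
foa = encode_foa(signal, np.deg2rad(90.0), 0.0)
print(foa.shape)
```

For a source directly to the left, the W and Y channels carry the signal while Z and X are (numerically) silent. A microphone-array encoder, conventional or neural, aims to approximate such channels from the array signals.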

https://ieeexplore.ieee.org/document/10887869

Paper (.pdf)
Updated: 21-09-2025 17:04 - Size: 572.69 KB

BibTeX record (.bib)
Updated: 21-09-2025 17:04 - Size: 398 B

Konstantinos Drossos®
