Publications


2026
Moving Speaker Separation Via Parallel Spectral-Spatial Processing [Journal]

Y. Wang, A. Politis, K. Drossos, and T. Virtanen, "Moving Speaker Separation Via Parallel Spectral-Spatial Processing," IEEE Transactions on Audio, Speech and Language Processing, 2026

Multi-channel speech separation in dynamic environments is challenging as time-varying spatial and spectral features evolve at different temporal scales. Existing methods typically employ sequential architectures, forcing a single network stream to simultaneously model both feature types, creating an inherent modeling conflict. In this paper, we propose a dual-branch parallel spectral-spatial (PS2) architecture that separately processes spectral and spatial features through parallel streams. The spectral branch uses a bi-directional long short-term memory (BLSTM)-based frequency module, a Mamba-based temporal module, and a self-attention module to model spectral features. The spatial branch employs bi-directional gated recurrent unit (BGRU) networks to process spatial features that encode the evolving geometric relationships between sources and microphones. Features from both branches are integrated through a cross-attention fusion mechanism that adaptively weights their contributions. Experimental results demonstrate that the PS2 outperforms existing state-of-the-art (SOTA) methods by 1.6-2.2 dB in scale-invariant signal-to-distortion ratio (SI-SDR) for moving speaker scenarios, with robust separation quality under different reverberation times (RT60), noise levels, and source movement speeds. Even with fast source movements, the proposed model maintains SI-SDR improvements of over 13 dB. These improvements are consistently observed across multiple datasets, including WHAMR! and our generated WSJ0-Demand-6ch-Move dataset.
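The dual-branch idea described above can be illustrated with a minimal PyTorch sketch: a BLSTM spectral branch and a BGRU spatial branch run in parallel and are fused with cross-attention, the spectral stream attending to the spatial stream. All layer sizes and feature dimensions below are illustrative assumptions, and the paper's Mamba and self-attention modules are omitted; this is not the authors' implementation.

```python
# Minimal sketch of a parallel spectral-spatial model with cross-attention
# fusion (assumed shapes and sizes; not the paper's PS2 implementation).
import torch
import torch.nn as nn

class ParallelSpectralSpatialSketch(nn.Module):
    def __init__(self, spec_dim=257, spat_dim=64, hidden=128, heads=4):
        super().__init__()
        # Spectral branch: BLSTM over time (the paper additionally uses
        # Mamba and self-attention modules, omitted here for brevity).
        self.spec_rnn = nn.LSTM(spec_dim, hidden, batch_first=True,
                                bidirectional=True)
        # Spatial branch: BGRU over time-varying spatial features.
        self.spat_rnn = nn.GRU(spat_dim, hidden, batch_first=True,
                               bidirectional=True)
        # Cross-attention fusion: spectral stream attends to spatial stream.
        self.fusion = nn.MultiheadAttention(2 * hidden, heads,
                                            batch_first=True)
        self.mask_head = nn.Linear(2 * hidden, spec_dim)

    def forward(self, spec_feats, spat_feats):
        # spec_feats: (batch, time, spec_dim); spat_feats: (batch, time, spat_dim)
        spec, _ = self.spec_rnn(spec_feats)
        spat, _ = self.spat_rnn(spat_feats)
        fused, _ = self.fusion(query=spec, key=spat, value=spat)
        return torch.sigmoid(self.mask_head(fused))  # e.g. a separation mask

model = ParallelSpectralSpatialSketch()
mask = model(torch.randn(2, 100, 257), torch.randn(2, 100, 64))
print(mask.shape)  # torch.Size([2, 100, 257])
```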

Beyond Omnidirectional: Neural Ambisonics Encoding for Arbitrary Microphone Directivity Patterns using Cross-Attention [Conference]

M. Heikkinen, A. Politis, K. Drossos, and T. Virtanen, "Beyond Omnidirectional: Neural Ambisonics Encoding for Arbitrary Microphone Directivity Patterns using Cross-Attention," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2026

We present a deep neural network approach for encoding microphone array signals into Ambisonics that generalizes to arbitrary microphone array configurations with fixed microphone count but varying locations and frequency-dependent directional characteristics. Unlike previous methods that rely only on array geometry as metadata, our approach uses directional array transfer functions, enabling accurate characterization of real-world arrays. The proposed architecture employs separate encoders for audio and directional responses, combining them through cross-attention mechanisms to generate array-independent spatial audio representations. We evaluate the method on simulated data in two settings: a mobile phone with complex body scattering, and a free-field condition, both with varying numbers of sound sources in reverberant environments. Evaluations demonstrate that our approach outperforms both conventional digital signal processing-based methods and existing deep neural network solutions. Furthermore, using array transfer functions instead of geometry as metadata input improves accuracy on realistic arrays.
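As an illustration of the separate-encoder-plus-cross-attention idea, the sketch below lets per-frame audio features attend to encoded directional array transfer functions and maps the result to Ambisonics channels. The linear encoders, feature dimensions, and first-order (4-channel) output are assumptions made for brevity, not the paper's architecture.

```python
# Hedged sketch: audio encoder + array-response encoder fused with
# cross-attention, producing array-independent Ambisonics channels.
import torch
import torch.nn as nn

class NeuralAmbisonicsSketch(nn.Module):
    def __init__(self, audio_dim=257, atf_dim=128, d_model=128, heads=4, ambi_ch=4):
        super().__init__()
        self.audio_enc = nn.Linear(audio_dim, d_model)  # per-frame audio features
        self.atf_enc = nn.Linear(atf_dim, d_model)      # per-microphone directional responses
        self.cross = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.out = nn.Linear(d_model, ambi_ch)          # e.g. first-order Ambisonics (4 ch)

    def forward(self, audio, atf):
        # audio: (batch, frames, audio_dim); atf: (batch, mics, atf_dim)
        q = self.audio_enc(audio)            # queries from the audio stream
        kv = self.atf_enc(atf)               # keys/values from the array responses
        fused, _ = self.cross(q, kv, kv)     # audio frames attend to array metadata
        return self.out(fused)               # (batch, frames, ambi_ch)

enc = NeuralAmbisonicsSketch()
ambi = enc(torch.randn(2, 100, 257), torch.randn(2, 6, 128))
print(ambi.shape)  # torch.Size([2, 100, 4])
```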

Discriminating Real And Synthetic Super-Resolved Audio Samples Using Embedding-Based Classifiers [Conference]

M. Silaev, K. Drossos, and T. Virtanen, "Discriminating Real And Synthetic Super-Resolved Audio Samples Using Embedding-Based Classifiers," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2026

Generative adversarial networks (GANs) and diffusion models have recently achieved state-of-the-art performance in audio super-resolution (ADSR), producing perceptually convincing wideband audio from narrowband inputs. However, existing evaluations primarily rely on signal-level or perceptual metrics, leaving open the question of how closely the distributions of synthetic super-resolved and real wideband audio match. Here we address this problem by analyzing the separability of real and super-resolved audio in various embedding spaces. We consider both middle-band (4→16 kHz) and full-band (16→48 kHz) upsampling tasks for speech and music, training linear classifiers to distinguish real from synthetic samples based on multiple types of audio embeddings. Comparisons with objective metrics and subjective listening tests reveal that embedding-based classifiers achieve near-perfect separation, even when the generated audio attains high perceptual quality and state-of-the-art metric scores. This behavior is consistent across datasets and models, including recent diffusion-based approaches, highlighting a persistent gap between perceptual quality and true distributional fidelity in ADSR models.
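The evaluation protocol, training a linear classifier on embeddings of real versus super-resolved audio, can be sketched as follows. Embedding extraction is mocked here with random features, and the logistic-regression classifier, embedding size, and split sizes are assumptions; the sketch only shows how separability would be measured once embeddings are available.

```python
# Sketch of the embedding-based discrimination protocol: a linear classifier
# on precomputed audio embeddings (random features stand in for embeddings
# from a pretrained audio model; hypothetical dimensions and sample counts).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
real_emb = rng.normal(0.0, 1.0, size=(500, 512))   # embeddings of real wideband audio
synth_emb = rng.normal(0.1, 1.0, size=(500, 512))  # embeddings of super-resolved audio

X = np.vstack([real_emb, synth_emb])
y = np.concatenate([np.zeros(500), np.ones(500)])  # 0 = real, 1 = synthetic

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("separation accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```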
