Publications

Keyword: deep neural networks (2)

2026

Moving Speaker Separation Via Parallel Spectral-Spatial Processing [Journal]

Y. Wang, A. Politis, K. Drossos, and T. Virtanen, "Moving Speaker Separation Via Parallel Spectral-Spatial Processing," IEEE Transactions on Audio, Speech and Language Processing, 2026

Multi-channel speech separation in dynamic environments is challenging because time-varying spatial and spectral features evolve at different temporal scales. Existing methods typically employ sequential architectures, forcing a single network stream to simultaneously model both feature types, creating an inherent modeling conflict. In this paper, we propose a dual-branch parallel spectral-spatial (PS2) architecture that separately processes spectral and spatial features through parallel streams. The spectral branch uses a bi-directional long short-term memory (BLSTM)-based frequency module, a Mamba-based temporal module, and a self-attention module to model spectral features. The spatial branch employs bi-directional gated recurrent unit (BGRU) networks to process spatial features that encode the evolving geometric relationships between sources and microphones. Features from both branches are integrated through a cross-attention fusion mechanism that adaptively weights their contributions. Experimental results demonstrate that PS2 outperforms existing state-of-the-art (SOTA) methods by 1.6-2.2 dB in scale-invariant signal-to-distortion ratio (SI-SDR) for moving speaker scenarios, with robust separation quality under different reverberation times (RT60), noise levels, and source movement speeds. Even with fast source movements, the proposed model maintains SI-SDR improvements of over 13 dB. These improvements are consistently observed across multiple datasets, including WHAMR! and our generated WSJ0-Demand-6ch-Move dataset.
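The abstract reports its results in SI-SDR. As a reader's aid, the following is a minimal sketch of the commonly used SI-SDR definition, not code from the paper: the estimate is projected onto the reference, and the energy of that projection is compared with the energy of the residual. The function name `si_sdr`, the mean removal step, and the plain-list signal representation are assumptions made for illustration.

```python
import math


def si_sdr(estimate, reference):
    """Scale-invariant SDR in dB for two equal-length sample sequences.

    Illustrative sketch only: mean removal is a common convention,
    assumed here rather than taken from the paper.
    """
    if len(estimate) != len(reference) or not reference:
        raise ValueError("signals must be non-empty and of equal length")
    n = len(reference)

    # Remove the mean of each signal.
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]

    ref_energy = sum(r * r for r in ref)
    if ref_energy == 0:
        raise ValueError("reference has zero energy after mean removal")

    # Optimal scaling of the reference towards the estimate.
    alpha = sum(e * r for e, r in zip(est, ref)) / ref_energy
    target = [alpha * r for r in ref]
    residual = [e - t for e, t in zip(est, target)]

    target_energy = sum(t * t for t in target)
    residual_energy = sum(x * x for x in residual)
    if residual_energy == 0:
        return math.inf
    if target_energy == 0:
        return -math.inf
    return 10 * math.log10(target_energy / residual_energy)
```

For example, `si_sdr([1.1, -0.9, 0.9, -1.1], [1, -1, 1, -1])` gives about 20 dB, and rescaling the estimate by any positive factor leaves the value unchanged, which is the scale invariance the metric is named for.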

https://ieeexplore.ieee.org/document/11425833

2025

Multi-Utterance Speech Separation and Association Trained on Short Segments [Conference]

Y. Wang, A. Politis, K. Drossos, and T. Virtanen, "Multi-Utterance Speech Separation and Association Trained on Short Segments," in Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Tahoe City, CA, USA, 2025

Current deep neural network (DNN) based speech separation faces a fundamental challenge: although models must be trained on short segments due to computational constraints, real-world applications typically require processing recordings that are significantly longer than those seen during training and that contain multiple utterances per speaker. In this paper, we investigate how existing approaches perform in this challenging scenario and propose a frequency-temporal recurrent neural network (FTRNN) that effectively bridges this gap. Our FTRNN employs a full-band module to model frequency dependencies within each time frame and a sub-band module that models temporal patterns in each frequency band. Despite being trained on short fixed-length segments of 10 s, our model demonstrates robust separation when processing signals significantly longer than training segments (21-121 s) and preserves speaker association across utterance gaps exceeding those seen during training. Unlike the conventional segment-separation-stitch paradigm, our lightweight approach (0.9 M parameters) performs inference on long audio without segmentation, eliminating segment boundary distortions while simplifying deployment. Experimental results demonstrate the generalization ability of FTRNN for multi-utterance speech separation and speaker association.
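The full-band and sub-band modules described above operate along different axes of a time-frequency representation. The sketch below is not the authors' implementation; it only shows, under the assumption that the input is a list of time frames each holding one value per frequency bin, which sequences each module would see. The function names `full_band_sequences` and `sub_band_sequences` are hypothetical.

```python
def full_band_sequences(spec):
    """One sequence per time frame, running along the frequency axis.

    `spec` is assumed to be a list of T frames, each a list of F bins.
    """
    return [list(frame) for frame in spec]


def sub_band_sequences(spec):
    """One sequence per frequency band, running along the time axis."""
    return [list(band) for band in zip(*spec)]


if __name__ == "__main__":
    # Toy input with T = 2 frames and F = 3 frequency bins.
    spec = [[1, 2, 3],
            [4, 5, 6]]
    print(full_band_sequences(spec))  # T sequences of length F
    print(sub_band_sequences(spec))   # F sequences of length T
```

In this view a longer recording only lengthens the sub-band sequences, while the number of frequency bands and the length of each full-band sequence stay fixed. This is one plausible reading of why a recurrent model along time can be run on inputs longer than its training segments.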

https://ieeexplore.ieee.org/abstract/document/11230969

Konstantinos Drossos®
