
Publications

Keyword: speech separation (2)

2026
Moving Speaker Separation Via Parallel Spectral-Spatial Processing [Journal]

Y. Wang, A. Politis, K. Drossos, and T. Virtanen, "Moving Speaker Separation Via Parallel Spectral-Spatial Processing," IEEE Transactions on Audio, Speech and Language Processing, 2026

Multi-channel speech separation in dynamic environments is challenging as time-varying spatial and spectral features evolve at different temporal scales. Existing methods typically employ sequential architectures, forcing a single network stream to simultaneously model both feature types, creating an inherent modeling conflict. In this paper, we propose a dual-branch parallel spectral-spatial (PS2) architecture that separately processes spectral and spatial features through parallel streams. The spectral branch uses a bi-directional long short-term memory (BLSTM)-based frequency module, a Mamba-based temporal module, and a self-attention module to model spectral features. The spatial branch employs bi-directional gated recurrent unit (BGRU) networks to process spatial features that encode the evolving geometric relationships between sources and microphones. Features from both branches are integrated through a cross-attention fusion mechanism that adaptively weights their contributions. Experimental results demonstrate that the PS2 outperforms existing state-of-the-art (SOTA) methods by 1.6-2.2 dB in scale-invariant signal-to-distortion ratio (SI-SDR) for moving speaker scenarios, with robust separation quality under different reverberation times (RT60), noise levels, and source movement speeds. Even with fast source movements, the proposed model maintains SI-SDR improvements of over 13 dB. These improvements are consistently observed across multiple datasets, including WHAMR! and our generated WSJ0-Demand-6ch-Move dataset.
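The reported gains are measured in scale-invariant signal-to-distortion ratio (SI-SDR). As a point of reference, here is a minimal NumPy sketch of the standard SI-SDR definition (the commonly used formulation, not code from the paper):

```python
import numpy as np

def si_sdr(estimate, target, eps=1e-12):
    """Scale-invariant SDR in dB (standard definition; illustrative only)."""
    estimate = np.asarray(estimate, dtype=float)
    target = np.asarray(target, dtype=float)
    # Project the estimate onto the target to factor out any gain difference.
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = alpha * target        # scaled target component
    error = estimate - projection      # residual distortion
    return 10.0 * np.log10((np.dot(projection, projection) + eps)
                           / (np.dot(error, error) + eps))
```

Because the metric projects out the estimate's gain, it is invariant to rescaling the output, which is why it is the usual choice for separation systems whose output level is arbitrary.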

2025
Attractor-Based Speech Separation of Multiple Utterances by Unknown Number of Speakers [Conference]

Y. Wang, A. Politis, K. Drossos, and T. Virtanen, "Attractor-Based Speech Separation of Multiple Utterances by Unknown Number of Speakers," in INTERSPEECH 2025, Rotterdam, Netherlands, 2025

This paper addresses the problem of single-channel speech separation, where the number of speakers is unknown, and each speaker may speak multiple utterances. We propose a speech separation model that simultaneously performs separation, dynamically estimates the number of speakers, and detects individual speaker activities by integrating an attractor module. The proposed system outperforms existing methods by introducing an attractor-based architecture that effectively combines local and global temporal modeling for multi-utterance scenarios. To evaluate the method in reverberant and noisy conditions, a multi-speaker multi-utterance dataset was synthesized by combining LibriSpeech speech signals with WHAM! noise signals. The results demonstrate that the proposed system accurately estimates the number of sources, effectively detects source activities, and separates the corresponding utterances into the correct outputs in both known and unknown source-count scenarios.
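The core attractor idea can be illustrated with a minimal deep-attractor-style masking step (a generic sketch of the mechanism with hypothetical shapes, not the paper's actual model): each source is represented by an attractor, a centroid in embedding space, and separation masks are obtained from the similarity between per-time-frequency-bin embeddings and those attractors.

```python
import numpy as np

def attractor_masks(embeddings, assignments):
    """Illustrative attractor-style masking (hypothetical shapes).

    embeddings:  (N, D) array, one D-dim embedding per time-frequency bin
    assignments: (N, C) one-hot source membership for each bin
    Returns the (C, D) attractors and (N, C) soft masks.
    """
    # Each attractor is the centroid of the embeddings assigned to one source.
    counts = assignments.sum(axis=0, keepdims=True).T          # (C, 1)
    attractors = (assignments.T @ embeddings) / np.maximum(counts, 1)
    # Masks come from softmax over embedding-attractor similarities.
    logits = embeddings @ attractors.T                         # (N, C)
    logits -= logits.max(axis=1, keepdims=True)                # numerical stability
    masks = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return attractors, masks
```

In this toy form the assignments are given; in an attractor-based separation network the attractors are produced by a learned module, which is also what lets the system estimate how many sources are active.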
