D. Luong, K. Drossos, M. Heikkinen, and T. Virtanen, "Automatic Contextual Audio Denoising," in Proceedings of 34th European Signal Conference (EUSIPCO), Bruges, Belgium, 2026
Audio context determines which sound components and sources are relevant and which can be perceived as irrelevant (noise) by listeners. For example, traffic noise is informative in urban surveillance but noise for a phone call at the same location. Most current audio denoising systems apply fixed target-noise definitions, often removing useful components in one context while failing to suppress irrelevant components. To address this, we introduce the concept automatic contextual audio denoising (ACAD) which defines target and noise based on the inferred context. In this work, we restrict context to be associated with an acoustic scene class. We label sound events outside the event distribution of a scene class (noise) as out-of-context (OC) and events typical for that scene as in-context (IC). We implement a deep learning method that automatically infers the context of the audio signal and removes OC components, and benchmark it against variants: without context inference, with oracle context, and with separately provided uninformative context. On paired clean/noisy data across diverse contexts, where OC components in one context may be IC in another, our proposed method outperforms other approaches across standard objective metrics, indicating that the model can infer context and context-dependent processing can enhance denoising.
K. Drosos, M. O. Heikkinen, J. T. Vilkamo, P. Tsiaflakis, “Model for speech enhancement,” U.S. Patent US20260065922A1, filed Aug. 15, 2025; published Mar. 05, 2026
Examples of the disclosure relate to a model that can be used for speech enhancement. The model comprises an encoder part comprising a sequence of encoding layers and caused to receive input data. The input data is based on a current frame of a noisy speech signal and one or more past frames of the noisy speech signal. The sequence of encoding layers is caused to process the input data so that output data of the encoder part comprises a reduced number of the multiple frequency positions and a single temporal position. The model also comprises a decoder part comprising a sequence of decoding layers caused to receive data from a prior decoding layer. The output data of the decoder part comprises multiple frequency positions and a single temporal position. The output data of the decoder part is for post processing to provide an output signal for speech enhancement.