Noise Cancellation in C++ and Python: Phase 2 — AI-Powered Denoising with DeepFilterNet

April 29, 2026 · Luciano Muratore

Phase 1 established the C++ audio pipeline — capturing audio from a microphone and playing it back in real time. Phase 2 moves to Python and introduces the intelligence layer: a pre-trained deep learning model called DeepFilterNet that removes background noise from audio signals.

At the end of this phase, the Python layer works in isolation, taking a noisy .wav file as input and producing a clean one as output. This proves the AI model works correctly before it is connected to the live C++ audio stream in Phase 3.

Why Python for the AI Layer

Deep learning frameworks like PyTorch are Python-first. The entire ecosystem — model training, inference, audio processing utilities — is built around Python. Trying to run DeepFilterNet directly in C++ would mean reimplementing a large part of that ecosystem from scratch.

Python handles the AI layer cleanly. C++ handles the real-time audio layer cleanly. Each language does what it does best, and a bridge connects them. That separation is the core architectural decision of this project.

What is DeepFilterNet

DeepFilterNet is an open-source neural network designed specifically for speech enhancement — the technical term for removing noise from voice audio. It was developed by Hendrik Schröter and colleagues and is one of the best-performing open-source models available for this task.

It works in the frequency domain. For each incoming audio chunk, it converts the signal from raw samples into a frequency representation using a short-time Fourier transform (STFT), then uses a recurrent neural network to predict which frequency components belong to the voice and which belong to the noise. The noise components are suppressed, and the signal is converted back into audio samples.
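
The transform, suppress, and inverse-transform loop described above can be sketched with NumPy alone. This is only an illustration, not DeepFilterNet's implementation: the names `stft_mask_roundtrip`, `gain_fn`, and `keep_below_1khz` are hypothetical, a fixed low-pass gain stands in for the gains a neural network would predict, and the 960-sample window with a 480-sample hop (20 ms and 10 ms at 48 kHz) is an assumption chosen for the sketch.

```python
import numpy as np

def stft_mask_roundtrip(signal, frame_len=960, hop=480, gain_fn=None):
    """Analyze overlapping frames, scale each frequency bin, and overlap-add.

    gain_fn (hypothetical) receives one frame's complex spectrum and returns
    a per-bin gain; in a real denoiser a network would supply these gains.
    """
    # Periodic Hann window: copies shifted by half a frame sum to exactly 1,
    # so overlap-add reconstructs the input when no gain is applied.
    window = np.hanning(frame_len + 1)[:-1]
    n_frames = 1 + (len(signal) - frame_len) // hop
    out = np.zeros(len(signal))
    for i in range(n_frames):
        start = i * hop
        frame = signal[start:start + frame_len] * window
        spectrum = np.fft.rfft(frame)                # time -> frequency
        if gain_fn is not None:
            spectrum = spectrum * gain_fn(spectrum)  # suppress unwanted bins
        # frequency -> time, accumulated into the output (overlap-add)
        out[start:start + frame_len] += np.fft.irfft(spectrum, n=frame_len)
    return out

def keep_below_1khz(spectrum, bin_hz=50.0):
    """Stand-in mask: gain 1 below 1 kHz, 0 above (48000 / 960 = 50 Hz per bin)."""
    freqs = np.arange(len(spectrum)) * bin_hz
    return (freqs < 1000.0).astype(float)
```

Calling the function without `gain_fn` should return the interior of the signal unchanged, which is a useful sanity check before any mask is applied. The first and last half-frames are covered by only one window and are therefore attenuated.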

The released model, DeepFilterNet3 (the third generation), is pre-trained on thousands of hours of speech mixed with hundreds of different noise types: fans, traffic, crowds, keyboard clicks, and more. It can therefore be used directly, without any training step.

The Python Environment

Before installing any dependencies, an isolated Python virtual environment is created. This ensures DeepFilterNet’s specific library versions do not interfere with other Python projects on the same machine.

cd NoiseCancellation\python
py -3.11 -m venv venv
venv\Scripts\activate

Python 3.11 is used specifically because DeepFilterNet’s dependencies — particularly PyTorch and torchaudio — have not yet been fully validated against Python 3.13 at the time of writing. Using a dedicated virtual environment also means the exact set of dependencies can be reproduced later by anyone working on the project.

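The required packages are installed in a specific order. PyTorch and torchaudio go first, pinned to matching CPU builds, and DeepFilterNet is installed on top of them: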

pip install torch==2.0.1 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cpu
pip install deepfilternet soundfile numpy

PyTorch is installed from the CPU-only index because this phase runs inference on the processor rather than a GPU. The model is small, so CPU inference is quick enough for a five-second audio file. In Phase 4, switching to GPU inference or ONNX Runtime will be considered as an optimization for real-time performance.
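
After installation, it can help to confirm what actually landed in the virtual environment. The following is a minimal sketch using only the standard library. The function name `report_environment` is hypothetical, and the distribution names are assumed to match the ones used in the pip commands above.

```python
import sys
from importlib import metadata

def report_environment(packages=("torch", "torchaudio", "deepfilternet",
                                 "soundfile", "numpy")):
    """Return the interpreter version and the installed version of each package.

    Packages that are not installed map to None instead of raising.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return {
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "packages": versions,
    }
```

Printing `report_environment()` inside the activated environment gives a quick record of the setup that can be kept alongside the project for reproducibility.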

Generating a Test Audio File

To test the model without recording real audio, a synthetic noisy file is generated programmatically. The script simulates a voice signal as a combination of sine waves at 200 Hz, 400 Hz, and 800 Hz (approximating the fundamental frequency and harmonics of a human voice) and mixes it with white noise to simulate background interference.

import numpy as np
import wave

def generate_noisy_audio(output_path: str, duration: float = 5.0):
    sample_rate = 48000  # DeepFilterNet expects 48kHz
    num_samples = int(sample_rate * duration)

    # Sample times spaced exactly 1/sample_rate apart (endpoint excluded)
    t = np.linspace(0, duration, num_samples, endpoint=False)

    # Simulate a voice signal with harmonics
    voice = 0.4 * np.sin(2 * np.pi * 200 * t)
    voice += 0.2 * np.sin(2 * np.pi * 400 * t)
    voice += 0.1 * np.sin(2 * np.pi * 800 * t)

    # Simulate background noise
    noise = 0.3 * np.random.randn(num_samples)

    # Mix and normalize
    noisy_signal = voice + noise
    noisy_signal = noisy_signal / np.max(np.abs(noisy_signal))

    # Convert to 16-bit PCM and save
    samples_int16 = (noisy_signal * 32767).astype(np.int16)
    with wave.open(output_path, 'w') as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(samples_int16.tobytes())

if __name__ == "__main__":
    generate_noisy_audio("noisy_input.wav")
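
How noisy is this test file? The amplitudes in the script give an answer in closed form: a sine of amplitude A has average power A²/2, and zero-mean Gaussian noise has power equal to its variance. The sketch below works through that arithmetic. The function name `mix_snr_db` is hypothetical, and the result is the expected value, not a measurement of any particular file.

```python
import math

def mix_snr_db(tone_amplitudes=(0.4, 0.2, 0.1), noise_std=0.3):
    """Expected signal-to-noise ratio of sine tones mixed with Gaussian noise."""
    # A sine of amplitude A has average power A^2 / 2; the powers of tones
    # at different frequencies add.
    signal_power = sum(a * a / 2 for a in tone_amplitudes)
    # Zero-mean Gaussian noise has power equal to its variance.
    noise_power = noise_std ** 2
    return 10 * math.log10(signal_power / noise_power)
```

With the default values the signal power is 0.105 and the noise power is 0.09, so the ratio is roughly 0.7 dB: the tones and the noise are almost equally strong. The normalization step in the script scales both by the same factor and does not change this ratio.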

The sample rate of 48,000 Hz is important. DeepFilterNet3 is trained to operate at 48kHz, which is different from the 16kHz used by the C++ audio engine in Phase 1. This difference will be addressed in Phase 3 by resampling the audio before passing it to the model.
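
Since the sample rate matters, a quick check of the generated file's header can catch a mismatch early. This is a minimal sketch using the standard `wave` module. The helper names `wav_format` and `matches_generator_format` are hypothetical.

```python
import wave

def wav_format(path):
    """Read a WAV header and return its basic format fields."""
    with wave.open(path, "rb") as wav_file:
        return {
            "channels": wav_file.getnchannels(),
            "sample_width_bytes": wav_file.getsampwidth(),
            "sample_rate": wav_file.getframerate(),
            "duration_s": wav_file.getnframes() / wav_file.getframerate(),
        }

def matches_generator_format(path, expected_rate=48000):
    """True if the file is mono 16-bit PCM at the expected sample rate."""
    fmt = wav_format(path)
    return (fmt["channels"] == 1
            and fmt["sample_width_bytes"] == 2
            and fmt["sample_rate"] == expected_rate)
```

For the file written by the generator script, `wav_format("noisy_input.wav")` should report one channel, two bytes per sample, and 48000 Hz.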

The Denoising Script

The denoising script loads the pre-trained model, processes the noisy audio, and saves the result.

import torch
from df.enhance import enhance, init_df, load_audio, save_audio

def denoise_file(input_path: str, output_path: str):
    print("Loading model...")

    # 1. Initialize DeepFilterNet
    #    Downloads the pre-trained model on first run (~50MB)
    model, df_state, _ = init_df()

    print("Model loaded!")
    print(f"Processing: {input_path}")

    # 2. Load the noisy audio file, resampled to the model's expected rate
    sample_rate = df_state.sr()
    audio, _ = load_audio(input_path, sr=sample_rate)

    print(f"  Sample rate:  {sample_rate} Hz")
    print(f"  Duration:     {audio.shape[-1] / sample_rate:.2f} seconds")
    print(f"  Shape:        {audio.shape}")

    # 3. Run the AI model
    enhanced_audio = enhance(model, df_state, audio)

    # 4. Save the cleaned audio
    save_audio(output_path, enhanced_audio, df_state.sr())

    print(f"Done! Clean audio saved to: {output_path}")

if __name__ == "__main__":
    denoise_file(
        input_path="noisy_input.wav",
        output_path="clean_output.wav"
    )

Walking Through the Code

init_df() initializes the DeepFilterNet3 model. On the first run it downloads the pre-trained checkpoint (~50MB) and stores it in the local user cache. On subsequent runs it loads directly from the cache. The function returns three values: the model itself, a df_state object that carries configuration like sample rate and frame size, and metadata that can be ignored for basic usage.

load_audio() reads the .wav file and resamples it to the model’s expected sample rate if necessary. The audio is returned as a PyTorch tensor of shape [channels, samples]. In the mono case this is [1, 240000] for five seconds at 48kHz.

enhance() is where the noise cancellation happens. It runs the audio through the neural network frame by frame and returns a tensor of the same shape containing the cleaned signal. This is the function that will eventually be called on each chunk of audio arriving from C++ in Phase 3.

save_audio() writes the enhanced tensor back to a .wav file at the correct sample rate.
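
Because enhance() is expected to run on chunks of audio in Phase 3, it may help to see how a buffer can be cut into fixed-size pieces. The sketch below is an assumption about the buffering only: the function name `split_into_chunks` and the chunk size are hypothetical, it operates on a NumPy array with the same [channels, samples] layout rather than a PyTorch tensor, and it says nothing about how well the model performs on short, independent chunks.

```python
import numpy as np

def split_into_chunks(audio, chunk_size):
    """Split a [channels, samples] array into equal chunks, zero-padding the last."""
    channels, n_samples = audio.shape
    n_chunks = -(-n_samples // chunk_size)  # ceiling division
    padded = np.zeros((channels, n_chunks * chunk_size), dtype=audio.dtype)
    padded[:, :n_samples] = audio
    return [padded[:, i * chunk_size:(i + 1) * chunk_size]
            for i in range(n_chunks)]
```

As an example, the [1, 240000] buffer from this phase splits into 500 chunks of 480 samples, which is 10 ms per chunk at 48 kHz.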

The Output

Running the script produces the following output:

Loading model...
INFO | DF | Running on torch 2.0.1+cpu
INFO | DF | Loading model settings of DeepFilterNet3
INFO | DF | Using DeepFilterNet3 model at ...Cache\DeepFilterNet3
INFO | DF | Found checkpoint model_120.ckpt.best with epoch 120
INFO | DF | Running on device cpu
INFO | DF | Model loaded
Model loaded!
Processing: noisy_input.wav
  Sample rate:  48000 Hz
  Duration:     5.00 seconds
  Shape:        torch.Size([1, 240000])
Done! Clean audio saved to: clean_output.wav

The log line about model_120.ckpt.best indicates that the released weights come from the checkpoint saved at epoch 120, that is, after 120 passes over the training dataset. Listening to both files confirms the model works: the white noise that is clearly audible in noisy_input.wav is substantially reduced in clean_output.wav, while the voice-like signal is preserved.
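
Listening is subjective, so a rough number can complement it. Because the test signal consists of known tones, one option is to compare the spectral energy close to 200, 400, and 800 Hz with the energy everywhere else. The sketch below does this under stated assumptions: both helper names are hypothetical, the files are assumed to be mono 16-bit PCM (the generator writes that format; the format of the enhanced file should be verified), and the 5 Hz band around each tone is an arbitrary choice.

```python
import wave
import numpy as np

def read_mono_int16(path):
    """Load a mono 16-bit PCM WAV file as floats in [-1, 1) plus its sample rate."""
    with wave.open(path, "rb") as wav_file:
        if wav_file.getnchannels() != 1 or wav_file.getsampwidth() != 2:
            raise ValueError("expected mono 16-bit PCM")
        sample_rate = wav_file.getframerate()
        raw = wav_file.readframes(wav_file.getnframes())
    samples = np.frombuffer(raw, dtype="<i2").astype(np.float64) / 32768.0
    return samples, sample_rate

def tone_to_residual_db(samples, sample_rate, tones_hz=(200, 400, 800),
                        half_width_hz=5.0):
    """Ratio (dB) of spectral energy near the test tones to energy elsewhere."""
    power = np.abs(np.fft.rfft(samples)) ** 2
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    near_tone = np.zeros(len(freqs), dtype=bool)
    for tone in tones_hz:
        near_tone |= np.abs(freqs - tone) <= half_width_hz
    # Guard against division by zero for a perfectly clean signal.
    residual = max(power[~near_tone].sum(), np.finfo(float).tiny)
    return 10 * np.log10(power[near_tone].sum() / residual)
```

Running `tone_to_residual_db(*read_mono_int16(path))` on both files gives two values to compare. A higher value for clean_output.wav than for noisy_input.wav would indicate that energy outside the tones was reduced relative to the tones themselves.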

What This Phase Establishes

Phase 2 establishes that the AI model works correctly in isolation. The model loads, processes audio in the right format, and produces a cleaner output. It also surfaces an important detail for Phase 3: the model expects 48kHz audio, while the C++ engine captures at 16kHz. The bridge layer will need to handle this resampling step, either on the C++ side before sending or on the Python side before processing.
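
To make the resampling step concrete, here is a deliberately simple sketch that converts 16 kHz audio to 48 kHz by linear interpolation. It is not the method this project will use in Phase 3. The function name `upsample_linear` is hypothetical, and linear interpolation is a crude stand-in for a band-limited resampler such as `torchaudio.functional.resample`.

```python
import numpy as np

def upsample_linear(samples, orig_rate=16000, target_rate=48000):
    """Resample a 1-D signal by linear interpolation between neighboring samples."""
    n_out = int(round(len(samples) * target_rate / orig_rate))
    # Time stamps (in seconds) of the input samples and of the output samples.
    t_in = np.arange(len(samples)) / orig_rate
    t_out = np.arange(n_out) / target_rate
    # Output times past the last input sample hold the final input value.
    return np.interp(t_out, t_in, samples)
```

One second of 16 kHz audio (16000 samples) becomes 48000 samples. For low frequencies such as a 200 Hz tone the interpolation error is small, but the error grows with frequency, which is one reason a proper resampler is the safer choice for speech.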

The next phase introduces ZeroMQ as the communication bridge between the C++ audio engine and this Python denoising layer, connecting the two into a single real-time pipeline.

GitHub link: https://github.com/Dextromethorpan/Noise_Cancellation