Noise Cancellation in C++ and Rust: Summary

April 29, 2026 · Luciano Muratore

Introduction

This article documents the design and implementation of a real-time noise cancellation system built from scratch. The system captures audio from a microphone, removes background noise using a deep learning model, and plays back the clean audio through headphones — all with less than 50 milliseconds of total latency.

The project went through seven phases of development, each solving a specific problem. This article covers the final architecture, the key technical decisions, and what changed along the way.


The Architecture: C++ → AI Engine → Rust

The system is composed of three programs running simultaneously, connected by a message queue.

Microphone
    │
    ▼
C++ Audio Engine
    │  ZeroMQ PUSH (port 5555)
    │  1536 samples every 34.8ms
    ▼
Rust Inference Server
    │  ZeroMQ PUSH (port 5556)
    │  1536 clean samples every 34.8ms
    ▼
C++ Audio Engine
    │
    ▼
Headphones

The C++ audio engine uses PortAudio to capture audio from the microphone at 44100 Hz. Every 34.8 milliseconds it has accumulated 1536 samples, which it serialises to raw bytes and sends over ZeroMQ. It then waits to receive the same number of cleaned samples back, which it plays through the headphones. The C++ layer has no knowledge of AI — it just moves audio in and out.

The Rust inference server receives each chunk, runs it through the DeepFilterNet3 noise cancellation model, and sends the clean audio back. It runs as a tight loop with no garbage collector and no runtime pauses.

ZeroMQ connects the two programs using a PUSH/PULL pattern. Each chunk is sent as a raw array of 32-bit floating point numbers in little-endian byte order (6144 bytes per chunk). This is the simplest possible wire format — no headers, no framing, no serialisation overhead.
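As a rough illustration of this wire format, the sketch below packs a chunk of samples into little-endian bytes and unpacks it again. It uses only the Rust standard library. The function names `encode_chunk` and `decode_chunk` are hypothetical and not taken from the project, and the ZeroMQ send and receive calls are left out.

```rust
/// Number of samples per chunk, as stated in the article.
const CHUNK_SAMPLES: usize = 1536;

/// Serialise samples to raw little-endian f32 bytes (no header, no framing).
fn encode_chunk(samples: &[f32]) -> Vec<u8> {
    let mut bytes = Vec::with_capacity(samples.len() * 4);
    for s in samples {
        bytes.extend_from_slice(&s.to_le_bytes());
    }
    bytes
}

/// Parse raw bytes back into samples.
/// Returns None if the byte count is not a multiple of 4 (an assumed policy).
fn decode_chunk(bytes: &[u8]) -> Option<Vec<f32>> {
    if bytes.len() % 4 != 0 {
        return None;
    }
    Some(
        bytes
            .chunks_exact(4)
            .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
            .collect(),
    )
}

fn main() {
    let chunk = vec![0.25_f32; CHUNK_SAMPLES];
    let wire = encode_chunk(&chunk);
    println!("{} samples -> {} bytes", chunk.len(), wire.len());
    let decoded = decode_chunk(&wire).expect("length is a multiple of 4");
    assert_eq!(decoded, chunk);
}
```

A 1536-sample chunk encodes to 6144 bytes, matching the figure above. Rejecting byte counts that are not a multiple of four is a choice made for this sketch, not something the original server is known to do.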

The C++ engine does not change between experiments. Once the pipeline was proven correct, all further improvements were made exclusively in the Rust server.


Why Rust Is Faster Than Python

The original server was written in Python using PyTorch. It worked, but it had a fundamental problem: the Python GIL.

The Global Interpreter Lock (GIL) is a mutex inside the Python runtime that prevents more than one thread from executing Python code at the same time. Even on a multi-core machine, only one thread executes Python bytecode at any moment. In this server, inference takes around 25 milliseconds per chunk, and while the inference thread held the GIL, the thread that receives audio chunks from C++ could not run. Chunks queued up faster than they were processed, resulting in dropped audio.

Switching to Rust eliminated this problem entirely for three reasons:

No GIL. Rust has no interpreter and no global lock. Threads run truly in parallel on multiple CPU cores, and they synchronise only where the program shares data explicitly.

No garbage collector. Python's cyclic garbage collector periodically pauses the interpreter to reclaim unused memory. These pauses are unpredictable and can last several milliseconds, which is enough to cause a drop in a real-time audio pipeline. Rust decides at compile time where each value is freed, through its ownership system, so there is no collector and there are no collection pauses.

ONNX Runtime instead of PyTorch. The Python server used PyTorch, which carries significant overhead for dynamic graph construction and Python object management. The Rust server uses ONNX Runtime (ort 2.0.0-rc.12), which runs the model as a static optimised computation graph. Inference time dropped from ~25ms to ~12ms.

The result was a drop-rate reduction from ~1% (Python with sleep fix) to ~0% (Rust), with inference running roughly twice as fast (~25ms down to ~12ms).


How the AI Engine Works

The noise cancellation model is DeepFilterNet3, a state-of-the-art speech enhancement model developed by Hendrik Schröter and colleagues at Friedrich-Alexander-Universität Erlangen-Nürnberg. It does not work like a simple filter. Instead, it uses a neural network to estimate which parts of the audio are voice and which are noise, then suppresses the noise while leaving the voice intact.

DeepFilterNet3 is actually three separate ONNX models that must be orchestrated in sequence:

enc.onnx: the Encoder (1.9 MB)

Before the encoder runs, the STFT stage converts the raw audio into a frequency-domain representation, from which two types of features are extracted:

  • ERB (Equivalent Rectangular Bandwidth) features: 32 values representing the log-power of the audio across 32 perceptual frequency bands, normalised with an exponential moving average.
  • Complex features: 96 complex frequency bins (real and imaginary parts separately), representing the fine structure of the lower frequency range.

These features are fed into the encoder, which produces an embedding vector and several skip-connection tensors that carry information to the decoders.
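To make the normalisation step concrete, here is a minimal sketch of log-power features with a running-mean subtraction. The smoothing factor `alpha`, the small epsilon, and the function name `erb_features` are assumptions for illustration. The exact constants used by DeepFilterNet3 are not given in this article.

```rust
/// Convert per-band power to dB, then subtract an exponential moving
/// average of that band's past values. `state` holds one running mean
/// per band and must persist between frames.
/// `alpha` is an assumed smoothing factor, not a value from the model.
fn erb_features(band_power: &[f32], state: &mut [f32], alpha: f32) -> Vec<f32> {
    assert_eq!(band_power.len(), state.len());
    band_power
        .iter()
        .zip(state.iter_mut())
        .map(|(&power, mean)| {
            // The epsilon avoids taking the logarithm of zero.
            let db = 10.0 * (power + 1e-10).log10();
            *mean = alpha * *mean + (1.0 - alpha) * db;
            db - *mean
        })
        .collect()
}

fn main() {
    let mut state = vec![0.0_f32; 32];
    let frame_power = vec![100.0_f32; 32];
    let first = erb_features(&frame_power, &mut state, 0.5);
    let second = erb_features(&frame_power, &mut state, 0.5);
    println!("first: {:.2}, second: {:.2}", first[0], second[0]);
}
```

With a constant input the normalised value shrinks from frame to frame, because the running mean catches up with the signal level.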

erb_dec.onnx: the ERB Decoder (3.3 MB)

The ERB decoder takes the encoder embedding and produces a gain mask: 32 values between 0 and 1, one per ERB band. This mask is applied to the full frequency spectrum, with each band's gain scaling every frequency bin inside that band. A gain of 1 means “keep this frequency band unchanged.” A gain of 0 means “suppress this frequency band completely.” Noise is suppressed by driving the gains of noisy bands toward zero.
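Applying a per-band mask to a spectrum can be sketched as follows. The band layout (`band_widths`, the number of bins in each band) is hypothetical here, since the real ERB band edges are not listed in this article, and complex bins are represented as plain `(re, im)` tuples to keep the example free of external crates.

```rust
/// Apply one gain per band to every frequency bin in that band.
/// `band_widths[i]` is the number of bins covered by band i
/// (a hypothetical layout, not the model's real ERB table).
fn apply_band_gains(spectrum: &mut [(f32, f32)], gains: &[f32], band_widths: &[usize]) {
    assert_eq!(gains.len(), band_widths.len());
    assert_eq!(band_widths.iter().sum::<usize>(), spectrum.len());
    let mut bin = 0;
    for (&gain, &width) in gains.iter().zip(band_widths) {
        // Keep the gain inside the documented 0..1 range.
        let g = gain.clamp(0.0, 1.0);
        for (re, im) in &mut spectrum[bin..bin + width] {
            *re *= g;
            *im *= g;
        }
        bin += width;
    }
}

fn main() {
    // Toy example: 6 bins grouped into 3 bands of widths 1, 2 and 3.
    let mut spectrum = vec![(1.0_f32, -1.0_f32); 6];
    apply_band_gains(&mut spectrum, &[1.0, 0.5, 0.0], &[1, 2, 3]);
    println!("{:?}", spectrum);
}
```

Scaling the real and imaginary parts by the same real gain changes a bin's magnitude but leaves its phase untouched.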

df_dec.onnx: the Deep Filter Decoder (3.3 MB)

The deep filter decoder handles the lower frequency range (approximately 0–4.8 kHz) with higher precision. Instead of a simple gain mask, it produces complex filter coefficients for each of the 96 lower frequency bins. For every bin, these coefficients implement a finite impulse response (FIR) filter applied across 5 consecutive frames, allowing the model to use temporal context when reconstructing the voice signal. This is particularly important for voiced consonants and vowels.
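The idea of a per-bin complex FIR filter over several frames can be sketched like this. The sketch assumes the filter only looks at the current and past frames (no look-ahead) and that `history[0]` is the newest frame; both are simplifying assumptions, and the names `deep_filter` and `cmul` are hypothetical.

```rust
/// A complex number stored as (real, imaginary).
type Complex = (f32, f32);

/// Complex multiplication.
fn cmul(a: Complex, b: Complex) -> Complex {
    (a.0 * b.0 - a.1 * b.1, a.0 * b.1 + a.1 * b.0)
}

/// Per-bin FIR filter across frames.
/// `history[k][f]` is bin f of the frame that is k frames old (k = 0 is newest);
/// `coefs[k][f]` is the complex tap applied to that frame at that bin.
fn deep_filter(history: &[Vec<Complex>], coefs: &[Vec<Complex>]) -> Vec<Complex> {
    assert_eq!(history.len(), coefs.len());
    assert!(!history.is_empty());
    let bins = history[0].len();
    let mut out = vec![(0.0, 0.0); bins];
    for (frame, taps) in history.iter().zip(coefs) {
        assert_eq!(frame.len(), bins);
        assert_eq!(taps.len(), bins);
        for f in 0..bins {
            let p = cmul(taps[f], frame[f]);
            out[f].0 += p.0;
            out[f].1 += p.1;
        }
    }
    out
}

fn main() {
    // 5 frames of history and 96 bins, matching the sizes quoted in the text.
    let history = vec![vec![(1.0_f32, 0.0_f32); 96]; 5];
    let coefs = vec![vec![(0.2_f32, 0.0_f32); 96]; 5];
    let filtered = deep_filter(&history, &coefs);
    println!("bin 0 after filtering: {:?}", filtered[0]);
}
```

Five complex taps per bin amount to ten real numbers per bin, which would be consistent with a coefficient tensor of shape 96×10.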

The processing pipeline for each 10ms frame of audio is:

Raw audio frame (480 samples @ 48 kHz)
    │
    ▼ STFT analysis (FFT size 960, hop 480)
Complex spectrum [481 bins]
    │
    ├──► ERB features [32]  ──► enc.onnx ──► emb, e0-e3, c0, lsnr
    └──► Complex features [96×2]
                                    │
                              ┌─────┴──────┐
                              ▼            ▼
                         erb_dec       df_dec
                              │            │
                        ERB mask [32]  DF coefs [96×10]
                              │            │
                              ▼            ▼
                      Apply mask    Apply DF filter
                      to full       to lower 96 bins
                      spectrum
                           │
                           ▼
                    ISTFT synthesis
                           ▼
                    Enhanced audio frame
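The sizes in this diagram follow from the STFT parameters. The short sketch below recomputes them. The constant and function names are chosen for this example only.

```rust
const SAMPLE_RATE_HZ: f64 = 48_000.0;
const FFT_SIZE: usize = 960;
const HOP: usize = 480;
const DF_BINS: usize = 96;

/// A real-input FFT of size N yields N/2 + 1 non-redundant bins.
fn num_bins(fft_size: usize) -> usize {
    fft_size / 2 + 1
}

/// Spacing between adjacent frequency bins in Hz.
fn bin_width_hz(sample_rate_hz: f64, fft_size: usize) -> f64 {
    sample_rate_hz / fft_size as f64
}

/// Duration of one hop in milliseconds.
fn hop_ms(hop: usize, sample_rate_hz: f64) -> f64 {
    hop as f64 * 1000.0 / sample_rate_hz
}

fn main() {
    let width = bin_width_hz(SAMPLE_RATE_HZ, FFT_SIZE);
    println!("spectrum bins: {}", num_bins(FFT_SIZE));
    println!("bin width: {} Hz", width);
    println!("frame duration: {} ms", hop_ms(HOP, SAMPLE_RATE_HZ));
    println!("deep filter covers up to {} Hz", DF_BINS as f64 * width);
}
```

With a 50 Hz bin spacing, the 96 deep-filter bins span 4.8 kHz, which is where the "approximately 0–4.8 kHz" range comes from.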

One important discovery during development: the ONNX models are stateless at their inputs and outputs. The GRU (Gated Recurrent Unit) hidden states, which give the model its temporal memory, are managed internally inside the ONNX graph. This means the Rust server does not need to carry any state between frames — the models handle it themselves.


The Ring Buffer

The most important architectural change between Phase 5 and Phase 6 was the introduction of continuous ring buffers.

The problem with Phase 5

In the original implementation, each incoming chunk was processed independently:

  1. Take 1536 samples at 44100 Hz
  2. Resample to 48000 Hz → ~1672 samples
  3. Split into 480-sample hops and process each one
  4. Resample output back to 44100 Hz
  5. Pad or trim to exactly 1536 samples

The problem lies in steps 3 and 5. Because 1672 / 480 ≈ 3.48, each chunk contains only 3 complete hops (1440 samples) plus 232 leftover samples. Those 232 samples were discarded, and the output was padded with zeros to reach 1536. This silenced about 14% of every chunk (232 / 1672), causing significant voice loss. Long vowels sounded truncated. Fast speech was almost unintelligible.
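The arithmetic behind the lost samples can be checked with a few lines of Rust. This is only a worked calculation. The helper names `resampled_len` and `split_into_hops` are made up for the example, and rounding to the nearest sample is an assumption about how the resampled length was obtained.

```rust
const HOP: usize = 480;

/// Number of samples after resampling, rounded to the nearest sample.
fn resampled_len(n: usize, from_hz: u32, to_hz: u32) -> usize {
    (n as f64 * to_hz as f64 / from_hz as f64).round() as usize
}

/// Returns (complete hops, leftover samples) for a buffer of n samples.
fn split_into_hops(n: usize) -> (usize, usize) {
    (n / HOP, n % HOP)
}

fn main() {
    let n48 = resampled_len(1536, 44_100, 48_000);
    let (hops, leftover) = split_into_hops(n48);
    let lost = leftover as f64 / n48 as f64 * 100.0;
    println!("{} samples at 48 kHz", n48);
    println!("{} complete hops, {} leftover samples", hops, leftover);
    println!("{:.1}% of the chunk discarded", lost);
}
```

The leftover of 232 samples is a little under 14% of the resampled chunk, and it recurs on every chunk when each one is processed in isolation.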

There was a second problem: the STFT inside DFState maintains internal overlap-add buffers between frames. When we restarted processing fresh for each chunk, these buffers were implicitly reset, causing a discontinuity at every chunk boundary — the “shshshsh” glitch heard every 35 milliseconds.

The ring buffer solution

Phase 6 replaced the per-chunk approach with continuous buffers that persist across the entire session:

[input_buf_44]  ← incoming 44100 Hz samples accumulate here
      │
  Resampler (continuous, never restarted)
      │
[proc_buf_48]   ← 48000 Hz samples accumulate here
      │  drain 480 at a time
  process_frame()
      │  480 enhanced samples
[output_buf_44] ← downsampled 44100 Hz samples accumulate here
      │  drain 1536 when ready
  ZeroMQ PUSH

Nothing resets between chunks. The resampler runs as a continuous stream. The STFT buffers inside DFState carry their state naturally from one frame to the next. The output buffer accumulates downsampled frames until it has exactly 1536 samples, then drains them.
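The buffering logic can be sketched with two `VecDeque`s from the standard library. This is a simplified sketch, not the project's code: both resampling stages are omitted, so samples are treated as if they were already at the processing rate, and `process_frame` is passed in as a closure standing in for the real model call. The struct name `StreamBuffers` is hypothetical.

```rust
use std::collections::VecDeque;

const HOP: usize = 480;
const CHUNK: usize = 1536;

/// Buffers that live for the whole session and are never reset.
struct StreamBuffers {
    proc_buf: VecDeque<f32>,
    output_buf: VecDeque<f32>,
}

impl StreamBuffers {
    fn new() -> Self {
        StreamBuffers {
            proc_buf: VecDeque::new(),
            output_buf: VecDeque::new(),
        }
    }

    /// Append new samples, then process every complete hop available.
    /// Samples that do not fill a hop stay queued for the next call.
    fn push<F>(&mut self, samples: &[f32], mut process_frame: F)
    where
        F: FnMut(&[f32]) -> Vec<f32>,
    {
        self.proc_buf.extend(samples.iter().copied());
        while self.proc_buf.len() >= HOP {
            let frame: Vec<f32> = self.proc_buf.drain(..HOP).collect();
            let enhanced = process_frame(&frame);
            self.output_buf.extend(enhanced);
        }
    }

    /// Return a full output chunk once enough samples have accumulated.
    fn pop_chunk(&mut self) -> Option<Vec<f32>> {
        if self.output_buf.len() < CHUNK {
            return None;
        }
        Some(self.output_buf.drain(..CHUNK).collect())
    }
}

fn main() {
    let mut buffers = StreamBuffers::new();
    let incoming = vec![0.0_f32; 1672];
    for _ in 0..3 {
        // Identity closure used in place of the real model.
        buffers.push(&incoming, |frame| frame.to_vec());
        while let Some(chunk) = buffers.pop_chunk() {
            println!("emitting {} samples", chunk.len());
        }
    }
}
```

The leftover samples from one call are simply processed at the start of the next, so nothing is discarded and no zero padding is needed. The cost is that the first output chunk only becomes available after enough input has accumulated.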

The result was immediate: the shshshsh glitch disappeared, long vowels were heard in full, and all words in a sentence were reproduced completely.


What We Changed on the AI Engine

The system went through seven phases of development. Here is a summary of the significant changes made to the AI engine specifically:

Phase 3: First Rust server (broken model)

The first attempt used a single ONNX model exported with torch.jit.trace. The export produced a 0.22 MB file in which the GRU hidden state was frozen as a constant, so the model output white noise. The pipeline itself worked perfectly (0% drop rate, 0.95ms inference), but the model was wrong.

Phase 5: Three-model orchestration

The broken single model was replaced with the three official ONNX models from the DeepFilterNet repository. The encoder, ERB decoder, and deep filter decoder were wired together correctly. The tensor shapes were discovered by running inspect_onnx_models.py, which revealed that the GRU states are internal to the graph and do not need to be managed externally. Audio quality improved from 1/5 to 4/5.

Phase 6: Continuous ring buffers

The per-chunk processing was replaced with continuous ring buffers as described above. The chunk boundary glitches and voice loss were eliminated. The model orchestration itself did not change; only the way audio was fed into it and collected from it did.

Phase 7: Voice Activity Detection

A VAD gate was added using the lsnr (local signal-to-noise ratio) output from the encoder. During Phase 6 testing, a structured silence/speech/silence experiment revealed that lsnr reliably separates voice from silence:

  • Silence: lsnr clusters at −10 to −15 dB
  • Speaking: lsnr clusters at +20 to +35 dB

A threshold of +10 dB was set, with a 750ms hold-off to prevent the gate from closing between syllables. When lsnr is below the threshold and the hold-off has expired, the server outputs silence instead of running the decoders. This eliminated the residual background noise that leaked through the ERB mask when no voice was present.

The VAD gate runs after the encoder but before the decoders. This means the encoder always runs (necessary to compute lsnr), but the more expensive erb_dec and df_dec are skipped during silence, slightly reducing average inference time.
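A threshold gate with a hold-off can be sketched as a small state machine. The sketch assumes the gate is evaluated once per 10ms frame, so a 750ms hold-off corresponds to 75 frames, and it assumes the gate starts closed. The struct `VadGate` and its method names are hypothetical.

```rust
/// Gate that opens on voice and stays open for a hold-off period afterwards.
struct VadGate {
    threshold_db: f32,
    hold_frames: u32,
    /// Frames elapsed since lsnr was last at or above the threshold.
    frames_since_voice: u32,
}

impl VadGate {
    /// `frame_ms` is the assumed interval between gate evaluations.
    fn new(threshold_db: f32, hold_ms: u32, frame_ms: u32) -> Self {
        let hold_frames = hold_ms / frame_ms;
        VadGate {
            threshold_db,
            hold_frames,
            // Start with the hold-off already expired, so the gate is closed.
            frames_since_voice: hold_frames,
        }
    }

    /// Feed one lsnr value; returns true if the decoders should run.
    fn is_open(&mut self, lsnr_db: f32) -> bool {
        if lsnr_db >= self.threshold_db {
            self.frames_since_voice = 0;
        } else {
            self.frames_since_voice = self.frames_since_voice.saturating_add(1);
        }
        self.frames_since_voice < self.hold_frames
    }
}

fn main() {
    let mut gate = VadGate::new(10.0, 750, 10);
    // Silence, then speech, then a short pause between syllables.
    for lsnr in [-12.0_f32, -14.0, 25.0, 30.0, -11.0, -13.0] {
        let action = if gate.is_open(lsnr) { "run decoders" } else { "output silence" };
        println!("lsnr {:>6.1} dB -> {}", lsnr, action);
    }
}
```

Because the counter resets on every voiced frame, short dips in lsnr between syllables keep the gate open, and only a sustained run of low values closes it.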


Results Summary

Experiment  Change                              Drop rate  Inference  Audio quality
E01         Baseline (Python, broken sleep)     47%        ~25ms      2/5
E02         Sleep fix (microsecond precision)   ~1%        ~25ms      3/5
E03         Rust + ONNX (broken model)          0%         0.95ms     1/5
E04         Three-model pipeline                ~0.06%     ~12ms      4/5
E05         Continuous ring buffers             ~0%        ~11ms      4/5
E06         Voice Activity Detection            ~0%        ~11ms      4/5+

The most impactful single change was the sleep precision fix, a one-line correction that reduced the drop rate from 47% to ~1%. The most impactful architectural change was moving to Rust with ONNX Runtime, which brought the drop rate to essentially zero and reduced inference time by more than half (~25ms to ~12ms). The ring buffer and VAD improvements addressed audio quality directly, eliminating the remaining glitches and background noise.


Conclusion

Building a real-time noise cancellation pipeline is fundamentally a latency and reliability problem. Every component in the chain must stay within a strict time budget, which in this case is 34.8 milliseconds per chunk. In this project, the Python server, despite Python's excellent AI ecosystem, could not reliably meet this budget because of GIL contention and garbage-collection pauses. The Rust server, combined with ONNX Runtime, could.

The DeepFilterNet3 model itself is sophisticated, but using it correctly requires understanding its three-model architecture, the STFT processing it expects, and the exact tensor shapes at each interface. The most reliable way to discover these is to inspect the model files directly rather than relying on documentation.

The final system runs entirely on CPU, processes audio in real time with less than 50ms of latency, drops essentially no chunks, and clearly separates voice from background noise.

GitHub link: https://github.com/Dextromethorpan/Noise_Cancellation