Noise Cancellation in C++ and Python: Phase 4 — Optimization, Experiments and the Rust Server

April 29, 2026 · Luciano Muratore

Phase 3 established the full pipeline — C++ capturing audio, ZeroMQ transporting it, Python running DeepFilterNet, and clean audio returning to the speakers. Phase 4 is about understanding why the audio was distorted and systematically improving it through three experiments.

The Starting Problem

After connecting all three components, the audio output was heavily distorted. Words were cut off mid-syllable in a pattern that sounded like a broken radio signal. The pipeline was clearly running — both terminals showed chunk counters incrementing — but something was badly wrong with the timing.

The Python server output revealed the root cause immediately:

Queue: in=10 out=1
Queue: in=10 out=1
Queue: in=10 out=1

The input queue was always at maximum capacity. C++ was sending audio chunks faster than Python could process them. Every time the queue filled up, chunks were dropped — and dropped chunks meant gaps in the audio output.
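
The drop behaviour can be reproduced in isolation. The sketch below is a minimal model, not the project's server code: it assumes a bounded queue of 10 slots (matching the in=10 readout) that discards a chunk whenever it is full. The simulate function and its tick-based timing are hypothetical.

```python
import queue


def simulate(n_chunks, consume_every, maxsize=10):
    """Offer one chunk per tick; the consumer removes one every `consume_every` ticks."""
    q = queue.Queue(maxsize=maxsize)
    dropped = processed = 0
    for tick in range(n_chunks):
        try:
            q.put_nowait(tick)      # the producer never blocks
        except queue.Full:
            dropped += 1            # full queue: the chunk is discarded
        if tick % consume_every == consume_every - 1:
            q.get_nowait()          # the consumer finishes one chunk
            processed += 1
    return dropped, processed, q.qsize()


dropped, processed, backlog = simulate(n_chunks=1000, consume_every=2)
print(f"dropped={dropped} processed={processed} backlog={backlog}")
```

With a consumer half as fast as the producer, the queue saturates within the first 20 ticks and roughly every second chunk is discarded from then on, which is the same steady state the in=10 readout describes.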

Experiment E01 — The Baseline

The first experiment established the baseline metrics. Running the pipeline with default settings produced these numbers:

Sent:     1205 chunks
Received:  635 chunks
Dropped:   570 chunks
Drop rate: 47%

Nearly half of all audio chunks were being discarded. The queue was full constantly, which meant Python was consistently processing more slowly than C++ was sending.

The sender thread used this sleep calculation:

std::this_thread::sleep_for(
    std::chrono::milliseconds(FRAMES * 1000 / SAMPLE_RATE));

At first glance this looks correct. With FRAMES = 1536 and SAMPLE_RATE = 44100, the intended sleep duration is 34.829 milliseconds, one audio buffer period. But FRAMES * 1000 / SAMPLE_RATE is integer division in C++, which truncates the result. The actual sleep was 34 milliseconds, not 34.829.

That 0.829 millisecond error accumulates. Every cycle ends slightly early, so the sender runs about 2.4% faster than real time (34.829 / 34 ≈ 1.024) and the surplus never gets a chance to drain. In the E01 run the queue filled and stayed full, and 47% of chunks were dropped. The audio sounds like a broken radio.
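
The truncation arithmetic can be checked directly. The sketch below mirrors the C++ integer division with Python's // operator. It assumes the sleep is the only thing that paces the sender and ignores the time the loop body itself takes. The 30-second session length is taken from the text above.

```python
FRAMES = 1536
SAMPLE_RATE = 44100

exact_ms = FRAMES * 1000 / SAMPLE_RATE        # true buffer period, about 34.83 ms
truncated_ms = FRAMES * 1000 // SAMPLE_RATE   # what integer division yields: 34

error_ms = exact_ms - truncated_ms            # sleep shortfall per cycle
speedup = exact_ms / truncated_ms             # sender rate relative to real time

session_ms = 30_000
chunks_real_time = session_ms / exact_ms      # chunks the audio actually needs
chunks_sent = session_ms / truncated_ms       # chunks the truncated sleep produces

print(f"exact period:      {exact_ms:.3f} ms")
print(f"actual sleep:      {truncated_ms} ms")
print(f"error per cycle:   {error_ms:.3f} ms")
print(f"rate vs real time: {speedup:.4f}x")
print(f"chunks in 30 s:    {chunks_sent:.0f} sent vs {chunks_real_time:.0f} needed")
```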

Experiment E02 — The Sleep Fix

The fix was one line: switching the sleep from milliseconds to microseconds.

std::this_thread::sleep_for(
    std::chrono::microseconds(FRAMES * 1000000 / SAMPLE_RATE));
// 1536 * 1000000 / 44100 = 34829 microseconds = 34.829ms (truncation is now under 1 microsecond)

The results were dramatic:

Before (ms sleep):   Drop rate 47%  | Queue 10/10 | Audio 2/5
After  (us sleep):   Drop rate  1%  | Queue  0/10 | Audio 3/5

The queue dropped from constantly full to almost always empty. Drop rate fell from 47% to 1%. Audio quality improved — slow speech was now intelligible through the headphones. The remaining 1% of drops came from occasional Python GIL pauses during heavy processing bursts, which are non-deterministic and cannot be eliminated by timing alone.

This experiment demonstrated a fundamental lesson: in real-time audio, timing precision matters more than processing speed. A 0.829ms timing error caused more damage than 25ms of inference latency.
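
Even the microsecond version still truncates, just by less than a microsecond per cycle. One common way to keep such errors from accumulating at all is to compute each send time from the chunk index rather than chaining relative sleeps. In C++ that maps onto std::this_thread::sleep_until with a std::chrono::steady_clock deadline. The sketch below is not taken from the project. It is an illustrative comparison with hypothetical function names, and it ignores the time the loop body itself takes.

```python
FRAMES = 1536
SAMPLE_RATE = 44100


def chained_send_time_us(i, sleep_us):
    """Time of the i-th send when every cycle sleeps the same truncated duration."""
    return i * sleep_us


def indexed_send_time_us(i):
    """Time of the i-th send computed from the index; truncation happens once, not per cycle."""
    return i * FRAMES * 1_000_000 // SAMPLE_RATE


n = 861  # roughly 30 seconds of audio at 1536 frames per chunk
ideal = indexed_send_time_us(n)
drift_ms_sleep = ideal - chained_send_time_us(n, 34_000)   # millisecond sleep
drift_us_sleep = ideal - chained_send_time_us(n, 34_829)   # microsecond sleep

print(f"after {n} chunks the ms sleep runs {drift_ms_sleep / 1000:.1f} ms ahead")
print(f"after {n} chunks the us sleep runs {drift_us_sleep / 1000:.1f} ms ahead")
```

The millisecond sleep ends up hundreds of milliseconds ahead of the audio clock after 30 seconds, while the microsecond sleep stays within about a millisecond. Index-based deadlines would remove even that residue.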

Experiment E03 — Rust and ONNX Runtime

The Python GIL is the root cause of the remaining 1% drop rate. When the garbage collector or another thread acquires the GIL, Python’s processing thread pauses unpredictably. For real-time audio, any pause longer than one buffer period (34.8ms) causes a gap in the output.

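Rust has no GIL and no garbage collector. Memory is managed through the ownership system, with deallocation points fixed at compile time rather than decided by a runtime collector. A Rust inference server should therefore avoid these unpredictable pauses, as long as inference itself stays within the buffer period.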

The plan was to export DeepFilterNet to ONNX format and run it with the ort crate — the official ONNX Runtime bindings for Rust.

The ONNX Export Attempt

The first export attempt used torch.jit.trace:

traced = torch.jit.trace(pipeline, dummy_input)
torch.onnx.export(traced, dummy_input, output_path, ...)

This produced a 0.22 MB file. The real DeepFilterNet3 model should be approximately 8.5 MB. The file was too small by a factor of 38.

The reason is that DeepFilterNet3 is not a single model — it is three separate ONNX models that must be orchestrated together:

enc.onnx (1.9 MB) — the encoder, converts raw audio into frequency domain features
erb_dec.onnx (3.3 MB) — the ERB decoder, predicts a noise suppression mask over equivalent rectangular bandwidth bands
df_dec.onnx (3.3 MB) — the deep filter decoder, applies complex spectral filtering

When torch.jit.trace ran a single forward pass, it captured the computation graph but froze the RNN hidden state as a constant. The resulting file contained only the frozen state pattern, not the learned weights. Running inference with this file produced a constant noise pattern — the white noise heard through the headphones.
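
The role of the hidden state can be illustrated without DeepFilterNet at all. The sketch below uses a one-value exponential smoother as a hypothetical stand-in for a recurrent layer. It is a toy, not the model's actual recurrence. Carrying the state from one chunk to the next reproduces the result of processing the whole signal. Restarting from a fixed state on every chunk, which is what a state baked in as a constant amounts to, does not.

```python
def smooth(chunk, state, alpha=0.9):
    """Toy recurrent step: each output depends on the previous output (the state)."""
    out = []
    for x in chunk:
        state = alpha * state + (1 - alpha) * x
        out.append(state)
    return out, state


signal = [float(i % 7) for i in range(24)]
chunks = [signal[i:i + 8] for i in range(0, 24, 8)]

# Reference: the whole signal processed in one pass.
reference, _ = smooth(signal, 0.0)

# Correct streaming: the state returned by one chunk feeds the next.
carried, state = [], 0.0
for c in chunks:
    out, state = smooth(c, state)
    carried += out

# Frozen state: every chunk starts from the same constant.
frozen = []
for c in chunks:
    out, _ = smooth(c, 0.0)
    frozen += out

print("carried matches reference:", carried == reference)
print("frozen matches reference: ", frozen == reference)
```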

The Rust Server

Despite the model issue, the Rust server itself performed correctly. The pipeline architecture was proven:

C++    → Sent:     2400+ chunks
C++    → Received: 2400+ chunks
Drop rate: 0%

Rust   → Avg inference: 0.95ms

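Zero drops over a 90-second session. Inference took 0.95ms per chunk, 26 times faster than the Python PyTorch server at 25ms, although the Rust figure was measured on the broken export and so is not a like-for-like comparison. No GIL, no pauses, no drops.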

The shshshsh white noise proved the pipeline was delivering audio to the headphones correctly. Every component worked: C++ captured audio, ZeroMQ transported it, Rust received and processed it, ZeroMQ returned it, C++ played it. The only broken piece was the model content.

The Official ONNX Models

The DeepFilterNet GitHub repository provides official ONNX models designed for the Rust implementation. These were downloaded and extracted:

ai/python/models/tmp/export/
├── config.ini      ← model configuration
├── enc.onnx        ← 1.9 MB encoder
├── erb_dec.onnx    ← 3.3 MB ERB decoder
└── df_dec.onnx     ← 3.3 MB deep filter decoder

Orchestrating these three models correctly through the DSP pipeline — including STFT analysis, feature extraction, stateful RNN inference across chunks, and ISTFT synthesis — is the task for Phase 5.

The Rust Build Process

Building the Rust server against ort 2.0.0-rc.12 required resolving four API issues specific to that release candidate:

session.run() required a Vec rather than a fixed-size array literal for its input.
try_extract_raw_tensor did not exist in this version; the correct method was try_extract_tensor.
The .into() conversion for tensor values was ambiguous due to multiple trait implementations; calling .into_dyn() first erased the type parameter and resolved the ambiguity.
session.run() in this version takes &mut self, which required declaring the session with let mut.

The complete build troubleshooting process is documented in docs/noise_server_report.docx.

The Windows Audio System

A significant portion of Phase 4 was spent understanding why audio output was not working through the headphones. The investigation revealed important details about how Windows handles audio:

Windows exposes the same physical audio device through four different APIs — MME, DirectSound, WASAPI, and WDM-KS. PortAudio sees each API’s representation as a separate device, resulting in 25+ entries in the device list for the same underlying hardware.

WDM-KS bypasses the Windows mixer and takes full control of the audio device, which kicks other applications (including YouTube) out. WASAPI does the same when it is opened in exclusive mode. Both are also strict about sample rates and refuse combinations that do not match the system configuration.

MME uses shared mode, where Windows mixes audio from all applications at a fixed sample rate. Multiple applications can use the device simultaneously. This is what Phase 1 used, and it is what works reliably.

The Bluetooth headphones added another complication. The WH-CH720N operates in two incompatible modes: A2DP stereo (high quality, no microphone) and HFP Hands-Free (low quality, microphone available). Using the Bluetooth headset microphone forces Windows to switch the headphones into HFP mode, which mutes the stereo output. The solution was to use the laptop’s built-in Realtek microphone as input and keep the Bluetooth headphones in stereo mode for output.

The working device combination:

Input:  device [1] Microfoon (Realtek Audio)     | MME | 44100 Hz
Output: device [4] Headphones (WH-CH720N Stereo) | MME | 44100 Hz

Benchmark Results

Before building the Rust server, three PyTorch inference approaches and a Python ONNX Runtime check were benchmarked. The Rust figure from E03 is included for comparison:

Method               Avg      Min      Max       Std      Budget   Result
PyTorch CPU          24.6ms   14.0ms   74.1ms    9.2ms    34.8ms   PASS
inference_mode       32.9ms   21.6ms   139.5ms   23.2ms   34.8ms   PASS
torch.compile        32.4ms   15.7ms   90.7ms    15.2ms   34.8ms   PASS (unsupported on Windows)
ONNX Runtime (Py)    0.2ms    0.1ms    1.9ms     0.3ms    34.8ms   PASS
ONNX Runtime (Rust)  0.95ms   n/a      n/a       n/a      34.8ms   PASS

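Plain PyTorch gave the most consistent results on CPU, with the lowest average and the smallest spread of the three PyTorch variants. The PASS verdicts evidently reflect typical rather than worst-case time, since the Max column shows every PyTorch variant occasionally exceeding the 34.8ms budget. The ONNX Runtime verification showed a 0.2ms average, roughly 125 times faster than PyTorch. Because the model export produced a broken file, that figure is not a like-for-like comparison, but it suggested that the ONNX approach is sound in principle.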
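
A table like this can be produced with a small timing harness. The sketch below is an assumption about how such a benchmark might look, not the script used for the numbers above. The names benchmark and verdict are hypothetical, and fn stands in for whatever callable runs one chunk of inference. It reports the average and the worst case against the budget separately, since the two can disagree.

```python
import statistics
import time

BUDGET_MS = 1536 / 44100 * 1000  # one buffer period, about 34.8 ms


def benchmark(fn, runs=50):
    """Time `fn` repeatedly and summarise the per-call latency in milliseconds."""
    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000)
    return {
        "avg": statistics.mean(samples_ms),
        "min": min(samples_ms),
        "max": max(samples_ms),
        "std": statistics.stdev(samples_ms),
    }


def verdict(stats, budget_ms=BUDGET_MS):
    """Check the average and the worst case against the real-time budget."""
    return {
        "avg_ok": stats["avg"] < budget_ms,
        "max_ok": stats["max"] < budget_ms,
    }


# The PyTorch CPU row from the table: fine on average, over budget at worst.
print(verdict({"avg": 24.6, "min": 14.0, "max": 74.1, "std": 9.2}))
```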

What Phase 4 Established

Three things were proven in Phase 4.

First, timing precision is the most important factor in real-time audio pipelines. A sub-millisecond error in the sleep calculation was enough to produce a 47% drop rate. Switching to microseconds cut it to 1%.

Second, the Rust pipeline architecture works. Zero drops, 0.95ms inference, no GIL pauses. The full C++ → ZeroMQ → Rust → ZeroMQ → C++ path is proven correct.

Third, model file size is a sanity check. A 0.22 MB file for a model that should be 8.5 MB is a clear signal that the export failed. Always verify model size before connecting it to a pipeline.
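
That check is easy to automate. The sketch below is one possible form, with a hypothetical check_model_size helper. It assumes 1 MB means 1,000,000 bytes and accepts anything within 50% of the expected size. Both choices are illustrative. The demo writes a 220,000-byte placeholder file to stand in for the failed export.

```python
import os
import tempfile


def check_model_size(path, expected_mb, tolerance=0.5):
    """Return (ok, size_mb); ok is False when the file is far from the expected size."""
    size_mb = os.path.getsize(path) / 1_000_000
    ok = abs(size_mb - expected_mb) <= tolerance * expected_mb
    return ok, size_mb


# Demo: a 0.22 MB placeholder checked against the expected 8.5 MB.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\0" * 220_000)
    demo_path = f.name

ok, size_mb = check_model_size(demo_path, expected_mb=8.5)
os.remove(demo_path)
print(f"size={size_mb:.2f} MB ok={ok}")
```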

Phase 5 will orchestrate the three official ONNX models through the deep_filter DSP pipeline in Rust, completing the pure C++/Rust noise cancellation system with no Python dependency at runtime.

GitHub link: https://github.com/Dextromethorpan/Noise_Cancellation