Noise Cancellation in C++ and Python: Phase 3 — Bridging C++ and Python with ZeroMQ

April 29, 2026 · Luciano Muratore

Phase 1 established the C++ audio pipeline. Phase 2 proved the Python AI model works in isolation. Phase 3 connects them into a single real-time system. This is where the architecture becomes real — and where the hard problems surface.

What the Bridge Needs to Do

The bridge has one job: move audio data between two separate processes with low enough latency that the result feels real-time. Roughly every 35 milliseconds (1536 samples at 44100 Hz last about 34.8 ms), C++ has a new chunk of 1536 audio samples ready. Python needs to receive those samples, run DeepFilterNet, and send back 1536 clean samples, all before the next chunk arrives.
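The time budget follows directly from the chunk size and the sample rate. A minimal sketch of that arithmetic, using only the two figures quoted above (the constant names are illustrative, not taken from the project):

```python
SAMPLE_RATE_HZ = 44_100    # device sample rate on the C++ side
FRAMES_PER_CHUNK = 1_536   # samples per chunk

# How long one chunk of audio lasts, and how many chunks arrive per second.
chunk_period_ms = FRAMES_PER_CHUNK / SAMPLE_RATE_HZ * 1000.0
chunks_per_second = SAMPLE_RATE_HZ / FRAMES_PER_CHUNK

print(f"chunk period:      {chunk_period_ms:.2f} ms")
print(f"chunks per second: {chunks_per_second:.2f}")
```

Everything on the Python side (receive, resample, inference, resample back, reply) has to fit inside that one chunk period for playback to stay continuous.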

[Mic] → [C++ 44100Hz] → [ZeroMQ] → [Python: resample + DeepFilterNet]
                                                    ↓
[Speakers] ← [C++ 44100Hz] ← [ZeroMQ] ← [Python: resample back]

Why ZeroMQ

ZeroMQ is a messaging library that lets two separate processes exchange data over a socket. It sits between raw TCP sockets (which require manual framing and protocol design) and full message brokers like RabbitMQ (which are designed for distributed systems at scale). For two processes on the same machine exchanging audio buffers, ZeroMQ hits the right balance of speed and simplicity.

The pattern used here is REQ/REP — request and reply. C++ sends a chunk of noisy audio and waits. Python receives it, processes it, and sends back clean audio. C++ receives the reply and the cycle repeats.

C++ (REQ socket)               Python (REP socket)
────────────────               ───────────────────
Send noisy audio      ──→      Receive noisy audio
                               Run DeepFilterNet
Receive clean audio   ←──      Send clean audio

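Both sides use the same local address, tcp://127.0.0.1:5555. One side binds to it and the other connects; conventionally the REP server binds and the REQ client connects. 127.0.0.1 is the loopback address, so the traffic stays entirely on the local machine and never reaches a physical network.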
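The reply side of this pattern can be sketched in a few lines. This is an illustration under stated assumptions, not the project's server: the names handle_request, serve, and process are hypothetical, the standard-library array module stands in for NumPy, and it assumes Python is the side that binds while C++ connects. The payload is treated as raw native-endian float32 samples, matching the sizeof(float) framing used on the C++ side.

```python
from array import array
from typing import Callable

FRAMES = 1536                        # samples per chunk, as in the article
ENDPOINT = "tcp://127.0.0.1:5555"    # address quoted in the article


def handle_request(payload: bytes, process: Callable[[array], array]) -> bytes:
    """Decode one float32 chunk, run `process` on it, and encode the reply."""
    samples = array("f")
    samples.frombytes(payload)       # raw native-endian float32 bytes
    if len(samples) != FRAMES:
        raise ValueError(f"expected {FRAMES} samples, got {len(samples)}")
    return process(samples).tobytes()


def serve(process: Callable[[array], array]) -> None:
    """Blocking REP loop: exactly one reply for every request received."""
    import zmq  # pyzmq; imported lazily so handle_request works without it

    context = zmq.Context()
    socket = context.socket(zmq.REP)
    socket.bind(ENDPOINT)            # assumption: Python binds, C++ connects
    try:
        while True:
            payload = socket.recv()  # blocks until C++ sends a chunk
            socket.send(handle_request(payload, process))
    finally:
        socket.close()
        context.term()
```

Calling serve(lambda chunk: chunk) would echo every chunk back unchanged, which is a convenient way to test the transport before the model is plugged in. REQ/REP enforces strict alternation, so neither side can send twice in a row.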

Building ZeroMQ on Windows

ZeroMQ is distributed primarily as source code, which is typical for C++ libraries. C++ binaries are tied to the compiler, compiler version, and build settings that produced them: a library compiled with GCC on Linux cannot be linked into an MSVC build on Windows. Building from source with the local toolchain avoids that mismatch.

Building it requires loading the MSVC environment and running CMake:

"C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\Build\vcvars64.bat"
cd C:\dev\zeromq
mkdir build && cd build
cmake .. -G "Visual Studio 17 2022" -A x64 -DBUILD_TESTS=OFF
cmake --build . --config Release

This produces libzmq-v143-mt-4_3_5.dll and libzmq-v143-mt-4_3_5.lib: the DLL that is loaded at run time (the engine) and the import library that the linker uses at build time (the connector), following the same pattern as PortAudio from Phase 1.

The Sample Rate Problem

DeepFilterNet expects audio at 48000 Hz. The Windows MME audio devices run at 44100 Hz. These two rates cannot be used together directly — a conversion is needed somewhere in the pipeline.

The conversion ratio from 44100 to 48000 is 48000/44100, which reduces to 160/147, a non-integer ratio. This is significantly harder to implement correctly in C++ than the clean 3x ratio that would exist between 16000 and 48000 Hz. A proper implementation requires a polyphase resampler, which is a non-trivial piece of signal processing code.
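The ratio can be checked with the standard library. The sketch below only does the arithmetic; it is not a resampler:

```python
from fractions import Fraction

# Fraction reduces to lowest terms automatically.
up_ratio = Fraction(48_000, 44_100)     # 44100 Hz -> 48000 Hz
easy_ratio = Fraction(48_000, 16_000)   # 16000 Hz -> 48000 Hz, for comparison

print(up_ratio)    # 160/147
print(easy_ratio)  # 3
```

In rational-resampling terms, 160/147 means upsampling by 160 and downsampling by 147, whereas the 16000 Hz case needs only an upsample by 3.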

Python handles this conversion instead, using torchaudio.transforms.Resample. This class performs band-limited (windowed-sinc) resampling, takes one line to set up, and handles the non-integer ratio correctly:

import torchaudio.transforms as T

resampler_up   = T.Resample(44100, 48000)  # before DeepFilterNet
resampler_down = T.Resample(48000, 44100)  # after DeepFilterNet

Keeping the resampling in Python means C++ stays simple and the model always receives exactly the format it expects.

The Windows Audio API Problem

Windows exposes audio devices through multiple host APIs — MME, WASAPI, DirectSound, and ASIO — each with different characteristics. PortAudio sees all of them and assigns each device a unique index number.

The first attempt used WASAPI, Microsoft’s modern low-latency audio API. WASAPI is strict: it runs in shared mode by default, where Windows mixes all audio at a fixed system sample rate, and opening a stream at a different rate fails. In addition, PortAudio does not allow a single stream to combine input and output devices that belong to different host APIs.

After testing, the correct approach was to use the MME default devices, the same ones that worked in Phase 1. MME is the oldest Windows audio API, dating back to the Windows 3.x era, but it is permissive and compatible. Both the default input and output devices run at 44100 Hz under MME, and they can be combined without error.

PaDeviceIndex inputDevice  = Pa_GetDefaultInputDevice();
PaDeviceIndex outputDevice = Pa_GetDefaultOutputDevice();

Using the system defaults also means the program respects whatever audio devices the user has configured in Windows Sound Settings, rather than hardcoding specific device indices that may change between sessions.

The Python Server

The Python server loads DeepFilterNet once at startup and then enters a loop, waiting for audio chunks from C++:

import numpy as np
import torch
from df.enhance import enhance  # DeepFilterNet's Python package

def process_chunk(audio_chunk: np.ndarray, model, df_state) -> np.ndarray:
    # Add a channel dimension: (samples,) -> (1, samples)
    audio_tensor = torch.from_numpy(audio_chunk).unsqueeze(0)
    audio_48k    = resampler_up(audio_tensor)            # 44100 -> 48000 Hz
    enhanced_48k = enhance(model, df_state, audio_48k)   # denoise at 48000 Hz
    enhanced_44k = resampler_down(enhanced_48k)          # 48000 -> 44100 Hz

    # Trim or pad to match original length exactly
    target_length = audio_chunk.shape[0]
    enhanced_np   = enhanced_44k.squeeze(0).numpy()
    if len(enhanced_np) > target_length:
        enhanced_np = enhanced_np[:target_length]
    elif len(enhanced_np) < target_length:
        # Pad the tail with zeros (silence)
        enhanced_np = np.pad(enhanced_np, (0, target_length - len(enhanced_np)))

    return enhanced_np

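The trim and pad step at the end is necessary because 1536 samples do not map to a whole number of samples at 48000 Hz (1536 × 48000/44100 ≈ 1671.8), so the round trip through the two resamplers can come back a sample or two longer or shorter than the input. C++ expects exactly 1536 floats in every reply, so without this correction the reply would not match the buffer it is copied into and could eventually cause a crash.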
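The length mismatch can be reproduced with integer arithmetic alone. The sketch below assumes the resampler rounds the output length up to the next whole sample; that rounding rule is an assumption made for illustration, and a resampler that rounds down would be off by one in the other direction. The function name is hypothetical.

```python
def resampled_length(n_samples: int, rate_in: int, rate_out: int) -> int:
    """Output length of a resampler that rounds up (assumed rounding rule)."""
    # Ceiling division on integers avoids floating-point rounding surprises.
    return -(-n_samples * rate_out // rate_in)


chunk = 1536
at_48k = resampled_length(chunk, 44_100, 48_000)   # upsample for the model
back = resampled_length(at_48k, 48_000, 44_100)    # downsample for playback

print(chunk, "->", at_48k, "->", back)
```

Under that assumption the round trip returns one sample more than it was given, which is exactly the case the trim branch handles.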

The C++ Processing Loop

The C++ side runs the processing loop on a background thread, leaving the main thread free to wait for user input. This separation is essential — if the processing loop blocked the main thread, the program could not be stopped cleanly.

The loop uses double buffering to ensure smooth playback. Rather than waiting for the current chunk to be processed before playing anything, it plays the previous clean chunk while waiting for the current one:

std::thread processingThread([&]() {
    std::vector<float> previousClean(FRAMES, 0.0f);

    while (running) {
        // Play previous clean chunk immediately
        state.outputBuffer = previousClean;
        state.cleanReady   = true;

        // Send current chunk to Python as raw float bytes
        zmq::message_t request(state.inputBuffer.data(),
                                state.inputBuffer.size() * sizeof(float));
        zmqSocket.send(request, zmq::send_flags::none);

        // Wait for clean audio (blocks until Python replies)
        zmq::message_t reply;
        auto received = zmqSocket.recv(reply, zmq::recv_flags::none);

        // Store for next iteration, but only if the reply has the expected
        // size; otherwise keep the previous chunk rather than read past the reply
        if (received && reply.size() == FRAMES * sizeof(float)) {
            memcpy(previousClean.data(), reply.data(), reply.size());
        }
    }
});

This introduces one chunk of deliberate delay, approximately 35 milliseconds, and ensures that the audio callback has something to play as long as Python replies within one chunk period.
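The one-chunk delay is easier to see in a small simulation. The sketch below is written in Python for brevity and is not the project's code: run_double_buffered and denoise are hypothetical names, and the denoise call stands in for the whole ZeroMQ round trip.

```python
from typing import Callable, List

Chunk = List[float]


def run_double_buffered(noisy_chunks: List[Chunk],
                        denoise: Callable[[Chunk], Chunk],
                        frames: int) -> List[Chunk]:
    """Return the chunks in the order the audio callback would play them."""
    previous_clean = [0.0] * frames        # silence until the first reply arrives
    played = []
    for noisy in noisy_chunks:
        played.append(previous_clean)      # play what is already available
        previous_clean = denoise(noisy)    # stands in for send + recv
    return played


chunks = [[1.0] * 4, [2.0] * 4, [3.0] * 4]
played = run_double_buffered(chunks, lambda c: c, frames=4)
print(played)
```

The first chunk played is silence and every later chunk is the cleaned version of the one captured just before it, which is the one-chunk lag described above.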

What the Pipeline Proved

Running the full pipeline processed over 1100 chunks continuously — more than 38 seconds of audio — without error or crash. Both terminals showed synchronized chunk counts, confirming that data was flowing correctly through ZeroMQ in both directions and that DeepFilterNet was processing every chunk.

C++    → Processed 1100 chunks (34.8 seconds)
Python → Processed 1100 chunks (38.3 seconds)

The small time difference between the two is the accumulated resampling and inference overhead on the Python side — expected and normal behavior.

The Remaining Challenge

DeepFilterNet on CPU takes approximately 50 to 100 milliseconds to process each chunk, while the audio buffer period is about 35 milliseconds. Python therefore cannot finish a chunk before the next one is due, which causes gaps in the output audio.

This is not an architectural problem — the pipeline design is correct. It is a compute problem. Phase 4 addresses it by exporting DeepFilterNet to ONNX format and running it with ONNX Runtime, which can reduce inference time by a factor of three or more on CPU, bringing it within the real-time budget.
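The size of the gap can be put in numbers. The sketch below uses only figures quoted in this article (the 50 to 100 ms inference range, the 1536-sample chunk at 44100 Hz, and the factor-of-three speedup); the function name is illustrative, and the speedup is a target rather than a measured result.

```python
CHUNK_PERIOD_MS = 1536 / 44_100 * 1000.0   # about 34.8 ms per chunk


def real_time_factor(inference_ms: float) -> float:
    """Processing time divided by audio duration; above 1.0 means too slow."""
    return inference_ms / CHUNK_PERIOD_MS


for inference_ms in (50.0, 100.0):
    rtf = real_time_factor(inference_ms)
    after_3x = inference_ms / 3.0          # hypothetical 3x speedup
    fits = after_3x < CHUNK_PERIOD_MS
    print(f"{inference_ms:.0f} ms: RTF {rtf:.2f}, "
          f"after 3x {after_3x:.1f} ms, fits budget: {fits}")
```

A real-time factor above 1.0 is also the minimum speedup needed. By this arithmetic a 3x speedup brings even the 100 ms case under the chunk period, though with little headroom left for resampling and transport.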

What This Phase Establishes

Phase 3 establishes the complete end-to-end pipeline. Audio flows from the microphone through C++, across a ZeroMQ socket to Python, through DeepFilterNet, back across the socket, and out to the speakers. Every component from Phase 1 and Phase 2 is now connected and working together.

The next phase focuses on making the pipeline fast enough to play the processed audio in real time without gaps.

GitHub link: https://github.com/Dextromethorpan/Noise_Cancellation