Noise Cancellation in C++ and Rust: Phase 6 — Continuous Stream Pipeline

April 29, 2026 · Luciano Muratore

The Problem with Phase 5

Phase 5 delivered something significant — three ONNX models orchestrated correctly, real noise suppression working, voice clearly audible through the headphones. But listening carefully revealed two persistent problems.

The first was a recurring glitch. Every 35 milliseconds, at each chunk boundary, there was a brief shshshsh artifact — a discontinuity in the audio that made continuous speech sound fragmented. Say a long vowel like “aaaaaaa” and you could hear it chopped into pieces.

The second was voice loss. Speaking a full sentence, only about 85% of it came through. The rest was silence. Fast speech was particularly affected — syllables would simply disappear.

Both problems had the same root cause.


Why Per-Chunk Processing Breaks

The Phase 5 architecture processed each incoming chunk independently:

chunk in (1536 samples @ 44100 Hz)
    │
    ▼ resample → ~1672 samples @ 48000 Hz
    │
    ▼ split into 480-sample hops (3 complete hops = 1440 samples)
    │
    ▼ process each hop
    │
    ▼ resample output back → ~1307 samples @ 44100 Hz
    │
    ▼ pad with zeros to 1536 samples
    │
chunk out

The arithmetic works against this approach. Upsampling 1536 samples from 44100 Hz to 48000 Hz produces approximately 1672 samples. Dividing by the hop size of 480 gives 3.48, which is not a whole number, so only 3 complete hops (1440 samples) can be processed. The remaining 232 samples are discarded. After downsampling, only about 1307 real samples come back, and the output is padded with zeros to reach the required 1536 samples. That padding is silence, and it accounts for the roughly 15% voice loss.
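The sample counts above can be checked with a short sketch. The names ChunkBudget, chunk_budget and padding_fraction are hypothetical and do not come from the project; the calculation uses the nominal resampling ratio, so a real resampler may deliver slightly different counts per call.

```rust
/// Sample counts for one chunk processed in isolation (the Phase 5 scheme).
#[derive(Debug, PartialEq)]
struct ChunkBudget {
    upsampled: usize,     // samples after resampling, rounded to nearest
    complete_hops: usize, // whole hops that can be processed
    leftover: usize,      // upsampled samples that do not fill a hop
}

/// Nominal budget for a chunk of `chunk_len` samples resampled from
/// `rate_in` to `rate_out` and split into hops of `hop` samples.
fn chunk_budget(chunk_len: usize, rate_in: u32, rate_out: u32, hop: usize) -> ChunkBudget {
    let upsampled =
        (chunk_len as f64 * rate_out as f64 / rate_in as f64).round() as usize;
    ChunkBudget {
        upsampled,
        complete_hops: upsampled / hop,
        leftover: upsampled % hop,
    }
}

/// Fraction of an output chunk that is zero padding, given how many
/// real samples were available to fill it.
fn padding_fraction(real_samples: usize, chunk_len: usize) -> f64 {
    (chunk_len - real_samples) as f64 / chunk_len as f64
}
```

Calling chunk_budget(1536, 44_100, 48_000, 480) gives 1672 upsampled samples, 3 complete hops and 232 leftover samples, and padding_fraction(1307, 1536) comes out just under 0.15.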

The chunk boundary glitch has a different but related cause. The STFT inside DFState maintains internal overlap-add buffers between frames. These buffers carry state from one frame to the next — they are what makes the windowed FFT produce a continuous output rather than blocky segments. When Phase 5 restarted the processing loop for each new chunk, these buffers were effectively reset, creating a discontinuity at every boundary.
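To make the effect of lost overlap-add state concrete, here is a toy sketch. It is not the STFT inside DFState: ToyOla, its 4-sample hop and the run helper are all hypothetical, and the processor simply windows and re-adds the signal so that, with state intact, the output is the input delayed by one hop.

```rust
use std::f32::consts::PI;

const HOP: usize = 4;
const WIN: usize = 2 * HOP;

/// Toy overlap-add processor: 50% overlap, periodic Hann window.
struct ToyOla {
    prev_in: [f32; HOP], // analysis memory: the previous input hop
    tail: [f32; HOP],    // synthesis memory: second half of the last windowed frame
}

impl ToyOla {
    fn new() -> Self {
        ToyOla { prev_in: [0.0; HOP], tail: [0.0; HOP] }
    }

    /// Periodic Hann window; window(n) + window(n + HOP) is 1 for n < HOP.
    fn window(n: usize) -> f32 {
        0.5 - 0.5 * (2.0 * PI * n as f32 / WIN as f32).cos()
    }

    /// Consume one hop and return one hop of output.
    fn process_hop(&mut self, hop: &[f32; HOP]) -> [f32; HOP] {
        let mut out = [0.0f32; HOP];
        for n in 0..HOP {
            // The first half of the current frame overlaps the stored tail.
            out[n] = self.tail[n] + Self::window(n) * self.prev_in[n];
            // The second half is kept for the next call.
            self.tail[n] = Self::window(n + HOP) * hop[n];
        }
        self.prev_in = *hop;
        out
    }
}

/// Run hops through the processor. With `reset_every = Some(k)` the state is
/// recreated every k hops, mimicking independent per-chunk processing.
fn run(hops: &[[f32; HOP]], reset_every: Option<usize>) -> Vec<f32> {
    let mut ola = ToyOla::new();
    let mut out = Vec::new();
    for (i, hop) in hops.iter().enumerate() {
        if let Some(k) = reset_every {
            if i > 0 && i % k == 0 {
                ola = ToyOla::new(); // state lost at the chunk boundary
            }
        }
        out.extend_from_slice(&ola.process_hop(hop));
    }
    out
}
```

Feeding a constant signal through run with no reset yields a constant output after the first hop. With a reset every three hops, the hop right after each reset comes out as zeros, which is the kind of periodic hole the text describes.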


The Ring Buffer Solution

Phase 6 replaces the per-chunk approach with two continuous buffers that persist across the entire session:

ZeroMQ PULL
    │ 1536 samples @ 44100 Hz
    ▼
[proc_buf_48] ← upsampled samples accumulate here continuously
    │ drain 480 at a time
    ▼
process_frame(): enc → erb_dec → df_dec → ISTFT
    │ 480 enhanced samples @ 48000 Hz
    ▼
[output_buf_44] ← downsampled samples accumulate here
    │ drain 1536 when ready
    ▼
ZeroMQ PUSH

The key principle is that nothing resets between chunks. The upsampler runs continuously, feeding proc_buf_48. The processing loop drains complete 480-sample hops from proc_buf_48 whenever they are available. The output goes through the downsampler continuously into output_buf_44. When output_buf_44 has accumulated 1536 samples, they are drained and sent back to the C++ engine.

The leftover samples from one chunk — the 232 that Phase 5 discarded — now simply remain in proc_buf_48 and are processed at the start of the next chunk. Nothing is ever thrown away. Nothing is ever padded with zeros.

The STFT buffers inside DFState flow naturally from one frame to the next because the processing never stops between chunks. The boundary no longer exists from the DSP’s perspective.
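The accumulate-and-drain idea can be sketched in isolation, without resamplers or models. HopAccumulator is a hypothetical stand-in for the role proc_buf_48 plays; it is an illustration of the pattern, not the project's code.

```rust
use std::collections::VecDeque;

/// Accumulates samples of any block size and hands out whole hops.
struct HopAccumulator {
    buf: VecDeque<f32>,
    hop: usize,
}

impl HopAccumulator {
    fn new(hop: usize) -> Self {
        HopAccumulator { buf: VecDeque::new(), hop }
    }

    /// Append a block and return every complete hop now available.
    /// Samples that do not fill a hop stay buffered for the next call.
    fn push(&mut self, samples: &[f32]) -> Vec<Vec<f32>> {
        self.buf.extend(samples.iter().copied());
        let mut hops = Vec::new();
        while self.buf.len() >= self.hop {
            hops.push(self.buf.drain(..self.hop).collect());
        }
        hops
    }

    /// Number of samples waiting for the next hop to fill up.
    fn pending(&self) -> usize {
        self.buf.len()
    }
}
```

With a hop of 480 and three successive blocks of 1672 samples, push returns 3, 3 and then 4 hops, leaving 232, 464 and 216 samples pending. The leftover from each block is consumed by the next one instead of being dropped.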


Implementation

The core of Phase 6 is the push_chunk method on StreamProcessor:

fn push_chunk(&mut self, input: &[f32]) -> Result<Option<Vec<f32>>> {
    // Upsample 44100 → 48000 continuously; the resampler keeps its own state
    let up_out = self.up.process(&[input.to_vec()], None)?;
    self.proc_buf_48.extend_from_slice(&up_out[0]);

    // Process all complete 480-sample hops; any remainder stays in proc_buf_48
    while self.proc_buf_48.len() >= HOP_SIZE {
        let hop: Vec<f32> = self.proc_buf_48.drain(..HOP_SIZE).collect();
        let enhanced = self.process_frame(&hop)?;
        // Downsample 48000 → 44100 and queue the result
        let down_out = self.down.process(&[enhanced], None)?;
        self.output_buf_44.extend_from_slice(&down_out[0]);
    }

    // Return exactly FRAMES_44 samples when ready
    if self.output_buf_44.len() >= FRAMES_44 {
        let out: Vec<f32> = self.output_buf_44.drain(..FRAMES_44).collect();
        Ok(Some(out))
    } else {
        // Not enough output buffered yet: reply with one chunk of silence,
        // so this function as written never returns None
        Ok(Some(vec![0.0f32; FRAMES_44]))
    }
}

The resamplers (self.up and self.down) are created once at startup and never recreated. The DFState inside process_frame is the same instance across the entire session. Both carry their internal state continuously without interruption.

The model orchestration inside process_frame is identical to Phase 5 — enc, erb_dec, df_dec in sequence. The only change is how audio is fed into it and collected from it.


Diagnosing with lsnr

Phase 6 introduced structured logging of the lsnr (local signal-to-noise ratio) output from the encoder. This value is computed by the encoder on every frame and estimates how much signal is present relative to noise.

Running the server with RUST_LOG=debug and conducting a controlled experiment — two minutes of silence, two minutes of speech, two minutes of silence — revealed a clear pattern:

Silence:  lsnr clusters at -10 to -15 dB consistently
Speaking: lsnr clusters at +20 to +35 dB consistently

This was not obvious from shorter tests where random room noise caused the values to appear erratic. The structured test showed that lsnr is a reliable voice activity signal with a clean separation at approximately +10 dB. This finding directly motivated Phase 7.
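As an illustration of how such a separation could be used, here is a minimal threshold classifier. The constant and function names are hypothetical, the +10 dB threshold is simply the value observed above, and nothing here is meant to describe how Phase 7 actually works.

```rust
/// Assumed decision threshold in dB, taken from the observed separation.
const LSNR_THRESHOLD_DB: f32 = 10.0;

/// Classify one frame as speech when its lsnr exceeds the threshold.
fn is_speech(lsnr_db: f32) -> bool {
    lsnr_db > LSNR_THRESHOLD_DB
}

/// Fraction of frames classified as speech; 0.0 for an empty slice.
fn speech_ratio(lsnr_values: &[f32]) -> f32 {
    if lsnr_values.is_empty() {
        return 0.0;
    }
    let voiced = lsnr_values.iter().filter(|&&v| is_speech(v)).count();
    voiced as f32 / lsnr_values.len() as f32
}
```

Values from the silence cluster (for example -12.0) fall below the threshold and values from the speech cluster (for example 25.0) fall above it. A practical detector would probably add smoothing or hysteresis so that a single outlier frame does not flip the decision.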


Results

Running the Phase 6 server (stream_server.exe) over 8250 chunks, approximately 4.8 minutes of audio, produced:

  • Drop rate: ~0%
  • Average inference: ~11 ms
  • proc_buf size: consistently 0 (no accumulation backlog)
  • out_buf size: oscillating between 600 and 1500 (healthy steady-state flow)
  • Long vowel test: all sound heard, no dropout
  • Full sentence test: all words reproduced

The chunk boundary glitch was gone. The voice loss was gone. The model orchestration had not changed — only the architecture around it.
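The timing figures can be put in context with a little arithmetic. The helper names below are hypothetical. The load calculation additionally assumes that the reported ~11 ms is the inference time per 1536-sample chunk, which the measurements above do not state explicitly.

```rust
/// Duration of one chunk in milliseconds.
fn chunk_duration_ms(chunk_len: usize, sample_rate: u32) -> f64 {
    chunk_len as f64 * 1000.0 / sample_rate as f64
}

/// Total audio duration of a session in minutes.
fn session_minutes(chunks: usize, chunk_len: usize, sample_rate: u32) -> f64 {
    chunks as f64 * chunk_len as f64 / sample_rate as f64 / 60.0
}

/// Share of the real-time budget used by processing (1.0 = no headroom).
/// Assumes `inference_ms` is measured per chunk.
fn realtime_load(inference_ms: f64, chunk_ms: f64) -> f64 {
    inference_ms / chunk_ms
}
```

For 1536 samples at 44100 Hz the chunk duration is about 34.8 ms, and 8250 such chunks add up to about 287 seconds, a little under five minutes. Under the per-chunk assumption, 11 ms of inference uses roughly a third of the available time, which would be consistent with a drop rate near zero.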


Key Lesson

Stateful DSP and independent per-chunk processing are fundamentally incompatible. The STFT overlap-add mechanism requires continuity across frames. Any architecture that treats each chunk as an independent unit will produce boundary artifacts regardless of how correct the model inference is.

The ring buffer pattern — accumulate input continuously, process as data becomes available, accumulate output continuously — is the correct architecture for real-time audio processing with stateful DSP. It should have been the starting point, not the result of debugging.

GitHub link: https://github.com/Dextromethorpan/Noise_Cancellation