Noise Cancellation in C++ and Rust: Phase 7 — Voice Activity Detection

April 29, 2026 · Luciano Muratore

The Remaining Problem After Phase 6

Phase 6 fixed the architecture. Audio flowed continuously, chunk boundaries were gone, and voice was fully reproduced. But one problem remained: when the user was not speaking, the output was not silent. Room noise (the hum of a computer fan, distant traffic, ambient room tone) leaked through the ERB mask and came out of the headphones.

This is the fundamental limitation of a noise suppression model running on imperfect features. The DeepFilterNet3 ERB decoder applies a gain mask to each frequency band, but if the features feeding the encoder are slightly wrong — as ours are, because the feature extraction is hand-rolled rather than ported exactly from the Python source — the mask gains are not aggressive enough. Some noise passes through.

For a remote communication use case, this matters. A colleague on the other end should hear silence when you are not speaking, not continuous low-level room noise.

The lsnr Signal

The Phase 6 lsnr analysis provided the solution. The encoder outputs a value called lsnr — local signal-to-noise ratio — on every frame. It estimates how much useful signal is present in the current audio relative to the noise floor.

The structured silence/speech/silence experiment from Phase 6 showed a clean separation:

Silence:  lsnr in [-15, -10] dB
Speaking: lsnr in [+20, +35] dB

A threshold of +10 dB sits cleanly between both clusters: 20 dB above the top of the silence range and 10 dB below the bottom of the speech range. This means lsnr can be used as a reliable Voice Activity Detection (VAD) signal without running any additional model or feature computation.
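As a minimal sketch of that separation, the check below classifies a frame from its lsnr value alone. The helper name `is_voice`, the constant name `INITIAL_THRESHOLD_DB`, and the sample values are illustrative assumptions; the samples are simply the edges of the two reported clusters, not logged data.

```rust
/// Initial threshold estimate from the Phase 6 experiment, in dB.
const INITIAL_THRESHOLD_DB: f32 = 10.0;

/// Hypothetical helper: true when the frame's lsnr is at or above the threshold.
fn is_voice(lsnr_db: f32, threshold_db: f32) -> bool {
    lsnr_db >= threshold_db
}

fn main() {
    // Illustrative values: the edges of the silence and speech clusters.
    let frames = [-15.0_f32, -10.0, 20.0, 35.0];
    for lsnr in frames.iter() {
        println!(
            "lsnr {:+.1} dB -> voice: {}",
            lsnr,
            is_voice(*lsnr, INITIAL_THRESHOLD_DB)
        );
    }
}
```

Passing the threshold as a parameter keeps the helper usable if the value is tuned later.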

The encoder already computes this value on every frame as part of its normal operation. Using it for VAD costs nothing — we just read a value we were already computing and ignoring.

The VAD Gate

The implementation is a gate that sits between the encoder and the two decoders:

if lsnr_val >= VAD_THRESHOLD_DB {
    // Voice detected — reset hold-off counter
    self.vad_holdoff = VAD_HOLDOFF_FRAMES;
} else if self.vad_holdoff > 0 {
    // Below threshold but within hold-off window — keep gate open
    self.vad_holdoff -= 1;
} else {
    // Gate closed — output silence, skip decoders
    return Ok(vec![0.0f32; HOP_SIZE]);
}

// Gate is open — run erb_dec and df_dec normally

When the gate is closed, the function returns a silent frame immediately without running erb_dec or df_dec. This has a secondary benefit: the two decoders are skipped during silence, slightly reducing average inference time.

The encoder always runs regardless of gate state. This is necessary because lsnr is computed by the encoder — you cannot know whether to open the gate without first asking the encoder.
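The ordering described above (encoder on every frame, decoders only when the gate is open) can be sketched in a self-contained form. This is not the code from stream_server_vad.rs: `Pipeline`, `run_encoder`, `run_decoders`, and the run counters are hypothetical stand-ins that replace the real models, and `HOP_SIZE = 480` is an assumption inferred from 10 ms frames at 48 kHz.

```rust
const HOP_SIZE: usize = 480; // assumed: 10 ms per frame at 48 kHz
const VAD_THRESHOLD_DB: f32 = 3.0;
const VAD_HOLDOFF_FRAMES: u32 = 75;

/// Hypothetical pipeline state with counters that record which stages ran.
struct Pipeline {
    vad_holdoff: u32,
    encoder_runs: u32,
    decoder_runs: u32,
}

impl Pipeline {
    fn new() -> Self {
        Pipeline { vad_holdoff: 0, encoder_runs: 0, decoder_runs: 0 }
    }

    /// Stand-in for the encoder: returns a precomputed lsnr instead of
    /// running a model, but is called on every frame like the real one.
    fn run_encoder(&mut self, lsnr_db: f32) -> f32 {
        self.encoder_runs += 1;
        lsnr_db
    }

    /// Stand-in for erb_dec and df_dec: passes the frame through unchanged.
    fn run_decoders(&mut self, frame: &[f32]) -> Vec<f32> {
        self.decoder_runs += 1;
        frame.to_vec()
    }

    fn process_frame(&mut self, frame: &[f32], lsnr_db: f32) -> Vec<f32> {
        // The encoder runs first, unconditionally: it produces lsnr.
        let lsnr_val = self.run_encoder(lsnr_db);
        if lsnr_val >= VAD_THRESHOLD_DB {
            self.vad_holdoff = VAD_HOLDOFF_FRAMES;
        } else if self.vad_holdoff > 0 {
            self.vad_holdoff -= 1;
        } else {
            // Gate closed: return silence without touching the decoders.
            return vec![0.0f32; HOP_SIZE];
        }
        self.run_decoders(frame)
    }
}

fn main() {
    let frame = [0.1f32; HOP_SIZE];
    let mut p = Pipeline::new();
    for _ in 0..3 {
        p.process_frame(&frame, -12.0); // silence-like lsnr
    }
    p.process_frame(&frame, 25.0); // speech-like lsnr
    println!("encoder runs: {}, decoder runs: {}", p.encoder_runs, p.decoder_runs);
}
```

In this run the encoder counter reaches 4 while the decoder counter reaches 1, which is the saving the gate provides during silence.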

The Hold-Off Counter

A naive threshold gate has a problem: during natural speech, there are brief silent gaps between syllables and words. The gap between “uno” and “dos” might be 80 ms. A gate that closes instantly would cut the tail of every word and the beginning of the next one, making speech sound clipped and unnatural.

The hold-off counter solves this. When lsnr exceeds the threshold, the counter is set to VAD_HOLDOFF_FRAMES. On subsequent frames where lsnr drops below the threshold, the counter decrements rather than the gate closing. Only when the counter reaches zero does the gate actually close.

const VAD_HOLDOFF_FRAMES: u32 = 75; // 750 ms at 10 ms per frame

750 milliseconds of hold-off means the gate stays open through any natural pause in speech up to three quarters of a second. This covers inter-word gaps, inter-syllable gaps, and brief hesitations. The trade-off is that the gate also stays open for up to 750 ms after speech genuinely ends, a tail that was not noticeable when transitioning from speech to silence.
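To make the effect of the counter concrete, the sketch below replays a synthetic lsnr sequence (speech, an 80 ms gap, speech) through a gate with and without hold-off. The `VadGate` type, the `count_closed` helper, and the lsnr values are illustrative assumptions, not taken from the repository; the gate logic mirrors the snippet shown earlier in the post.

```rust
/// Hypothetical standalone gate with the same three-branch logic as the post.
struct VadGate {
    threshold_db: f32,
    holdoff_frames: u32,
    remaining: u32,
}

impl VadGate {
    fn new(threshold_db: f32, holdoff_frames: u32) -> Self {
        VadGate { threshold_db, holdoff_frames, remaining: 0 }
    }

    /// Returns true when the gate is open for this frame.
    fn update(&mut self, lsnr_db: f32) -> bool {
        if lsnr_db >= self.threshold_db {
            self.remaining = self.holdoff_frames; // voice: re-arm the counter
            true
        } else if self.remaining > 0 {
            self.remaining -= 1; // below threshold, still inside hold-off
            true
        } else {
            false // hold-off expired: gate closed
        }
    }
}

/// Counts the frames in `lsnr` for which the gate is closed.
fn count_closed(gate: &mut VadGate, lsnr: &[f32]) -> usize {
    let mut closed = 0;
    for &value in lsnr {
        if !gate.update(value) {
            closed += 1;
        }
    }
    closed
}

fn main() {
    // 5 speech frames, an 8-frame (80 ms) gap, 5 speech frames.
    let mut lsnr = vec![25.0_f32; 5];
    lsnr.extend(vec![-12.0_f32; 8]);
    lsnr.extend(vec![25.0_f32; 5]);

    let with_holdoff = count_closed(&mut VadGate::new(3.0, 75), &lsnr);
    let without_holdoff = count_closed(&mut VadGate::new(3.0, 0), &lsnr);
    println!("closed frames with hold-off: {}", with_holdoff);
    println!("closed frames without hold-off: {}", without_holdoff);
}
```

With a 75-frame hold-off the gap never closes the gate; with no hold-off all 8 gap frames are muted, which is the clipping the counter is meant to prevent.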

The threshold and hold-off values are declared as named constants at the top of the file with comments explaining how they were derived:

/// VAD threshold, initially estimated from the structured silence/speech/silence test (E05).
/// Silence clusters at -15 to -10 dB. Speech clusters at +20 to +35 dB.
/// +10 dB sits cleanly between both; the value was later tuned to 3 dB (see Results).
const VAD_THRESHOLD_DB: f32 = 3.0;

/// Hold-off frames after voice is last detected.
/// Prevents gate from closing between syllables during fast speech.
/// Each frame = 10 ms @ 48 kHz. 75 frames = 750 ms.
const VAD_HOLDOFF_FRAMES: u32 = 75;
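The frame count follows directly from the frame duration. The sketch below shows the arithmetic; the helper names `frame_ms` and `holdoff_frames` are hypothetical, and `HOP_SIZE = 480` is an assumption inferred from the 10 ms at 48 kHz figure rather than a value quoted in this post.

```rust
const SAMPLE_RATE_HZ: u32 = 48_000;
const HOP_SIZE: u32 = 480; // assumed: samples per frame

/// Duration of one frame in milliseconds.
const fn frame_ms() -> u32 {
    HOP_SIZE * 1000 / SAMPLE_RATE_HZ
}

/// Number of whole frames that fit in `ms` milliseconds.
const fn holdoff_frames(ms: u32) -> u32 {
    ms * SAMPLE_RATE_HZ / (HOP_SIZE * 1000)
}

fn main() {
    println!("frame duration: {} ms", frame_ms());
    println!("750 ms hold-off: {} frames", holdoff_frames(750));
    println!("80 ms gap: {} frames", holdoff_frames(80));
}
```

Expressing the hold-off in milliseconds and deriving the frame count would keep the constant correct if the hop size or sample rate ever changed.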

A Separate Binary

Phase 7 introduced a third binary rather than modifying the Phase 6 server. This keeps all three phases runnable independently for comparison:

# Phase 5 — per-chunk pipeline (reference)
[[bin]]
name = "noise_server"
path = "src/main.rs"

# Phase 6 — continuous stream pipeline
[[bin]]
name = "stream_server"
path = "src/stream_server.rs"

# Phase 7 — continuous stream + VAD (current)
[[bin]]
name = "stream_server_vad"
path = "src/stream_server_vad.rs"

The VAD-specific additions in stream_server_vad.rs are clearly marked with block comments so the diff between Phase 6 and Phase 7 is immediately visible to any reader of the source:

// ╔══════════════════════════════════════════════════════╗
// ║  PHASE 7 — NEW: Voice Activity Detection gate        ║
// ║                                                      ║
// ║  lsnr >= VAD_THRESHOLD_DB → voice detected           ║
// ║  lsnr <  VAD_THRESHOLD_DB → output silence           ║
// ╚══════════════════════════════════════════════════════╝

The Real-Time Visualizer

Phase 7 also introduced a standalone HTML visualizer (Noise_Monitor.html) that opens in any browser and shows two audio channels simultaneously in real time using the Web Audio API:

Left panel (red): microphone input — the noisy signal entering the system
Right panel (green): headphone output via Stereo Mix — the clean signal leaving the system
VAD indicator: energy bar showing gate open/closed state
Live metrics: RMS levels, suppression amount in dB, peak frequency of each channel, gate state

The visualizer captures the microphone directly and uses the Windows Stereo Mix loopback device to capture the headphone output. This makes the audio problems visible — the difference between the red and green spectrum panels shows exactly which frequencies are being suppressed and which are leaking through.
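The RMS and suppression metrics are simple to state precisely. The visualizer itself runs in the browser, so the Rust below is only a sketch of the arithmetic under the usual definitions (root mean square of the samples, and 20·log10 of the input-to-output RMS ratio); the function names and the small floor value are assumptions, not code from Noise_Monitor.html.

```rust
/// Root mean square of a block of samples; 0.0 for an empty block.
fn rms(samples: &[f32]) -> f32 {
    if samples.is_empty() {
        return 0.0;
    }
    let sum_sq: f32 = samples.iter().map(|s| s * s).sum();
    (sum_sq / samples.len() as f32).sqrt()
}

/// Suppression in dB: positive when the output is quieter than the input.
fn suppression_db(input_rms: f32, output_rms: f32) -> f32 {
    // Floor both levels so a fully silent channel does not produce
    // a division by zero or log10(0).
    const FLOOR: f32 = 1e-10;
    20.0 * (input_rms.max(FLOOR) / output_rms.max(FLOOR)).log10()
}

fn main() {
    let mic = vec![0.5_f32; 480]; // synthetic input block
    let out = vec![0.05_f32; 480]; // synthetic output block, 10x quieter
    println!("mic RMS: {:.3}", rms(&mic));
    println!("out RMS: {:.3}", rms(&out));
    println!("suppression: {:.1} dB", suppression_db(rms(&mic), rms(&out)));
}
```

A tenfold reduction in RMS corresponds to about 20 dB of suppression, which gives a sense of scale for the live metric.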

Results

After tuning the threshold down from the initial +10 dB estimate to 3 dB, and setting the hold-off to 75 frames (750 ms):

Background noise during silence: eliminated
Fast speech reproduction: improved significantly, all words audible
High-frequency consonants: still somewhat muffled (root cause: feature extraction mismatch, not VAD)
Pipeline stability: maintained over extended sessions

The VAD gate solved the noise leakage problem cleanly. The remaining audio quality issues — muffled high-frequency consonants — are caused by the hand-rolled feature extraction not matching the training preprocessing exactly. That is the target for the next phase of improvement.

What the Visualizer Reveals

Running the visualizer while speaking shows the core remaining problem visually. The red spectrum (microphone) is rich across all frequencies, including high-frequency consonants above 4 kHz. The green spectrum (headphone output) shows energy concentrated in the low-to-mid range, with the high-frequency content largely absent.

This is the ERB mask being too aggressive on high-frequency bands — a direct consequence of the feature extraction mismatch. The model receives imprecise ERB features for the high bands and responds by suppressing them more than it should.
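One way to turn the visual difference into a number is to compare the share of spectral energy above 4 kHz in each channel. The sketch below assumes a magnitude spectrum is already available (for example from an FFT) with bin k centred at k × sample_rate / fft_size; the function name, the toy spectra, and the choice of 4 kHz as the cutoff are illustrative assumptions, not measurements from the project.

```rust
/// Fraction of total spectral energy at or above `cutoff_hz`.
/// `magnitudes[k]` is the magnitude of bin k, centred at
/// k * sample_rate_hz / fft_size.
fn high_band_fraction(
    magnitudes: &[f32],
    sample_rate_hz: f32,
    fft_size: usize,
    cutoff_hz: f32,
) -> f32 {
    let bin_hz = sample_rate_hz / fft_size as f32;
    let mut total = 0.0_f32;
    let mut high = 0.0_f32;
    for (k, m) in magnitudes.iter().enumerate() {
        let energy = m * m;
        total += energy;
        if k as f32 * bin_hz >= cutoff_hz {
            high += energy;
        }
    }
    if total > 0.0 { high / total } else { 0.0 }
}

fn main() {
    // Toy spectra: 8 bins, 1 kHz per bin (16 kHz sample rate, 16-point FFT).
    let mic = [1.0_f32; 8]; // flat: energy in every bin
    let out = [1.0_f32, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]; // highs removed
    println!("mic high-band share: {:.2}", high_band_fraction(&mic, 16_000.0, 16, 4_000.0));
    println!("out high-band share: {:.2}", high_band_fraction(&out, 16_000.0, 16, 4_000.0));
}
```

Tracking this ratio for input and output over a recording would give a single figure to compare before and after any change to the feature extraction.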

Making this visible was the goal of the visualizer. A problem you can see is a problem you can measure, and a problem you can measure is a problem you can fix.

Key Lesson

The encoder’s internal outputs are diagnostic tools, not just intermediate values to pass to the next model. lsnr was being computed on every frame from Phase 5 onwards and logged but ignored. Taking the time to analyse it properly — with a structured controlled experiment rather than a quick glance — revealed a clean, reliable signal that solved the noise leakage problem with a handful of lines of code.

Always examine what your models are already telling you before reaching for external solutions.

GitHub link: https://github.com/Dextromethorpan/Noise_Cancellation