Bug report - Regression of VAD quality between 3.1 and 4.0 (speech detected on perfect silence) #396

Jeronymous · 2023-11-18T19:05:49Z

🐛 Bug

On some audio, the VAD quality is much worse in the latest version, v4.0, than it was in v3.1.

More precisely, v4.0 detects speech during "quasi perfect" silent periods:

[Image: left, v4.0; right, v3.1]

Another user reports the same experience (and pointed me to v3): linto-ai/whisper-timestamped#74 (comment)

Also, I am having trouble reverting to v3.1. See the comment here: linto-ai/whisper-timestamped#142 (comment)
Maybe I missed something about how to handle versioning with silero-vad.
Any help on this PR would be very much appreciated.

To Reproduce

Audio to reproduce : https://github.com/linto-ai/whisper-timestamped/files/11220341/jon.zip

snakers4 · 2023-11-19T02:46:16Z

It is a known issue with near zero signals.
This is not a bug, this is a feature (tm).

The newer VAD tries to suppress spurious activations with subtle speech in the background.
We are still thinking about how to fix this, if it is possible at all, because suppressing some noise in the background and working on perfectly silent audio are mutually exclusive.

In any case, post-processing hyper-parameters should be tuned for each domain. Maybe the VAD should have a flag.
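
To make "post-processing hyper-parameters" concrete, here is a minimal sketch of how per-frame speech probabilities might be turned into segments. It is not silero-vad's own implementation; the function name `probs_to_segments` and the defaults for `threshold`, `min_speech_frames` and `min_silence_frames` are assumptions chosen for illustration.

```python
def probs_to_segments(probs, threshold=0.5, min_speech_frames=8, min_silence_frames=3):
    """Turn per-frame speech probabilities into (start_frame, end_frame) segments.

    Hypothetical helper: a segment opens at the first frame at or above
    `threshold`, closes after `min_silence_frames` consecutive frames below
    it, and is kept only if it spans at least `min_speech_frames` frames.
    """
    segments = []
    start = None   # first frame of the segment being built
    silence = 0    # consecutive sub-threshold frames seen inside it
    for i, p in enumerate(probs):
        if p >= threshold:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:
                end = i - silence + 1  # exclusive end, trailing silence trimmed
                if end - start >= min_speech_frames:
                    segments.append((start, end))
                start, silence = None, 0
    if start is not None:
        end = len(probs) - silence
        if end - start >= min_speech_frames:
            segments.append((start, end))
    return segments
```

Raising `threshold` or `min_speech_frames` discards short, low-confidence activations, at the cost of possibly missing brief or quiet speech, which is why the values depend on the domain.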

snakers4 · 2023-11-19T02:49:15Z

Yeah, looks like "perfect" silence:

x86Gr · 2023-12-03T11:14:47Z

> It is a known issue with near zero signals. This is not a bug, this is a feature (tm).
>
> The newer VAD tries to suppress spurious activations with subtle speech in the background. We are still thinking on how to fix this if possible at all.
>
> Because suppressing some noise in the background and working on perfectly silent audios are mutually exclusive. In any case post-processing hyper-parameters should be tuned for each domain. Maybe a VAD should have a flag.

Since the noise is really near-zero, could this be prevented by a preprocessing filter cutting long parts below a set threshold, or -20dB below the average RMS?
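
A rough sketch of what such a filter could look like, assuming mono audio in a NumPy array. The names `frame_rms`, `quiet_runs` and `cut_runs`, the 512-sample frame length, and the 16-frame minimum run (about 0.5 s at 16 kHz) are all assumptions for illustration, not anything provided by silero-vad.

```python
import numpy as np

def frame_rms(audio, frame_len=512):
    """RMS of consecutive non-overlapping frames (last frame zero-padded)."""
    audio = np.asarray(audio, dtype=np.float64)
    n_frames = int(np.ceil(len(audio) / frame_len))
    padded = np.zeros(n_frames * frame_len)
    padded[:len(audio)] = audio
    return np.sqrt(np.mean(padded.reshape(n_frames, frame_len) ** 2, axis=1))

def quiet_runs(audio, frame_len=512, rel_db=-20.0, min_frames=16):
    """Find (start_sample, end_sample) runs that stay `rel_db` dB below the
    overall RMS of the file for at least `min_frames` consecutive frames."""
    audio = np.asarray(audio, dtype=np.float64)
    rms = frame_rms(audio, frame_len)
    overall = np.sqrt(np.mean(audio ** 2))
    limit = overall * 10 ** (rel_db / 20)  # -20 dB is a factor of 0.1 in amplitude
    quiet = rms <= limit
    runs, start = [], None
    for i, q in enumerate(quiet):
        if q and start is None:
            start = i
        elif not q and start is not None:
            if i - start >= min_frames:
                runs.append((start * frame_len, i * frame_len))
            start = None
    if start is not None and len(quiet) - start >= min_frames:
        runs.append((start * frame_len, min(len(quiet) * frame_len, len(audio))))
    return runs

def cut_runs(audio, runs):
    """Drop the given sample ranges from the audio."""
    audio = np.asarray(audio)
    keep = np.ones(len(audio), dtype=bool)
    for start, end in runs:
        keep[start:end] = False
    return audio[keep]
```

Note that cutting samples shifts every later timestamp, so the removed ranges would have to be added back when mapping VAD output to the original timeline. An alternative is to keep the audio intact and only ignore detections that fall inside the quiet runs.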

snakers4 · 2023-12-05T08:16:01Z

To be solved with a V5 release

lixikun · 2024-01-04T02:50:09Z

> To be solved with a V5 release

Great job!

trivikramak · 2024-01-18T04:40:37Z

Any updates on this?

DanyPell · 2024-03-31T04:01:23Z

Pumped for next version of silero_vad.onnx!

yuGAN6 · 2024-06-03T01:36:24Z

> Since the noise is really near-zero, could this be prevented by a preprocessing filter cutting long parts below a set threshold, or -20dB below the average RMS?

You're right. I tried computing the RMS per frame with a threshold of RMS*1000 < 10, and it works well.
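
A minimal sketch of that per-frame check. It assumes float audio scaled to [-1, 1], so that RMS*1000 < 10 means RMS < 0.01, and a 512-sample frame. Both are assumptions, as are the helper names `silent_frame_mask` and `gate_speech_probs`. How the mask is applied is also a guess; here it simply forces the VAD probability to zero on flagged frames.

```python
import numpy as np

def silent_frame_mask(audio, frame_len=512, scaled_limit=10.0):
    """True for every full frame whose RMS * 1000 is below `scaled_limit`.

    Trailing samples that do not fill a whole frame are ignored.
    """
    audio = np.asarray(audio, dtype=np.float64)
    n_frames = len(audio) // frame_len
    frames = audio[:n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    return rms * 1000 < scaled_limit

def gate_speech_probs(probs, mask):
    """Force the speech probability to 0 on frames flagged as silent."""
    return np.where(mask, 0.0, np.asarray(probs, dtype=np.float64))
```

For int16 input the same threshold would need rescaling (for example, dividing the samples by 32768 first).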

x86Gr · 2024-06-03T10:56:41Z

> > Since the noise is really near-zero, could this be prevented by a preprocessing filter cutting long parts below a set threshold, or -20dB below the average RMS?
>
> you're right. I tried compute RMS per frame and set a threshold of RMS*1000 < 10 and it works well.

Good. By "frame", do you mean the 512/1024/1536 samples?

Jeronymous added the "bug" label on Nov 18, 2023

Jeronymous assigned snakers4 on Nov 18, 2023

Jeronymous mentioned this issue on Nov 23, 2023: "Use silero v3.1" (linto-ai/whisper-timestamped#142, merged)

snakers4 added the "v5" label on Dec 5, 2023

TedTimbrell mentioned this issue on May 16, 2024: "Silero-VAD Meta Hallucinations" (SYSTRAN/faster-whisper#843, open)

IntendedConsequence mentioned this issue on May 21, 2024: "Fix the decoding issues" (ggerganov/whisper.cpp#1768, open)
