Bug report - Regression of VAD quality between 3.1 and 4.0 (speech detected on perfect silence) #396

Open
Jeronymous opened this issue Nov 18, 2023 · 9 comments
Labels: bug (Something isn't working), v5 (Useful information for V5 release)

Jeronymous commented Nov 18, 2023

🐛 Bug

On some audio, the quality of the VAD is much worse in the latest version, v4.0, than it was in v3.1.

More precisely, v4.0 detects speech during periods of "quasi-perfect" silence:
image

  • left: v4.0
  • right: v3.1

Another user reports the same experience (and pointed me to v3): linto-ai/whisper-timestamped#74 (comment)

Also, I am having trouble reverting to v3.1. See the comment here: linto-ai/whisper-timestamped#142 (comment)
Maybe I missed something about how to handle versioning with silero-vad.
Any help on this PR would be very much appreciated.

To Reproduce

Audio to reproduce: https://github.com/linto-ai/whisper-timestamped/files/11220341/jon.zip

@Jeronymous Jeronymous added the bug Something isn't working label Nov 18, 2023
@snakers4
Owner

It is a known issue with near zero signals.
This is not a bug, this is a feature (tm).

The newer VAD tries to suppress spurious activations with subtle speech in the background.
We are still thinking about how to fix this, if it is possible at all.

Suppressing some noise in the background and working on perfectly silent audio are mutually exclusive goals. In any case, post-processing hyperparameters should be tuned for each domain. Maybe a VAD should have a flag.

@snakers4
Owner

Yeah, looks like "perfect" silence:

image

x86Gr commented Dec 3, 2023

It is a known issue with near zero signals. This is not a bug, this is a feature (tm).

The newer VAD tries to suppress spurious activations with subtle speech in the background. We are still thinking about how to fix this, if it is possible at all.

Suppressing some noise in the background and working on perfectly silent audio are mutually exclusive goals. In any case, post-processing hyperparameters should be tuned for each domain. Maybe a VAD should have a flag.

Since the noise is really near zero, could this be prevented by a preprocessing filter that cuts long stretches falling below a set threshold, or 20 dB below the average RMS?
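
A minimal sketch of the relative variant of such a filter, in plain Python. It is only an illustration, not something from silero-vad: the function names, the 512-sample frame size, and the reading of "average RMS" as the mean of the per-frame RMS values are all assumptions. It also flags every quiet frame rather than only long runs of them.

```python
import math

def frame_rms(samples, frame_size=512):
    """Return the RMS of each consecutive, non-overlapping frame."""
    rms = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        rms.append(math.sqrt(sum(x * x for x in frame) / len(frame)))
    return rms

def quiet_frames(samples, frame_size=512, drop_db=20.0):
    """Flag frames whose RMS is at least `drop_db` dB below the mean frame RMS.

    Assumption: "average RMS" is taken as the mean of the per-frame values.
    """
    rms = frame_rms(samples, frame_size)
    if not rms:
        return []
    mean_rms = sum(rms) / len(rms)
    # 20 dB below corresponds to an amplitude ratio of 10 ** (-20 / 20) = 0.1.
    limit = mean_rms * 10 ** (-drop_db / 20.0)
    return [r <= limit for r in rms]
```

Frames flagged `True` could then be dropped or forced to non-speech before the VAD runs. A relative limit adapts to the recording level, but on a file that is quiet throughout it would flag very little, which is where a fixed threshold may be the simpler choice.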

@snakers4 snakers4 added the v5 Useful information for V5 release label Dec 5, 2023
@snakers4
Owner

snakers4 commented Dec 5, 2023

To be solved with a V5 release

lixikun commented Jan 4, 2024

To be solved with a V5 release

Great job!

@trivikramak

Any updates on this?

@DanyPell

Pumped for next version of silero_vad.onnx!

@yuGAN6
Contributor

yuGAN6 commented Jun 3, 2024

Since the noise is really near zero, could this be prevented by a preprocessing filter that cuts long stretches falling below a set threshold, or 20 dB below the average RMS?

You're right. I tried computing the RMS per frame with a threshold of RMS*1000 < 10, and it works well.
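
For readers who want to try this, here is a rough sketch of such a gate. The comment does not say what was done with the frames that fall under the threshold, so several things here are assumptions: samples are floats in [-1, 1] (so RMS*1000 < 10 means RMS < 0.01), frames are 512 samples, gated frames are simply given a speech probability of 0.0, and `speech_prob` is a hypothetical callable standing in for the VAD model.

```python
import math

def is_silent(frame, scale=1000.0, limit=10.0):
    """True when RMS * scale < limit (RMS * 1000 < 10, i.e. RMS < 0.01)."""
    if not frame:
        return True
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    return rms * scale < limit

def gated_probs(samples, speech_prob, frame_size=512):
    """Run `speech_prob` only on frames that pass the RMS gate.

    `speech_prob` is a hypothetical callable (frame -> probability) standing
    in for the VAD model; gated frames get probability 0.0 without calling it.
    """
    probs = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        probs.append(0.0 if is_silent(frame) else speech_prob(frame))
    return probs
```

If the model keeps internal state across frames, skipping frames changes what it sees, so an alternative is to still run the model on every frame and only override its output for the gated ones.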

x86Gr commented Jun 3, 2024

Since the noise is really near zero, could this be prevented by a preprocessing filter that cuts long stretches falling below a set threshold, or 20 dB below the average RMS?

You're right. I tried computing the RMS per frame with a threshold of RMS*1000 < 10, and it works well.

Good. By "frame", do you mean the 512/1024/1536 samples?
