
❓ VAD robustness to noise-only signals in ONNX v3 vs. v4 models #369

Open
cassiotbatista opened this issue Sep 11, 2023 · 9 comments
Labels: help wanted (Extra attention is needed), v5 (Useful information for V5 release)

@cassiotbatista

Hello!

First of all thanks for the VAD model, it is great and really helpful!

I've been doing some experiments with the 16 kHz ONNX models in order to establish a baseline on noisy speech as well as on purely non-speech data. Results on the former, for both the AVA Speech and LibriParty datasets, seem to be in accordance with the quality-metrics section of Silero's wiki: v4 is indeed better than v3.

However, for noise-only signals, I've been getting consistently 2-3x worse results from v4 than from v3 on ESC-50, UrbanSound8k and FSD50K. This is especially concerning in an always-on scenario (an "in the wild" one, say) where the VAD is used as a pre-processing front-end to avoid calling a more power-hungry system, which is often the case.

The following table shows the values for the error-rate metric, namely 1 - acc, where acc is sklearn's accuracy_score, so lower means better. The numbers being measured are the sigmoid'ed outputs of both models' forward method (early-returned from the get_speech_timestamps() utility), with a threshold of 0.5 and a window size of 1536 samples. A sketch of this evaluation follows the table.

dataset        silero py v3   silero py v4
ava speech     0.2094         0.1545
libriparty     0.1610         0.0576
esc50          0.0407         0.1291
urbansound8k   0.0829         0.2444
fsd50k         0.0640         0.1120
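
Roughly, the per-window evaluation looks like this (a simplified sketch using the JIT model from torch.hub and a placeholder file name, not the exact ONNX script behind the table):

```python
# Simplified sketch of the frame-level evaluation: slide a 1536-sample window
# over a noise-only clip, take the model's per-window speech probability,
# binarize at 0.5 and compute 1 - accuracy against an all-zero reference.
import torch
from sklearn.metrics import accuracy_score

SAMPLING_RATE = 16000
WINDOW = 1536       # samples per window, as in the table above
THRESHOLD = 0.5

model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
(get_speech_timestamps, _, read_audio, *_) = utils

wav = read_audio('noise_only_clip.wav', sampling_rate=SAMPLING_RATE)  # placeholder file

preds = []
for start in range(0, len(wav) - WINDOW + 1, WINDOW):
    chunk = wav[start:start + WINDOW]
    prob = model(chunk, SAMPLING_RATE).item()   # per-window speech probability
    preds.append(int(prob > THRESHOLD))
model.reset_states()                            # clear the recurrent state between files

labels = [0] * len(preds)                       # noise-only: no speech anywhere
error_rate = 1.0 - accuracy_score(labels, preds)
print(f'1 - acc = {error_rate:.4f}')
```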

I'm sharing the uttids of the files I've been using in my experiments. It is not exactly ready to go, because I re-segmented and dumped resampled versions of the datasets to disk, but I believe it should be useful and even reproducible if necessary. The format is uttid,bos,eos,label, where bos and eos are the start and end of the speech segment. A value of -1 in those fields means there is no speech segment at all 😄

test_files.tar.gz
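
In case it helps, the rows can be read with something along these lines (the path is a placeholder):

```python
# Sketch for reading the shared list: each row is uttid,bos,eos,label,
# where bos/eos delimit the speech segment and -1 means no speech at all.
import csv

def load_reference(path):
    reference = {}
    with open(path, newline='') as f:
        for uttid, bos, eos, label in csv.reader(f):
            bos, eos = float(bos), float(eos)
            reference.setdefault(uttid, []).append({
                'bos': bos,
                'eos': eos,
                'label': label,
                'has_speech': bos >= 0 and eos >= 0,
            })
    return reference
```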

My environment:

  • Python: 3.9.17
  • PyTorch: 2.0.1+cpu
  • Torchaudio: 2.0.2+cpu
  • ONNX: 1.14.1
  • ONNX Runtime: 1.15.1

Finally, some questions:

  • Did you also observe this behaviour on noise-only (non-speech) data?
  • Could this be due to v4's encoder shrinking w.r.t. v3's (see the number of parameters below, taken from the JIT models)? Or is this more of a training-data issue?
  • Do these numbers make sense at all? Am I doing something wrong? If so, I'd appreciate some directions.
silero_vad.jit (v3): 125,144 parameters
  encoder       53,520
  decoder          130
  first_layer    4,934
  lstm          66,560
--
silero_vad.jit (v4): 90,141 parameters
  encoder       13,680
  decoder       66,625
  first_layer    9,836
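
(The per-module counts above came from iterating over the JIT models' submodules, roughly like the snippet below; the exact child names and nesting depend on how the graph was scripted, so treat it as a sketch.)

```python
# Sketch of how the per-module parameter counts were obtained:
# load the JIT model and sum parameter sizes per top-level child module.
# If the submodules are nested one level deeper, named_modules() is needed instead.
import torch

model = torch.jit.load('silero_vad.jit')
total = 0
for name, child in model.named_children():
    n = sum(p.numel() for p in child.parameters())
    total += n
    print(f'{n:>8,d}  {name}')
print(f'{total:>8,d}  total')
```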

Thanks!

@cassiotbatista cassiotbatista added the help wanted Extra attention is needed label Sep 11, 2023
@snakers4
Owner

Hi!

This is definitely an interesting area to cover for v5; we did not really think about it explicitly before!
You see, we viewed VAD as separating speech / noised speech from everything else (silence, mild noise, music).

This poses the question of separating speech from extremely noisy backgrounds, if I understand correctly. Or of the case where there is always noise and only sometimes speech.

> However, for noise-only signals, I've been getting consistently 2-3x worse results from v4 than from v3
> Could this be due to v4's encoder shrinking w.r.t. v3's? Or is this more of a training-data issue?

We simply did not optimize for this metric, so it is more or less random.
But our data construction prefers mild noise and more or less clean speech.
In a nutshell, we simply did not optimize for this scenario.

> Did you also observe this behaviour on noise-only (non-speech) data?

We have observed that our VAD does not behave very well with very loud noise.

> Do these numbers make sense at all? Am I doing something wrong? If so, I'd appreciate some directions.
> The numbers being measured are the sigmoid'ed outputs of both models' forward method (early-returned from the get_speech_timestamps() utility), with a threshold of 0.5 and a window size of 1536 samples.

Yes, this makes sense.
There are a lot of gimmicks in the get_speech_timestamps method to make speech detection more robust.
We will try to (i) replicate your metrics, (ii) see if applying more of that method improves the results, and (iii) adopt the task long-term.
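
For instance, the post-processing knobs exposed there look roughly like this (a usage sketch; the values are only illustrative, not recommendations):

```python
# Sketch of running the full utility instead of raw per-window probabilities;
# the keyword arguments below are the post-processing knobs that turn raw
# probabilities into speech segments.
import torch

model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
(get_speech_timestamps, _, read_audio, *_) = utils

wav = read_audio('example.wav', sampling_rate=16000)  # placeholder file
speech_timestamps = get_speech_timestamps(
    wav, model,
    threshold=0.5,                # binarization threshold on the speech probability
    sampling_rate=16000,
    min_speech_duration_ms=250,   # drop very short detections
    min_silence_duration_ms=100,  # merge segments separated by short silences
    speech_pad_ms=30,             # pad each detected segment
)
print(speech_timestamps)          # [{'start': ..., 'end': ...}, ...] in samples
```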

The good news is also that we have received a bit of support for our project, so it will enjoy some attention in the near future with regard to customization, generalization and flexibility.

@cassiotbatista
Author

Hello!

Thank you for your response, @snakers4.

> This poses the question of separating speech from extremely noisy backgrounds, if I understand correctly. Or of the case where there is always noise and only sometimes speech.

Yes, it is not exactly about "detecting speech" but rather about "not triggering on non-speech". What I had in mind is slightly related to the latter: something like the idle periods of an ASR-based dictation application in which the VAD is always on. To my mind, v4 would trigger, say, twice as often as v3 on background noises (such as a dog barking), which in turn might leave the ASR exposed. For IoT applications, on the other hand, it also means unnecessarily calling a more power-hungry system more frequently.

> We simply did not optimize for this metric, so it is more or less random.

Ok, got it!

> There are a lot of gimmicks in the get_speech_timestamps method to make speech detection more robust.

In fact, I only used the windowing and forward call from get_speech_timestamps(), and evaluated right after the binarization step, i.e. on the models' output posteriors rather than on the timestamps. Perhaps I should continue the tests at the segment level (e.g., the best model should have the lowest total duration of wrongly-detected speech segments on noise-only data); see the sketch below. I believe v4 would still behave worse there, though maybe not by the same 2-3x margin.
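
Something like the following, assuming the timestamp dicts returned by get_speech_timestamps (start/end in samples):

```python
# Sketch of a segment-level metric for noise-only clips: every detected
# segment is a false trigger, so sum up the falsely flagged duration.
def false_trigger_seconds(speech_timestamps, sampling_rate=16000):
    # speech_timestamps: [{'start': ..., 'end': ...}, ...] in samples,
    # as returned by get_speech_timestamps for a noise-only file
    total = sum(ts['end'] - ts['start'] for ts in speech_timestamps)
    return total / sampling_rate
```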

In any case, while waiting for - and looking forward to - v5, if you would be so kind as to report your attempts to replicate the numbers in that table, I'd be happy to hear about them!

@IntendedConsequence

This sounds related to my experience as well. After using v4 for a while I had to go back to v3. While overall speech detection seemed a bit better in v4, and more precise near word boundaries, it exhibits a consistent tendency towards false positives: long stretches of non-speech (1-2 minutes) at the beginning and end of audio files are mistakenly flagged as containing speech. For my use cases this isn't worth the minor accuracy increase; I can simply increase the padding between speech segments.

Now I'm not ruling out a mistake in my code, and I have never tested it formally, but subjectively it seems like it might be related to this issue.

@dgoryeo

dgoryeo commented Oct 21, 2023

@IntendedConsequence, just a quick novice question: how does one invoke the v3 model? Thanks.

@IntendedConsequence

@dgoryeo I'm not sure what to tell you. I don't use Python for silero v3/v4 anymore, just the onnxruntime C API. If I were you, I guess I would start by checking out an older repository revision from before the v4 update: https://github.com/snakers4/silero-vad/tree/v3.1
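
If you're in Python, I believe you can also pin that tag directly through torch.hub, something like this (untested on my end):

```python
# Sketch: load the v3.1 tag through torch.hub instead of the default branch,
# so the v3 JIT model and its matching utils are fetched.
import torch

model, utils = torch.hub.load(
    'snakers4/silero-vad:v3.1',  # repo:tag syntax accepted by torch.hub
    'silero_vad',
    force_reload=True,           # avoid reusing a cached v4 checkout
)
```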

@snakers4
Owner

We have finally been able to start work on V5 using this data, among others.

@Jellun

Jellun commented Nov 23, 2023 via email

@snakers4
Owner

snakers4 commented Dec 5, 2023

To be solved with a V5 release.

@snakers4 snakers4 added the v5 Useful information for V5 release label Dec 5, 2023
@rishikksh20

@snakers4 Can we fine-tune the VAD on our own data? We have in-house segmented data; I'd just like to ask whether it is possible to fine-tune this model or not.
I was not able to find any fine-tuning code in this repo.
