Can this be used to mute non speech parts of an audio? #441

orionflame · 2024-04-07T07:34:44Z

Hi,

I recorded a lot of narration for a tutorial I made, and I am trying to clean up the audio files by removing anything that is not speech, which is mostly throat clearing and similar noises. Here is a very short sample:
https://www.dropbox.com/scl/fi/kotmse874x4rsi86kr8f8/voice3.mp3?rlkey=l5m56g5axort1ru70goo3rvch&dl=1

I tried whisperHallu, but it cut some words off halfway.

All I need is to keep only the speech parts. After that I will have to figure out a way to remove retakes. Sometimes a retake is one word and sometimes it is half a sentence repeated multiple times, but it is always the last take that should be kept.

I see that this library has a remove_repetition method. I guess I have to specify a large max word argument, like 100? In the end, what I have to do is keep the last instance of each retake but move it to the time of the first instance, as that is where the correct take should sit.

Unfortunately, using these timings does not remove the throat-clearing parts (marked in green).

The VAD visualization that uses Silero VAD, as in the stable-ts docs, shows this:

As you can see, it eats into actual speech but does not really remove the throat-clearing parts.

Would appreciate any tips on this.

Any ideas?

Answered by adamnsandle

orionflame (Author) · 2024-04-07T08:50:09Z

I wrote this code and got this result:

https://www.dropbox.com/scl/fi/ipf441er11lp2hp0mkfwr/voice3_SPEECH.mp3?rlkey=fmz2wg0k1dg31oyi09bt7zkbi&dl=1

If you listen to it, you can hear that all the coughing and throat clearing is still there in the speech-only output. Is there a way to filter those out? I thought they would not be counted as speech.

```python
import os

import torch

torch.set_num_threads(1)

# Load the Silero VAD model and its helper functions via torch.hub.
# force_reload=True re-downloads the repository on every run.
model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                              model='silero_vad',
                              force_reload=True)

(get_speech_timestamps,
 save_audio,
 read_audio,
 VADIterator,
 collect_chunks) = utils

sampling_rate = 16000  # also accepts 8000

# Specify the correct path to your audio directory here
audio_directory = "path/to/audio"

# Process all MP3 files in the directory
for filename in os.listdir(audio_directory):
    # Skip non-MP3 files and outputs written by a previous run
    if not filename.endswith(".mp3") or filename.endswith("_SPEECH.mp3"):
        continue
    audio_file = os.path.join(audio_directory, filename)
    print(f"Processing: {audio_file}")

    # Output path: <name>_SPEECH<ext>, next to the input file
    file_name_without_extension, file_extension = os.path.splitext(filename)
    speech_only_filename = f"{file_name_without_extension}_SPEECH{file_extension}"
    speechonlyfile = os.path.join(audio_directory, speech_only_filename)

    wav = read_audio(audio_file, sampling_rate=sampling_rate)
    speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=sampling_rate)

    # collect_chunks concatenates the detected segments, so there is
    # nothing to save when no speech was found.
    if not speech_timestamps:
        print(f"No speech detected: {audio_file}")
        continue
    save_audio(speechonlyfile, collect_chunks(speech_timestamps, wav), sampling_rate=sampling_rate)
```
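
Note that collect_chunks cuts the non-speech parts out, which shortens the file. If the goal is to mute them while keeping the original timing, one option is to zero out every sample that falls outside the detected segments. The following is only a rough sketch and not part of silero-vad: non_speech_intervals and mute_non_speech are hypothetical helper names, and it assumes the timestamps are the default sample-based 'start'/'end' dictionaries, sorted and non-overlapping.

```python
from typing import Dict, List, Tuple


def non_speech_intervals(speech_timestamps: List[Dict[str, int]],
                         total_samples: int) -> List[Tuple[int, int]]:
    """Return the (start, end) sample ranges that lie outside every speech segment.

    Hypothetical helper: assumes sorted, non-overlapping segments whose
    'start' and 'end' values are sample indices.
    """
    gaps = []
    cursor = 0  # first sample not yet assigned to speech or to a gap
    for segment in speech_timestamps:
        if segment['start'] > cursor:
            gaps.append((cursor, segment['start']))
        cursor = max(cursor, segment['end'])
    if cursor < total_samples:
        gaps.append((cursor, total_samples))
    return gaps


def mute_non_speech(wav, speech_timestamps):
    """Zero out non-speech samples in place.

    Assumes wav is a 1-D torch tensor or NumPy array, so that slice
    assignment with a scalar is supported.
    """
    for start, end in non_speech_intervals(speech_timestamps, len(wav)):
        wav[start:end] = 0
    return wav
```

With the script above, this would be called as mute_non_speech(wav, speech_timestamps) and the result passed to save_audio in place of collect_chunks(speech_timestamps, wav).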

adamnsandle (Collaborator) · 2024-04-09T04:50:31Z

Hi!
Silero VAD sometimes triggers on human non-speech sounds. This issue will be fixed in future releases.

A possible solution is to adjust the threshold parameter of the get_speech_timestamps method.

threshold=0.97 works fine on the audio you provided.
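
For reference, a minimal sketch of passing that threshold. The wrapper name detect_speech is hypothetical; get_speech_timestamps and its threshold and sampling_rate parameters are the ones from the script earlier in this thread. The function is passed in as an argument only so that the snippet does not depend on how the model was loaded.

```python
def detect_speech(wav, model, get_speech_timestamps,
                  threshold=0.97, sampling_rate=16000):
    """Run the VAD with a stricter speech-probability threshold.

    Hypothetical wrapper: get_speech_timestamps is the function obtained
    from the silero-vad utils tuple. A higher threshold means a window
    must score as more speech-like before it is kept.
    """
    return get_speech_timestamps(wav, model,
                                 threshold=threshold,
                                 sampling_rate=sampling_rate)
```

The value 0.97 was reported to work on the sample in this thread; other recordings may need a different value.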

orionflame (Author) · 2024-04-09T05:17:48Z

Thanks a lot, it does indeed seem to work better. Do you know when the issue might be fixed? Will it use the same method you just told me about?

EDIT: using return_seconds worked. I will try it on other audio files but so far this seems promising.

Thanks again!

orionflame (Author) · 2024-04-09T06:05:33Z

OK, it looks like the precision of the sample-based timestamps is a lot better. return_seconds only returns one digit after the decimal point, but the samples-to-seconds conversion sometimes requires multiple digits. So I guess return_seconds could be changed to return more precision? Otherwise they now seem to match.
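
One way to get more digits without relying on return_seconds is to keep the default sample-based timestamps and do the division yourself. A small sketch, assuming the usual list of 'start'/'end' dictionaries in samples; timestamps_to_seconds is a hypothetical helper name, not a library function.

```python
def timestamps_to_seconds(speech_timestamps, sampling_rate=16000, ndigits=3):
    """Convert sample-based segments to seconds, rounded to ndigits decimals.

    Hypothetical helper: ndigits=3 gives millisecond resolution.
    """
    return [
        {
            'start': round(segment['start'] / sampling_rate, ndigits),
            'end': round(segment['end'] / sampling_rate, ndigits),
        }
        for segment in speech_timestamps
    ]
```

At 16000 Hz one sample lasts 0.0000625 s, so three decimals already lose a little information; raise ndigits if exact boundaries matter.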

orionflame (Author) · 2024-04-09T06:23:12Z

However, when I use the actual audio, which is 12 minutes long, the first word, "gaussian", seems clipped, and in another word, "scalar", the "s" is dropped so it sounds like "kaler":

https://www.dropbox.com/scl/fi/old246d37lrjlvr6h2h9c/curvature-21.mp3?rlkey=1j0u9aq3wd00cyd5nmndce4wf&dl=1

You can hear this at 10:43. I checked both the save_audio output and my own mute function; they match, and both have the same issue.

Is this because of the threshold value? I think it will be hard to find a value that works for all my recordings.

Or do you think it's a bug?

adamnsandle (Collaborator) · 2024-04-09T07:29:33Z

> So I guess return_seconds could be changed to return more precision?

You can change the source code to increase the precision (the round call) in silero-vad/utils_vad.py, line 360 in b4b6f2a:

    speech_dict['start'] = round(speech_dict['start'] / sampling_rate, 1)

> Is this because of the threshold value?

Yes. In your case, I think you can also adjust the speech_pad_ms parameter to preserve more speech at the segment borders.

> Do you know when the issue might be fixed?

I believe we'll release a new version within six months.
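
If the onsets of words are still clipped after tuning speech_pad_ms, a manual alternative is to widen the detected segments afterwards, with more padding at the start than at the end. This is only a sketch of that idea, not how the library applies its own padding: pad_segments is a hypothetical helper, and it assumes sorted sample-based 'start'/'end' dictionaries.

```python
def pad_segments(speech_timestamps, total_samples, pad_start=0, pad_end=0):
    """Widen each segment by pad_start/pad_end samples and merge overlaps.

    Hypothetical helper: segments are clamped to [0, total_samples], and
    segments that touch or overlap after padding are merged into one.
    """
    padded = []
    for segment in speech_timestamps:
        start = max(0, segment['start'] - pad_start)
        end = min(total_samples, segment['end'] + pad_end)
        if padded and start <= padded[-1]['end']:
            # Overlaps the previous segment after padding: extend it instead.
            padded[-1]['end'] = max(padded[-1]['end'], end)
        else:
            padded.append({'start': start, 'end': end})
    return padded
```

For example, at 16000 Hz a pad_start of 3200 samples keeps an extra 200 ms before each segment, which may help with soft consonants at the start of a word.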

orionflame (Author) · 2024-04-09T12:35:07Z

Thanks a lot. The padding helped a bit, mostly for the word "scalar", but "gaussian" still starts abruptly, like "aussian". I will leave it for now: I have over 80 hours of audio, so it would take too long to check during editing whether the original recordings have the same issue. So far, though, this library has given the most accurate results.

orionflame (Author) · 2024-04-12T13:09:02Z

Hi,

Using 0.98 for the threshold worked for a lot of files, but for some files it doesn't detect the first word:
https://www.dropbox.com/scl/fi/sd6yrrty3inxhooqjojr3/curvature-49.mp3?rlkey=2z32ee2jlokij0p9qvsau9eb1&dl=1

I even tried 0.99 for the threshold, with the same result. So I am not sure what else I can do besides waiting for the new version. I am posting this as a sample for you, and also in case you have other ideas.

I don't mind fixing these by hand.

When I used 0.99, I noticed that some of the timings started to deviate more from the actual speech timings, so I guess a higher threshold doesn't always yield more accuracy.
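
Since no single threshold seems to fit every recording, it may help to compare a few values per file before committing to one. A rough sketch: sweep_thresholds is a hypothetical helper, and detect stands for any callable that runs the VAD at a given threshold and returns sample-based 'start'/'end' dictionaries.

```python
def sweep_thresholds(detect, thresholds):
    """Summarize the VAD output for each candidate threshold.

    Hypothetical helper: detect(threshold) must return a list of
    {'start': ..., 'end': ...} dictionaries in samples.
    """
    summary = {}
    for threshold in thresholds:
        segments = detect(threshold)
        summary[threshold] = {
            'segments': len(segments),
            # A late first start can indicate that the first word was missed.
            'first_start': segments[0]['start'] if segments else None,
            'speech_samples': sum(s['end'] - s['start'] for s in segments),
        }
    return summary
```

With the script earlier in the thread, detect could be lambda t: get_speech_timestamps(wav, model, threshold=t, sampling_rate=sampling_rate). A sudden jump in first_start or a sharp drop in speech_samples between two thresholds is a hint that real speech is being discarded.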
