Can this be used to mute non speech parts of an audio? #441
Hi, I have a lot of narration that I recorded myself for a tutorial, and I am trying to clean up the audio files by removing anything that is not speech, which is mostly throat clearing and the like. I attached a very short sample. I tried whisperHallu, but it sometimes cut words off halfway. All I need is to keep only the speech parts.
After this I will have to figure out a way to remove retakes. Sometimes a retake is one word and sometimes it is half a sentence repeated multiple times, but it is always the last one that should be kept. I see that this library has a remove_repetition method. I guess I have to specify a large max word argument, like 100? In the end I have to keep the last instances but move them to the time of the first instances, since that is where the correct retake belongs.
Unfortunately, using these timings doesn't remove the throat clearing parts (marked in green). A VAD visualization that uses Silero VAD, like the one in the stable-ts docs, shows that it actually eats into real speech but doesn't really remove the throat clearing. I would appreciate any tips on this. Any ideas?
Replies: 8 comments
I wrote some code and listened to the result: all the coughing and throat clearing is still there in the speech-only output. Is there a way to filter those out? I thought they wouldn't be counted as speech.
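Hi! Silero VAD sometimes triggers on human non-speech sounds; this issue will be fixed in future releases. A possible solution is to adjust the threshold parameter of the get_speech_timestamps method: threshold=0.97 works fine on the audio you provided.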
Thanks a lot, it does indeed seem to work better. Do you know when the issue might be fixed? Will the fix be the same method you just described? EDIT: using return_seconds worked. I will try it on other audio files, but so far this seems promising. Thanks again!
OK, it looks like the precision of the sample-based timestamps is a lot better. return_seconds only returns one digit after the decimal point, but the samples-to-seconds conversion sometimes requires several digits. So I guess return_seconds could be changed to return more precision? Otherwise the timings now seem to match.
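To illustrate the precision point, here is a minimal sketch of doing the conversion yourself from sample-based timestamps. It assumes the timestamps are a list of dicts with integer 'start' and 'end' keys and a 16 kHz sampling rate; the helper name samples_to_seconds and the example values are made up for illustration and are not part of the library.

```python
def samples_to_seconds(timestamps, sampling_rate=16000, ndigits=3):
    """Convert sample-based timestamps to seconds, rounded to ndigits decimals.

    `timestamps` is assumed to be a list of {'start': int, 'end': int} dicts.
    """
    return [
        {
            "start": round(ts["start"] / sampling_rate, ndigits),
            "end": round(ts["end"] / sampling_rate, ndigits),
        }
        for ts in timestamps
    ]


# Hypothetical segment: 5664 samples = 0.354 s, 31232 samples = 1.952 s.
segments = [{"start": 5664, "end": 31232}]

coarse = samples_to_seconds(segments, ndigits=1)  # one decimal place
fine = samples_to_seconds(segments, ndigits=3)    # millisecond precision
```

With one decimal place the segment boundaries can move by up to 50 ms, which is enough to clip the start of a word; three decimals keep them within half a millisecond.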
However, when I use the actual audio, which is 12 minutes long, the first word, "gaussian", seems clipped, and in another word, "scalar", the "ss" is dropped so it sounds like "kaler". You can hear this at 10:43. I checked both the save_audio output and my own mute function; they match, and both have the same issue. Is this because of the threshold value? I think it will be hard to find a value that works for all my recordings. Or do you think it's a bug?
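For readers following along, a mute function of the kind mentioned here could look roughly like the sketch below. It is not the poster's code: it assumes the audio is a plain sequence of float samples and that the speech timestamps are {'start', 'end'} dicts in samples, and the name mute_non_speech is hypothetical.

```python
def mute_non_speech(samples, speech_timestamps):
    """Return a copy of `samples` with everything outside speech set to 0.0.

    The output has the same length as the input, so speech stays at its
    original position on the timeline.
    """
    muted = [0.0] * len(samples)
    for ts in speech_timestamps:
        # Clamp to the valid range in case a segment runs past the audio.
        start = max(0, ts["start"])
        end = min(len(samples), ts["end"])
        muted[start:end] = samples[start:end]
    return muted


# Tiny made-up signal: speech at indices 1-2 and 4-5.
audio = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
speech = [{"start": 1, "end": 3}, {"start": 4, "end": 10}]
cleaned = mute_non_speech(audio, speech)
```

Because this only zeroes samples according to the timestamps, any clipped word in its output comes from the timestamps themselves rather than from the muting step.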
You can change the source code to increase the precision (the round call): line 360 in b4b6f2a.
Yes, in your case, I think you can also adjust
I believe we'll release a new version within six months.
Beta Was this translation helpful? Give feedback.
-
Thanks a lot, the padding helped a bit, mostly for the word "scalar", but "gaussian" still starts abruptly, like "aussian". I will leave it for now: I have over 80 hours of audio, so it would take too long to check while editing whether the original audio has the same issue. But so far this library has given the most accurate results.
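As a rough picture of what padding does to the detected segments, the sketch below widens each segment by a fixed number of milliseconds and merges segments that end up overlapping. This is only an illustration applied to the timestamps after detection, not the library's own implementation; the function name pad_segments and the example values are assumptions.

```python
def pad_segments(timestamps, pad_ms, total_samples, sampling_rate=16000):
    """Widen each speech segment by pad_ms on both sides and merge overlaps.

    `timestamps` is assumed to be a list of {'start': int, 'end': int} dicts
    in samples; `total_samples` is the length of the audio.
    """
    pad = int(sampling_rate * pad_ms / 1000)
    padded = []
    for ts in sorted(timestamps, key=lambda t: t["start"]):
        start = max(0, ts["start"] - pad)
        end = min(total_samples, ts["end"] + pad)
        if padded and start <= padded[-1]["end"]:
            # Overlaps the previous segment after padding: extend it instead.
            padded[-1]["end"] = max(padded[-1]["end"], end)
        else:
            padded.append({"start": start, "end": end})
    return padded


# Two made-up segments 1000 samples apart in one second of 16 kHz audio.
detected = [{"start": 1000, "end": 5000}, {"start": 6000, "end": 9000}]
small_pad = pad_segments(detected, pad_ms=10, total_samples=16000)
large_pad = pad_segments(detected, pad_ms=50, total_samples=16000)
```

A larger pad protects soft word onsets such as the "s" in "scalar", at the cost of letting more of the surrounding noise through.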
Hi, using 0.98 for the threshold worked for a lot of files, but for some files it doesn't detect the first word. I even tried 0.99, with the same result, so I'm not sure what else I can do besides waiting for the new version. I wanted to post this as a sample for you, and also in case you have other ideas. I don't mind fixing these by hand. With 0.99 I noticed that some of the audio timings started to deviate more from the actual speech timings, so I guess a higher threshold doesn't always yield more accuracy.
Hi!
Silero VAD sometimes triggers on human non-speech sounds; this issue will be fixed in future releases.
A possible solution is to adjust the threshold parameter of the get_speech_timestamps method: threshold=0.97 works fine on the audio you provided.