Loud white noise on short texts with embedding_scale > 1 #46

AWAS666 · 2023-11-20T22:09:57Z

If I give it a short text like "Hello how are you?" it generates a 25-second clip of extremely loud white noise.
I tried this on the libritts model only, it also happens on my own model which I finetuned based on it.

AWAS666 · 2023-11-20T23:24:47Z

Nvm, it also seems to happen on single words. Setting both alpha and beta to zero makes it return to normal, though.

yl4579 · 2023-11-21T02:05:50Z

This is not supposed to occur. Setting alpha and beta to 0 means not using the diffusion model at all. What are your package versions?

AWAS666 · 2023-11-21T15:37:20Z

Windows 10
Python 3.11.4
Torch 2.1.0+cu118

And loading the model to the GPU instead of CPU, but I'll do some further testing to narrow it down
This doesn't seem to make a difference...

yl4579 · 2023-11-21T16:34:54Z

Could you make a conda environment with Python 3.10 instead? You can run the colab demo and check package versions there and make sure you install these packages instead.

AWAS666 · 2023-11-21T16:49:23Z

> Could you make a conda environment with Python 3.10 instead? You can run the colab demo and check package versions there and make sure you install these packages instead.

I tried 3.10 locally, but exactly the same issue.

AWAS666 · 2023-11-21T16:54:02Z

Tried it in Colab as well. If I put in just the word "wink" as the inference text, it gives me bad white noise.

yl4579 · 2023-11-21T18:18:41Z

I think it could be because there was no such sample in the training data. The model has never seen a single word during training (because we removed speech shorter than one second).

AWAS666 · 2023-11-21T18:37:15Z

Is there a way around it without retraining it, like dropping the diffusion model on short inputs?
Otherwise that likely means having to retrain, right?

yl4579 · 2023-11-21T18:56:29Z

You can add some filler words before or after the word you want to speak and cut the audio to only get the word you are interested in.

AWAS666 · 2023-11-21T19:02:50Z

Not a pretty solution either :)
But at least it isn't my setup alone.

easyrider · 2023-11-21T19:19:04Z

I got the same problem. It's probably not related to the sentence length but to certain first words in the sentence, like "you" or "me". For example, "You can do that too." [58] produces white noise on Colab.

So, for example, if a longer text contains a sentence starting with "You will need ..", this will corrupt the audio afterwards.
But as mentioned, alpha = 0, beta = 0 fixes that.

yl4579 · 2023-11-21T20:04:48Z

@easyrider I have tried "You can do that too." on Colab and was able to synthesize the speech in any voice.
https://vocaroo.com/1f8Rpq84L8H4
https://vocaroo.com/110LoHbYIP9Y
https://vocaroo.com/155vjtpiSYLO
https://vocaroo.com/19lqIdQEM9uJ (LJSpeech)

yl4579 · 2023-11-21T20:07:49Z

@AWAS666 It still works even with the single word "wink." (with a period). It did generate noise when there was no punctuation after the word. I think this is caused by the training data again, where all sentences end with some sort of punctuation.
https://vocaroo.com/1728QKrk6PSU
https://vocaroo.com/1iptqIqXNtRj
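
The punctuation observation above suggests a small pre-processing guard before inference. The following is a minimal sketch, not part of StyleTTS2: the helper name `ensure_terminal_punctuation` and the set of accepted end marks are assumptions chosen for illustration.

```python
# Hypothetical helper (not a StyleTTS2 API): make sure the text passed to
# inference ends with sentence-final punctuation, since the thread reports
# noise when the trailing punctuation is missing.
TERMINAL_PUNCTUATION = (".", "!", "?", "…")  # assumed set, adjust as needed


def ensure_terminal_punctuation(text: str, default: str = ".") -> str:
    stripped = text.strip()
    if not stripped:
        return stripped
    # Ignore closing quotes/brackets when checking for an end mark,
    # so 'He said "go."' is left untouched.
    core = stripped.rstrip("\"')]")
    if core.endswith(TERMINAL_PUNCTUATION):
        return stripped
    return stripped + default
```

Calling `ensure_terminal_punctuation("wink")` yields `"wink."`, while text that already ends in punctuation is returned unchanged apart from surrounding whitespace.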

AWAS666 · 2023-11-22T11:28:59Z

> @AWAS666 It still works even with the single word "wink." (with a period). It did generate noise when there was no punctuation after the word. I think this is caused by the training data again, where all sentences end with some sort of punctuation.
> https://vocaroo.com/1728QKrk6PSU
> https://vocaroo.com/1iptqIqXNtRj

You are correct, adding the fullstop helps fix it.
But only if you have embedding scale at 1, as soon as you raise that, it does it again :)
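
One way to act on this without retraining is to fall back to `embedding_scale = 1` whenever the input is short. The sketch below only illustrates that selection logic; `SamplingSettings`, `settings_for`, and the `min_chars` threshold are hypothetical names and values, not part of the StyleTTS2 code, and the threshold would need tuning by ear.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SamplingSettings:
    # Mirrors the three knobs discussed in this thread; the container
    # itself is a hypothetical convenience, not a StyleTTS2 type.
    alpha: float
    beta: float
    embedding_scale: float


def settings_for(text: str, preferred: SamplingSettings,
                 min_chars: int = 40) -> SamplingSettings:
    """Return `preferred` for long inputs, and a copy with
    embedding_scale forced to 1.0 for short ones.

    `min_chars` is an assumed cut-off, not a documented value.
    """
    if len(text.strip()) >= min_chars:
        return preferred
    return replace(preferred, embedding_scale=1.0)
```

The returned values would then be passed to whatever inference function is in use. If noise persists even at scale 1, setting `alpha` and `beta` to 0 for short inputs is the other fallback reported above, at the cost of skipping the diffusion model.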

yl4579 · 2023-11-22T17:12:24Z

This is a very interesting issue. During training the guidance scale is 1, and for some reason when the input is small it fails to generalize to higher guidance scale. I think probably during training we may have to vary the guidance scale randomly from 1 to 2 then? I will try to do this just for the 2nd stage and see if the problem disappears.
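
To make the idea concrete, here is a rough sketch of sampling a guidance scale per training step. It assumes the standard classifier-free guidance combination and uses hypothetical function names; it is not the actual StyleTTS2 training code, and where such a scale would enter the second-stage training loop is left open.

```python
import random
from typing import Optional


def sample_guidance_scale(low: float = 1.0, high: float = 2.0,
                          rng: Optional[random.Random] = None) -> float:
    """Draw a guidance scale uniformly from [low, high] for one step."""
    rng = rng or random.Random()
    return rng.uniform(low, high)


def guided(uncond: float, cond: float, scale: float) -> float:
    """Standard classifier-free guidance combination (assumed form):
    scale == 1 returns the conditional output unchanged, while
    scale > 1 extrapolates away from the unconditional output."""
    return uncond + scale * (cond - uncond)
```

With `scale = 1` the guided output equals the conditional one, which matches the training setting described above; drawing the scale from 1 to 2 would expose the model to the extrapolated regime as well.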

AWAS666 · 2023-11-22T18:40:43Z

Great to hear

fivestones · 2023-11-27T07:47:46Z

I have the same problem specifically for short sentences/phrases (all with punctuation) running on macOS on an M2. I noticed that it seems to be more likely when the sentence length is less than about 40 characters. I was already doing TTS on long-form audio, so I wrote a script that splits the text into sentences, but if any sentence is less than 40 characters it attaches it to the previous or next sentence. This way every block of text I process with StyleTTS2 is longer than 40 characters. That fixed the problem entirely for me. I didn't make any changes to punctuation and didn't change any of the words in the text.
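
The script itself was not posted, so the following is only a sketch of the approach as described: split on sentence-final punctuation, then merge any block shorter than a threshold into a neighbour. The function names and the simple regex splitter are assumptions; a real pipeline might use a proper sentence tokenizer.

```python
import re


def split_sentences(text: str) -> list[str]:
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def merge_short_blocks(sentences: list[str], min_chars: int = 40) -> list[str]:
    """Merge sentences so that each block has at least `min_chars`
    characters where possible. Words and punctuation are not altered."""
    blocks: list[str] = []
    for sentence in sentences:
        if blocks and len(blocks[-1]) < min_chars:
            # Previous block is still too short: attach this sentence to it.
            blocks[-1] = blocks[-1] + " " + sentence
        else:
            blocks.append(sentence)
    # A short trailing block has no next sentence, so attach it backwards.
    if len(blocks) > 1 and len(blocks[-1]) < min_chars:
        last = blocks.pop()
        blocks[-1] = blocks[-1] + " " + last
    return blocks
```

Each resulting block would be synthesized separately and the audio concatenated. Note that a lone input shorter than the threshold cannot be fixed this way and is returned as a single short block.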

yl4579 closed this as completed Nov 21, 2023

yl4579 reopened this Nov 21, 2023

yl4579 added the bug label Nov 22, 2023

Akito-UzukiP pushed a commit to Akito-UzukiP/StyleTTS2 that referenced this issue Jan 13, 2024

Update train_ms.py (yl4579#46)

85a467c

Fix error when resuming training
