
it would be interesting to let the model make other speech sounds. like laughing #35

Open
Manni1000 opened this issue Feb 10, 2024 · 20 comments
Labels: feature request (New feature or request)

Comments

@Manni1000 commented Feb 10, 2024

Bark also did this, and it is quite helpful.

We could use semantic tags like these for the sounds (see the example prompt after the list):
[laughter]
[laughs]
[sighs]
[gasps]
[clears throat]
— or ... for hesitations
♪ for song lyrics
CAPITALIZATION for emphasis of a word
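
For illustration, a Bark-style prompt using these tags might look like the line below; whether any given model honours them depends entirely on such tags appearing in its training data:

```python
# Illustrative only: Bark-style tags from the list above, not something
# this model currently understands out of the box.
prompt = "Well... [clears throat] I did NOT see that coming. [laughs]"
```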

Maybe other emotion words would also be interesting, like sad / happy.

But it might be too much work. Do you think it would be possible to add something like this through fine-tuning?

@sidroopdaska (Contributor)

We will release finetuning code soon. Would love the community to push this work forward :)
And we're of course happy to assist along the way.

@l4b4r4b4b4 (Contributor)

And breathing. That scene from the movie "Her" 😍

@vatsalaggarwal (Contributor)

I've added some initial pointers to this here: #70 (comment)

@maepopi commented Mar 1, 2024

I totally agree with the need to hint and train towards non-verbal sounds. I think using a semantic tag like [laugh], [sigh] or [shushing] is better than literal letters such as [hahaha] or [hhh] for a sigh, or even "Shhh" for shushing, because there tends to be bleeding between the concepts in that case. It would really be awesome to be able to infer emotions or reactions this way!

By the way, I've just tested simple cloning with only the base model, and I must say it's already quite good! I have a challenging speaker, so there's still room for improvement, but I can't wait for finetuning to be out! Very promising, thank you!

@l4b4r4b4b4 (Contributor)

> I totally agree with the need to hint and train towards non-verbal sounds. […]

Hmm, that could be one way: have the input include special tokens / words for those, or have a trainable preprocessing model insert them, or simply have the TTS model learn them from the given audio sample. And I actually prefer the latter ;)

@maepopi commented Mar 1, 2024

> Hmm, that could be one way: […] or simply have the TTS model learn them from the given audio sample. And I actually prefer the latter ;)

Hey! I still consider myself a novice in the field. Do you mean that we should caption the audios with the given sounds (like "haha", "shh", "hhh") and then train the model on that? That's what I've tried with this repo here (which is really good, by the way, and based on Tortoise-TTS). To finetune a model there, you provide audios and a JSON file with the transcription of your audios, and in the transcription you label the non-verbal sounds to teach the model to recognize them. It actually works very well, but not for all sounds; I've been struggling with the sigh, for example. I've tried labeling it as "haaa" or "hhhh", but it often gets confused with "shh" or "haha". That's why I was thinking that using [laugh] or [sigh] instead of a literal phonetic transcription might work better.
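
For concreteness, here is a hypothetical pair of dataset entries showing the two labeling styles. The field names are illustrative only and don't match any particular repo's schema:

```python
# Hypothetical manifest entries: the same clip transcribed two ways.
phonetic_entry = {
    "audio": "clips/0001.wav",
    "text": "Hhhh... I suppose you're right.",  # sigh rendered phonetically
}
token_entry = {
    "audio": "clips/0001.wav",
    "text": "[sigh] I suppose you're right.",   # sigh as a semantic tag
}
```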

What do you think?

@vatsalaggarwal (Contributor)

Yeah, this would be great, and we would love to do this! We're focusing on a few more fundamental model improvements that would be hard for the community to manage, and I think folks over at #70 are close to having finetuning working... We can try it once that's up and running!

It's hard to say how well these things would work without looking at the data first, but I reckon having special tokens for "laughter" / "sigh" / etc. might work better than using a prompt like "haha" or "shh"... If someone can share the data they're thinking of training with, I can comment more :)
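
If the model's first stage uses a standard subword tokenizer, registering such tags as dedicated special tokens might look like the sketch below (using the Hugging Face transformers API with gpt2 as a stand-in; MetaVoice's actual tokenizer and embedding setup may differ):

```python
# Minimal sketch: register non-verbal tags as special tokens so they are
# never split into subwords, then grow the embedding matrix to match.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

SOUND_TAGS = ["[laugh]", "[sigh]", "[gasp]", "[clears throat]"]
tokenizer.add_special_tokens({"additional_special_tokens": SOUND_TAGS})

# The new embedding rows start randomly initialised, so they only become
# meaningful after finetuning on data that actually uses these tags.
model.resize_token_embeddings(len(tokenizer))
```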

@maepopi commented Mar 4, 2024

I can cook up a little sample with some sentences containing non-verbal sounds, to show how I've been training until now, and write a version of how I think it would be better to train. Don't know if it'll help, but I can try putting this out this week :)

@vatsalaggarwal (Contributor)

@maepopi that would be awesome, and would help for sure!

@maepopi commented Mar 4, 2024

Do you need a specific number of audio files / JSON entries, or are a few examples of each non-verbal sound enough?

@vatsalaggarwal (Contributor)

To have a look, a few examples should be enough... for training, we'll probably need more!

@maepopi commented Mar 4, 2024

Okay. My dataset comes from a video game / an audiobook, so it's not free of rights, and I don't know if I can share it in full here.

@vatsalaggarwal (Contributor) commented Mar 4, 2024

Feel free to email me at vatsal@themetavoice.xyz with whatever you can share / if you can share!

@maepopi commented Mar 4, 2024

Ok thanks! I'll see what I can do!

@maepopi commented Mar 6, 2024

Hey @vatsalaggarwal, I've sent you a small dataset with two JSONs: one with phonetic transcription and another with token transcription. As I said in my email, I'll sum up my thoughts here so others can jump in.

While transcribing with tokens such as [sigh] and [laugh], I quickly noticed that it might sometimes be better to transcribe phonetically. I'm thinking of sentences such as:
"Ah, there you are!"

or

"Oh, really?"

Where "Ah" and "Oh" actually act more like words than non verbal sounds.

There are also a lot of cases where you might want some control over the sound you want to generate. For instance, some sighs are longer than others, or convey a different feeling: nostalgia, pain, or boredom. Likewise for "Hmm", which can convey thinking but also relishing something you're eating. In these cases, maybe it would be a good idea to give more nuance to the token, with options such as [pained sigh] or [nostalgic sigh], but that might confuse the model more, especially if it leaves just a few isolated examples in the whole dataset. You could technically have ten [sigh] tokens, but if you choose to distinguish between them, you might find yourself with ten new individual concepts to train, each with little representation elsewhere in the dataset. On the other hand, gathering all these sounds under a single [sigh] token would more often generate an actual sigh, but you would lose a lot of control over the type of sigh you want.

Maybe a way to fix this would be to also add emotion tokens, such as [sad] or [happy], and combine them with the non-verbal token. That is something the Tortoise-TTS model does: you can write [I am really sad,] at the beginning of your prompt, and the model will try to give a sad intonation to the generated sentence. For emotions or a loud intonation, I also tried labeling words in capital letters in Tortoise-TTS, and it seemed to work rather well, so that's something we could investigate as well. Maybe these could be LoRAs?
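
To make that concrete, a combined prompt might look something like the line below. This is purely hypothetical: none of these conventions are supported out of the box, and each would need matching examples in the training data:

```python
# Hypothetical: Tortoise-style emotion prefix + semantic sound tag
# + capitalization for emphasis, composed in a single prompt.
prompt = "[I am really sad,] I just... I MISS him, you know? [sigh]"
```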

In the end, I think it might be best to make the model flexible enough to recognize both types of labeling: phonetic and token. This way, you can first try to train / generate the phonetic way, and if you see bleeding or the model doesn't capture the sound well, you can try tokens instead.
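
As a sketch of that flexibility, one could normalize common phonetic renderings into canonical tags during preprocessing, so both labeling styles converge on the same representation. The mapping and tag names below are assumptions, not an agreed convention:

```python
import re

# Assumed mapping from phonetic renderings to canonical tags.
PHONETIC_TO_TAG = {
    r"\bha(ha)+\b": "[laugh]",     # "hahaha" and friends
    r"\bsh+h\b": "[shushing]",     # "shhh"
    r"\bh(a+|h+)\b": "[sigh]",     # "haaa" / "hhhh"
}

def normalize_transcript(text: str) -> str:
    """Replace phonetic sound renderings with canonical tags."""
    for pattern, tag in PHONETIC_TO_TAG.items():
        text = re.sub(pattern, tag, text, flags=re.IGNORECASE)
    return text

print(normalize_transcript("Hahaha, shhh, quiet now."))
# -> "[laugh], [shushing], quiet now."
```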

Anyway, sorry if my thoughts are a bit messy; I just wanted to share them here as well, because all of this is obviously very empirical on my side. Very excited to be part of the conversation though!

@vatsalaggarwal added the feature request label on Mar 12, 2024
@vatsalaggarwal (Contributor)

Sorry for the delay here @maepopi ... @lucapericlp should have the finetuning code (on top of @danablend's) out today, and we can take it further once that is done!

@maepopi commented Mar 12, 2024

No problem at all! Keep us posted, can't wait to see where you're going with this :)

@lucapericlp (Contributor) commented Mar 14, 2024

Hey @l4b4r4b4b4 @maepopi @Manni1000, we just released an initial approach for finetuning the last N transformer blocks of the first-stage LLM. Note that it's best to play around with the hyperparams in finetune_params.py, as we didn't determine the optimal set (some people from the community were keen to contribute this portion). Let us know if you have any issues or if you're up for contributing improvements (via a param sweep or otherwise)!
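
For anyone picking up the param sweep, a minimal grid-sweep skeleton might look like this. The parameter names here are placeholders; check finetune_params.py for the actual knobs:

```python
import itertools

# Placeholder grid; substitute the real fields from finetune_params.py.
learning_rates = [1e-5, 3e-5, 1e-4]
last_n_blocks = [2, 4, 8]

for lr, n in itertools.product(learning_rates, last_n_blocks):
    print(f"would launch finetune run: lr={lr}, last_n_blocks={n}")
    # ...invoke the released finetuning script with these values here.
```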

The next step to improve finetuning effectiveness is LoRA adapters for the first-stage LLM, which is being worked on here.
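
If the adapters follow the usual recipe, the shape might resemble this sketch with the Hugging Face peft library (gpt2 stands in for the first-stage LLM, and the target module name is specific to gpt2's layer naming; the actual MetaVoice integration may look quite different):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# gpt2 as a stand-in for the first-stage LLM.
base_model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    r=8,                         # low-rank dimension
    lora_alpha=16,               # scaling factor
    target_modules=["c_attn"],   # gpt2's fused attention projection
    lora_dropout=0.05,
)
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # only the adapter weights train
```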

@maepopi commented Mar 14, 2024

Thank you so much! I'll try to have a look at this this weekend!

@kabachuha

@lucapericlp does this approach support adding new tokens to the vocabulary?
