
it would be interesting to let the model make other speech sounds. like laughing #35

Open
Manni1000 opened this issue Feb 10, 2024 · 20 comments
Labels: feature request (New feature or request)

Comments

@Manni1000 commented Feb 10, 2024

Bark also did this, and it is quite helpful.

We could use semantic tags like these for the sounds (see the example prompt after the list):
[laughter]
[laughs]
[sighs]
[gasps]
[clears throat]
— or ... for hesitations
♪ for song lyrics
CAPITALIZATION for emphasis of a word
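
For illustration, a Bark-style prompt using these tags might look like the line below; whether any given model honours them depends entirely on such tags appearing in its training data:

```python
# Illustrative only: Bark-style tags from the list above, not something
# this model currently understands out of the box.
prompt = "Well... [clears throat] I did NOT see that coming. [laughs]"
```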

Maybe other emotion words would also be interesting, like sad / happy.

But it might be too much work. Do you think it would be possible to add something like this through fine-tuning?

@sidroopdaska (Contributor)

We will release finetuning code soon. Would love the community to push this work forward :)
And we're of course happy to assist along the way.

@l4b4r4b4b4 (Contributor)

And breathing. That scene from the movie "Her" 😍

@vatsalaggarwal (Contributor)

I've added some initial pointers to this here: #70 (comment)

@maepopi commented Mar 1, 2024

I totally agree with the need to hint and train towards non-verbal sounds. I think using a semantic tag like [laugh], [sigh] or [shushing] is better than literal letters such as [hahaha] or [hhh] for a sigh, or even "Shhh" for shushing, because there tends to be bleeding between the concepts in that case. It would really be awesome to be able to infer emotions or reactions this way!

By the way, I've just tested simple cloning with only the base model, and I must say it's already quite good! I have a challenging speaker, so there's still room for improvement, but I can't wait for finetuning to be out! Very promising, thank you!

@l4b4r4b4b4 (Contributor)

> I totally agree with the need to hint and train towards non-verbal sounds. […]

Hmm, that could be one way: have the input include special tokens / words for those, or have a trainable preprocessing model insert them, or simply have the TTS model learn them from the given audio sample. And I actually prefer the latter ;)

@maepopi commented Mar 1, 2024

> Hmm, that could be one way: […] or simply have the TTS model learn them from the given audio sample. And I actually prefer the latter ;)

Hey! I still consider myself a novice in the field. Do you mean that we should caption the audios with the given sounds (like "haha", "shh", "hhh") and then train the model on that? That's what I've tried with this repo here (which is really good, by the way, and based on Tortoise-TTS). To finetune a model there, you provide audios and a JSON file with the transcription of your audios, and in the transcription you label the non-verbal sounds to teach the model to recognize them. It actually works very well, but not for all sounds; I've been struggling with the sigh, for example. I've tried labeling it as "haaa" or "hhhh", but it often gets confused with "shh" or "haha". That's why I was thinking that using [laugh] or [sigh] instead of a literal phonetic transcription might work better.
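
For concreteness, here is a hypothetical pair of dataset entries showing the two labeling styles. The field names are illustrative only and don't match any particular repo's schema:

```python
# Hypothetical manifest entries: the same clip transcribed two ways.
phonetic_entry = {
    "audio": "clips/0001.wav",
    "text": "Hhhh... I suppose you're right.",  # sigh rendered phonetically
}
token_entry = {
    "audio": "clips/0001.wav",
    "text": "[sigh] I suppose you're right.",   # sigh as a semantic tag
}
```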

What do you think?

@vatsalaggarwal (Contributor)

Yeah, this would be great, and we would love to do this! We're focusing on a few more fundamental model improvements that would be hard for the community to manage, and I think folks over at #70 are close to having finetuning working... We can try it once that's up and running!

It's hard to say how well these things would work without looking at the data first, but I reckon having special tokens for "laughter" / "sigh" / etc. might work better than using a prompt like "haha" or "shh"... If someone can share the data they're thinking of training with, I can comment more :)
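
If the model's first stage uses a standard subword tokenizer, registering such tags as dedicated special tokens might look like the sketch below (using the Hugging Face transformers API with gpt2 as a stand-in; MetaVoice's actual tokenizer and embedding setup may differ):

```python
# Minimal sketch: register non-verbal tags as special tokens so they are
# never split into subwords, then grow the embedding matrix to match.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

SOUND_TAGS = ["[laugh]", "[sigh]", "[gasp]", "[clears throat]"]
tokenizer.add_special_tokens({"additional_special_tokens": SOUND_TAGS})

# The new embedding rows start randomly initialised, so they only become
# meaningful after finetuning on data that actually uses these tags.
model.resize_token_embeddings(len(tokenizer))
```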

@maepopi commented Mar 4, 2024

I can cook up a little sample with some sentences containing non-verbal sounds, to show how I've been training until now, and write a version of how I think it would be better to train. Don't know if it'll help, but I can try putting this out this week :)

@vatsalaggarwal (Contributor)

@maepopi that would be awesome, and would help for sure!

@maepopi commented Mar 4, 2024

Do you need a specific number of audio files / JSON entries, or are a few examples of each non-verbal sound enough?

@vatsalaggarwal (Contributor)

To have a look, a few examples should be enough... for training, we'll probably need more!

@maepopi commented Mar 4, 2024

Okay. My dataset comes from a video game / an audiobook, so it's not free of rights, and I don't know if I can share it in full here.

@vatsalaggarwal (Contributor) commented Mar 4, 2024

Feel free to email me at vatsal@themetavoice.xyz with whatever you can share / if you can share!

@maepopi commented Mar 4, 2024

Ok thanks! I'll see what I can do!

@maepopi commented Mar 6, 2024

Hey @vatsalaggarwal, I've sent you a small dataset with two JSONs: one with phonetic transcription and another with token transcription. As I said in my email, I'll sum up my thoughts here so others can jump in.

While transcribing with tokens such as [sigh] and [laugh], I quickly noticed that it might sometimes be better to transcribe phonetically. I'm thinking of sentences such as:
"Ah, there you are!"

or

"Oh, really?"

Where "Ah" and "Oh" actually act more like words than non verbal sounds.

There are also a lot of cases where you might want some control over the sound you want to generate. For instance, some sighs are longer than others, or convey a different feeling: nostalgia, pain, or boredom. Likewise for "Hmm", which can convey thinking but also relishing something you're eating. In these cases, maybe it would be a good idea to give more nuance to the token, with options such as [pained sigh] or [nostalgic sigh], but that might confuse the model more, especially if it leaves just a few isolated examples in the whole dataset. You could technically have ten [sigh] tokens, but if you choose to distinguish between them, you might find yourself with ten new individual concepts to train, each with little representation elsewhere in the dataset. On the other hand, gathering all these sounds under a single [sigh] token would more often generate an actual sigh, but you would lose a lot of control over the type of sigh you want.

Maybe a way to fix this would be to also add emotion tokens, such as [sad] or [happy], and combine them with the non-verbal token. That is something the Tortoise-TTS model does: you can write [I am really sad,] at the beginning of your prompt, and the model will try to give a sad intonation to the generated sentence. For emotions or a loud intonation, I also tried labeling words in capital letters in Tortoise-TTS, and it seemed to work rather well, so that's something we could investigate as well. Maybe these could be LoRAs?
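
To make that concrete, a combined prompt might look something like the line below. This is purely hypothetical: none of these conventions are supported out of the box, and each would need matching examples in the training data:

```python
# Hypothetical: Tortoise-style emotion prefix + semantic sound tag
# + capitalization for emphasis, composed in a single prompt.
prompt = "[I am really sad,] I just... I MISS him, you know? [sigh]"
```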

In the end, I think it might be best to make the model flexible enough to recognize both types of labeling: phonetic and token. This way, you can first try to train / generate the phonetic way, and if you see bleeding or the model doesn't capture the sound well, you can try tokens instead.
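
As a sketch of that flexibility, one could normalize common phonetic renderings into canonical tags during preprocessing, so both labeling styles converge on the same representation. The mapping and tag names below are assumptions, not an agreed convention:

```python
import re

# Assumed mapping from phonetic renderings to canonical tags.
PHONETIC_TO_TAG = {
    r"\bha(ha)+\b": "[laugh]",     # "hahaha" and friends
    r"\bsh+h\b": "[shushing]",     # "shhh"
    r"\bh(a+|h+)\b": "[sigh]",     # "haaa" / "hhhh"
}

def normalize_transcript(text: str) -> str:
    """Replace phonetic sound renderings with canonical tags."""
    for pattern, tag in PHONETIC_TO_TAG.items():
        text = re.sub(pattern, tag, text, flags=re.IGNORECASE)
    return text

print(normalize_transcript("Hahaha, shhh, quiet now."))
# -> "[laugh], [shushing], quiet now."
```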

Anyway, sorry if my thoughts are a bit messy; I just wanted to share them here as well, because all of this is obviously very empirical on my side. Very excited to be part of the conversation though!

@vatsalaggarwal added the feature request label on Mar 12, 2024
@vatsalaggarwal (Contributor)

Sorry for the delay here @maepopi ... @lucapericlp should have the finetuning code (on top of @danablend's) out today, and we can take it further once that is done!

@maepopi commented Mar 12, 2024

No problem at all! Keep us posted, can't wait to see where you're going with this :)

@lucapericlp (Contributor) commented Mar 14, 2024

Hey @l4b4r4b4b4 @maepopi @Manni1000, we just released an initial approach for finetuning the last N transformer blocks of the first-stage LLM. Note that it's best to play around with the hyperparams in finetune_params.py, as we didn't determine the optimal set (some people from the community were keen to contribute this portion). Let us know if you have any issues or if you're up for contributing improvements (via a param sweep or otherwise)!
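
For anyone picking up the param sweep, a minimal grid-sweep skeleton might look like this. The parameter names here are placeholders; check finetune_params.py for the actual knobs:

```python
import itertools

# Placeholder grid; substitute the real fields from finetune_params.py.
learning_rates = [1e-5, 3e-5, 1e-4]
last_n_blocks = [2, 4, 8]

for lr, n in itertools.product(learning_rates, last_n_blocks):
    print(f"would launch finetune run: lr={lr}, last_n_blocks={n}")
    # ...invoke the released finetuning script with these values here.
```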

The next step to improve finetuning effectiveness is LoRA adapters for the first-stage LLM, which is being worked on here.
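
If the adapters follow the usual recipe, the shape might resemble this sketch with the Hugging Face peft library (gpt2 stands in for the first-stage LLM, and the target module name is specific to gpt2's layer naming; the actual MetaVoice integration may look quite different):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# gpt2 as a stand-in for the first-stage LLM.
base_model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    r=8,                         # low-rank dimension
    lora_alpha=16,               # scaling factor
    target_modules=["c_attn"],   # gpt2's fused attention projection
    lora_dropout=0.05,
)
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # only the adapter weights train
```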

@maepopi commented Mar 14, 2024

Thank you so much! I'll try to have a look at this this weekend!

@kabachuha

@lucapericlp does this approach support adding new tokens to the vocabulary?
