Examples of a good fine-tune? #65

addytheyoung · 2023-11-22T20:10:59Z

addytheyoung
Nov 22, 2023

Does anyone have an example of a good fine-tuned styletts2 model?

The only one I can find is the LJSpeech model, which sounds really good! But wondering what some other narrators / speakers would sound like, especially voices more outside the training dataset. Thanks, and awesome work on this.

Shiro836 · 2023-11-24T15:35:16Z

Shiro836
Nov 24, 2023

https://www.youtube.com/watch?v=Tuz7_7q0Pr0 Trained on this interview: https://www.youtube.com/watch?v=ozOoONmJ9EQ

16 replies

GUUser91 Apr 6, 2024

@Shiro836
I got this error when I used your script.

ImportError: 'speechbrain' must be installed to use 'speechbrain/spkrec-ecapa-voxceleb' embeddings. Visit https://speechbrain.github.io/ for installation instructions.

GUUser91 Apr 26, 2024

If you have a graphics card with 24 GB of VRAM, like the 7900 XTX or 4090, and you are on Linux, it's possible to set max_len to 280. Just use virtual console mode, and use nvtop to find and close any program eating up VRAM.

GUUser91 Apr 27, 2024

@borrero-c
How were you able to do adversarial training? I followed the instructions from your link and set joint_epoch to 48 so that it would start at epoch 49, but adversarial training never started. I also followed the instructions at this link, and it never started there either.
#227 (comment)

borrero-c Apr 27, 2024

@GUUser91 how many epochs are you training for? Are you doing two separate sessions or one?

You can run adversarial training by using the diffusion-trained model as the base model in the config.yml file. How are you doing it?

GUUser91 Apr 27, 2024

@borrero-c
I set epochs to 80 for training. This is a separate session, and I'm using the diffusion checkpoint as the base model. I tried joint_epoch at 48 and then at 49. Here is a screenshot of the training.

Edit:
I tinkered around with the config_ft.yml file: I set max_len to 120, batch_percentage to 1, slmadv_params min_len to 100, and slmadv_params max_len to 120. Batch size is set to 2. Now the DiscLM and GenLM loss stats are no longer at 0. I'm using an RX 7900 XTX.

Here is a pic from my tensorboard folder.
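
For readers who want to compare their own setup, the settings reported in the edit above can be written down and sanity-checked as in this minimal Python sketch. The values come from the post; the dictionary nesting and the `check_slm_window` helper are hypothetical, and the two checks are only a heuristic suggested by the observation that shrinking the SLM lengths to fit within max_len made the losses non-zero. It is not taken from the training code.

```python
# Values reported in the post above. The nesting is an assumption about
# how config_ft.yml is laid out, not a copy of the real file.
reported = {
    "epochs": 80,
    "batch_size": 2,
    "max_len": 120,
    "slmadv_params": {"min_len": 100, "max_len": 120, "batch_percentage": 1},
}


def check_slm_window(cfg):
    """Return a list of possible conflicts; an empty list means none were found.

    Hypothetical heuristic: the SLM length window should be consistent with
    itself and should not demand clips longer than the training max_len.
    """
    problems = []
    slm = cfg["slmadv_params"]
    if slm["min_len"] > slm["max_len"]:
        problems.append("slmadv_params min_len exceeds slmadv_params max_len")
    if slm["min_len"] > cfg["max_len"]:
        problems.append("slmadv_params min_len exceeds the training max_len")
    return problems


print(check_slm_window(reported))
```

An empty list only means these two checks pass; it does not guarantee that adversarial training will start.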

yl4579 · 2023-11-25T20:54:40Z

yl4579
Nov 25, 2023
Maintainer

These are the results I got using the default config for 50 epochs (past joint_epoch) and one hour of data: https://vocaroo.com/1aC4vr4jErDL
If we do not run the SLM adversarial training part (stop before joint_epoch), it is slightly worse: https://vocaroo.com/1hxkfwlrhowS

SLM adversarial training is the most VRAM-consuming part. I don't know how to mitigate this problem and fit it on smaller machines. Maybe techniques used for LLM fine-tuning could help, as we are working with large speech language models here.
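
As a rough illustration of "past joint_epoch" versus "stop before joint_epoch", the sketch below maps an epoch number to a training stage. It assumes a plain `epoch >= joint_epoch` comparison and a second threshold called `diff_epoch` for diffusion training; both the comparison and the `training_stage` helper are assumptions for illustration, so check the training script for the exact rule and for whether epochs are counted from 0 or 1.

```python
def training_stage(epoch, diff_epoch, joint_epoch):
    """Return the stage a given epoch falls into under the assumed thresholds.

    Assumption: each stage switches on with a simple ">=" comparison.
    """
    if epoch >= joint_epoch:
        return "slm_adversarial"
    if epoch >= diff_epoch:
        return "diffusion"
    return "base"


# Example: with joint_epoch=48, epoch 49 would fall in the adversarial stage
# under this assumption, while epoch 47 would not.
for epoch in (5, 47, 49):
    print(epoch, training_stage(epoch, diff_epoch=10, joint_epoch=48))
```

Reaching this stage is only one condition: as the replies in this thread show, the adversarial losses can still stay at 0 when other settings prevent the step from running.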

0 replies

78Alpha · 2024-01-12T03:15:47Z

78Alpha
Jan 12, 2024

Here are the results I have from 2 different models.

Aurora: 50 Epochs with joint training after 10, 8 Hours of audio, single voice. Batch Size 2, max length 220.
Chaos: 50 Epochs with joint training after 10, 10 hours of audio, 5 Hours of British audio for accent, 5 hours of depressed voice for the emotion. Batch Size 2, max length 220.

AuroraTest1.webm
AuroraTest2.webm
ChaosTest1.webm
ChaosTest2.webm

3 replies

godspirit00 Apr 28, 2024

Hello, could you please share your config? Did SLM training run during your fine-tuning? I had a problem where, even with joint_epoch smaller than epochs, SLM training did not start at any point in the fine-tuning process. I wonder what was wrong, so it'd be great if you could share something about your fine-tuning experiment. Thanks.

78Alpha May 4, 2024

No joint training was done because I left batch percentage at 0.5. It needs to be at least 1 if you're using batch size 2.

godspirit00 May 4, 2024

I tried with batch percentage 1 and batch size 2, but it didn't start either.
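
The batch percentage explanation above can be sketched as simple arithmetic. The sketch assumes that the SLM adversarial step keeps roughly batch_percentage × batch_size samples and is skipped when fewer than two remain; the two-sample minimum and both helper names are assumptions, not confirmed from the training code.

```python
import math


def slm_sample_count(batch_size, batch_percentage):
    """Samples kept for the SLM step, assuming a cap of batch_percentage * batch_size."""
    capped = math.ceil(batch_percentage * batch_size)
    return min(batch_size, max(1, capped))


def slm_step_can_run(batch_size, batch_percentage, min_samples=2):
    """True if enough samples remain; min_samples=2 is an assumed threshold."""
    return slm_sample_count(batch_size, batch_percentage) >= min_samples


print(slm_step_can_run(2, 0.5))  # one sample kept under these assumptions
print(slm_step_can_run(2, 1))    # both samples kept
```

Under these assumptions, batch size 2 with batch percentage 0.5 leaves a single sample, which matches the report that no joint training happened. It does not explain the case where batch percentage 1 also failed, so other conditions, such as clip lengths relative to slmadv_params min_len, may be involved.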

jonathandasilvasantos · 2024-04-10T01:13:06Z

jonathandasilvasantos
Apr 10, 2024

I fine-tuned the LibriTTS model on a single Brazilian Portuguese speaker, using approximately 24 hours of audio over 60 epochs.

Link: https://drive.google.com/file/d/1pBqHbIuuaO7jvMsnnpbjrsFAPcHZKr41/view?usp=sharing

I'm using PL-BERT multilingual.

Does anyone have an idea why there is this annoying noise at the end of the audio clip?

Thanks!

Jonathan S. Santos

0 replies

traderpedroso · 2024-05-23T14:10:39Z

traderpedroso
May 23, 2024

I believe it's due to the lack of a silence pad of at least 400 ms. Another thing: if the audio clips are longer than the max length, work it out from the duration in seconds and the sampling frequency. I haven't tried training in Portuguese yet; as soon as I finish the LLMs, I'll release a Portuguese checkpoint. Could you share your checkpoint?
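
A minimal sketch of both suggestions, in plain Python with no audio library: appending 400 ms of trailing silence to a clip, and converting a clip's duration to a frame count so it can be compared with max_len. The helper names are hypothetical, and the 24000 Hz sample rate and hop length of 300 used in the example are assumptions; substitute the values from your own preprocessing settings.

```python
def pad_trailing_silence(samples, sample_rate, pad_ms=400):
    """Return a copy of the samples with pad_ms of zeros appended."""
    n_pad = int(sample_rate * pad_ms / 1000)
    return list(samples) + [0.0] * n_pad


def clip_frames(duration_s, sample_rate, hop_length):
    """Approximate number of spectrogram frames for a clip of duration_s seconds."""
    return int(duration_s * sample_rate) // hop_length


# Assumed example values: 24 kHz audio and a hop length of 300 samples.
one_second = [0.1] * 24000
padded = pad_trailing_silence(one_second, sample_rate=24000)
print(len(padded) - len(one_second))  # number of silent samples added

frames = clip_frames(5.0, sample_rate=24000, hop_length=300)
print(frames, frames > 120)  # compare against a max_len of 120 frames
```

Clips whose frame count exceeds max_len would be cropped or need splitting, depending on how the data loader handles them.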

0 replies
