
[Help]: FACodec. How to recreate demo examples for voice conversion? #161

Open
Allessyer opened this issue Mar 18, 2024 · 9 comments

@Allessyer
Problem Overview

I tried to recreate the results from the demo page for FACodec (Voice Conversion Samples), but my results are worse than the examples provided on the demo page. Why is that, and how can I achieve the same quality as the demo samples?

Steps Taken

  1. I used the code from here. I didn't change any parameters of the Encoder or Decoder; everything is as provided in the code examples.
  2. I downloaded 4 wav files (prompt and source) from the demo page.
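As a first sanity check on the downloaded files, it can help to confirm their sample rate and duration before feeding them to the codec; a minimal stdlib-only sketch (the filenames in the comment are hypothetical):

```python
import wave

def wav_info(path: str) -> tuple[int, float]:
    """Return (sample_rate, duration_seconds) of a PCM .wav file."""
    with wave.open(path, "rb") as f:
        sr = f.getframerate()
        duration = f.getnframes() / sr
    return sr, duration

# e.g. for the downloaded demo files (hypothetical names):
# print(wav_info("source_1.wav"), wav_info("prompt_1.wav"))
```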

Expected Outcome

The results of voice conversion are worse than in the examples.

Environment Information

  • Google Colab
@HeCheng0625
Collaborator

Hi, which checkpoint are you using? You can follow:

import torch
from huggingface_hub import hf_hub_download

from Amphion.models.codec.ns3_codec import FACodecEncoderV2, FACodecDecoderV2

# Same parameters as FACodecEncoder/FACodecDecoder
fa_encoder_v2 = FACodecEncoderV2(...)
fa_decoder_v2 = FACodecDecoderV2(...)

encoder_v2_ckpt = hf_hub_download(repo_id="amphion/naturalspeech3_facodec", filename="ns3_facodec_encoder_v2.bin")
decoder_v2_ckpt = hf_hub_download(repo_id="amphion/naturalspeech3_facodec", filename="ns3_facodec_decoder_v2.bin")

fa_encoder_v2.load_state_dict(torch.load(encoder_v2_ckpt))
fa_decoder_v2.load_state_dict(torch.load(decoder_v2_ckpt))
fa_encoder_v2.eval()
fa_decoder_v2.eval()

with torch.no_grad():
  enc_out_a = fa_encoder_v2(wav_a)
  prosody_a = fa_encoder_v2.get_prosody_feature(wav_a)
  enc_out_b = fa_encoder_v2(wav_b)
  prosody_b = fa_encoder_v2.get_prosody_feature(wav_b)

  vq_post_emb_a, vq_id_a, _, quantized, spk_embs_a = fa_decoder_v2(
      enc_out_a, prosody_a, eval_vq=False, vq=True
  )
  vq_post_emb_b, vq_id_b, _, quantized, spk_embs_b = fa_decoder_v2(
      enc_out_b, prosody_b, eval_vq=False, vq=True
  )

  vq_post_emb_a_to_b = fa_decoder_v2.vq2emb(vq_id_a, use_residual=False)
  recon_wav_a_to_b = fa_decoder_v2.inference(vq_post_emb_a_to_b, spk_embs_b)
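For intuition about `vq2emb(vq_id_a, use_residual=False)`: in a residual vector quantizer, later codebooks quantize the error left over by earlier ones, so dropping the residual stages discards fine acoustic detail (which, in voice conversion, is then re-supplied via the target speaker embedding passed to `inference`). A toy sketch, not the FACodec implementation, just to illustrate the mechanism:

```python
import numpy as np

# Toy two-stage residual VQ (NOT the FACodec implementation): the second
# codebook quantizes the residual left over by the first, adding finer
# detail to the reconstruction.
rng = np.random.default_rng(0)
coarse_book = rng.normal(size=(8, 4))       # stage-1 codebook
fine_book = 0.1 * rng.normal(size=(8, 4))   # stage-2 (residual) codebook

def nearest(codebook: np.ndarray, x: np.ndarray):
    """Return (index, entry) of the codebook row closest to x."""
    idx = int(np.argmin(((codebook - x) ** 2).sum(axis=1)))
    return idx, codebook[idx]

x = rng.normal(size=4)
i1, q1 = nearest(coarse_book, x)            # quantize the input
i2, q2 = nearest(fine_book, x - q1)         # quantize the residual

coarse_only = q1          # analogous to vq2emb(..., use_residual=False)
with_residual = q1 + q2   # analogous to keeping all quantizer stages
```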

@Approximetal

> Hi, which checkpoint are you using? You can follow: (code quoted from the comment above)

Hi, I tried this code, but the quality of the reconstructed wav seems poor. How should I adjust the parameters to get the best results?
FACodec_test.zip

@ATtendev

same here

@HeCheng0625
Collaborator

Hi, since our model is trained on 16 kHz English data, VC performance in other languages may not be as good as shown on the demo page.

@ATtendev

ATtendev commented Mar 21, 2024

Is it possible to train with a new language? And how can I do it?
Thanks.
@HeCheng0625

@HeCheng0625
Collaborator

Hi, you can train the codec on other languages if you have aligned phonemes and waveforms.

@wosyoo

wosyoo commented Mar 27, 2024

I used the English source and prompt provided on the demo page, but the zero-shot voice quality I generate is worse than on the demo page. May I ask why?

@RMSnow
Collaborator

RMSnow commented Apr 2, 2024

Hi @wosyoo, could you attach your input and generated samples here?

@lumpidu

lumpidu commented Apr 21, 2024

> Hi, you can train the codec with other languages if you have some aligned phonemes and waveforms.

I would love to do this; how can I? I haven't seen any training code so far. And I have to say: in my target language (Icelandic), the results with the pretrained models are really bad.
