About batch inference for multi-speakers #70

Closed · isjwdu opened this issue Apr 25, 2024 · 1 comment · Fixed by #74

Comments
isjwdu commented Apr 25, 2024

Hello, thank you for your great work.

I would like to ask two questions:

  1. Batch inference of different sentences from the same speaker.
    I am using --file to read a txt file containing multiple lines (4 lines, for example), and the following error is raised during inference:
File "/mnt/E/isjwdu/Matcha-TTS/matcha/models/components/text_encoder.py", line 403, in forward
     x = torch.cat([x, spks.unsqueeze(-1).repeat(1, 1, x.shape[-1])], dim=1)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 4 but got size 1 for tensor number 1 in the list.

Is the original code set up to read only a single line at a time? Is there a recommended way to run inference on multiple sentences in a batch?
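
For reference, here is a minimal standalone repro of the mismatch (the tensor shapes are made up for illustration and are not taken from the actual model config):

```python
import torch

# x: encoder output for a batch of 4 sentences; spks: a single speaker
# embedding with batch size 1 (shapes chosen only for illustration).
x = torch.randn(4, 192, 50)   # (batch, channels, frames)
spks = torch.randn(1, 64)     # (1, spk_emb_dim)

# This mirrors line 403 of text_encoder.py and raises the same error,
# because the two tensors disagree in the batch dimension (4 vs 1):
#   torch.cat([x, spks.unsqueeze(-1).repeat(1, 1, x.shape[-1])], dim=1)

# Repeating the speaker embedding across the batch makes the shapes line up:
spks = spks.repeat(x.shape[0], 1)                                    # (4, 64)
out = torch.cat([x, spks.unsqueeze(-1).repeat(1, 1, x.shape[-1])], dim=1)
print(out.shape)              # torch.Size([4, 256, 50])
```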

  2. For batch inference from a single txt file containing different speakers and different sentences, do you have any suggestions or tips on how to modify the code?

For example, my txt file:

p329-016|p329|the norsemen considered the rainbow as a bridge over which the gods passed from earth to their home in the sky.
p316-091|p316|there was no bad behavior.

I want to synthesize a separate audio file for each line, using its corresponding speaker.

Looking forward to your reply.

@shivammehta25 (Owner) commented
Hello,
I have fixed multi-speaker batched synthesis.

For the 2nd part, I have not yet added any code to support batched inference with different speakers. However, it should be relatively simple: all you need to do is extract the texts and speaker ids from the input file and stack them. I don't plan to merge it into the codebase, but let me know if you need any help with it; I can add it to your fork or something.
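
A rough sketch of that idea, assuming a filelist in the utt_id|speaker_id|text format shown above; the filename, the character-level tokenizer, the "p329" → 329 speaker-id mapping, and the commented-out `model.synthesise` call are all placeholders to adapt to your setup, not code from this repo:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def load_filelist(path):
    """Parse lines of the form utt_id|speaker_id|text."""
    entries = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            utt_id, spk, text = line.strip().split("|", maxsplit=2)
            entries.append((utt_id, int(spk.lstrip("p")), text))  # "p329" -> 329; adapt to your id map
    return entries

entries = load_filelist("multi_speaker.txt")  # placeholder filename

# Dummy character-level tokenizer so the sketch runs standalone; swap in the
# real text-to-phoneme-id processing from your inference script.
def tokenize(text):
    return torch.tensor([ord(c) % 100 for c in text], dtype=torch.long)

tokens  = [tokenize(text) for _, _, text in entries]
lengths = torch.tensor([t.shape[0] for t in tokens])
batch   = pad_sequence(tokens, batch_first=True)        # (num_utts, max_len)
spks    = torch.tensor([spk for _, spk, _ in entries])  # (num_utts,), one id per row

print(batch.shape, lengths, spks)
# output = model.synthesise(batch, lengths, spks=spks, ...)  # placeholder call
# ...then write one wav per utt_id so each speaker gets its own file.
```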

Kind Regards,
Shivam
