Different tones on different parts of the text #490

Open
porky11 opened this issue May 4, 2024 · 4 comments
Comments

porky11 commented May 4, 2024

It would be nice if it were possible to apply specific tones to single words or whole sentences.

For example:

  • emphasis
  • sarcasm
  • whispering

(and combinations of these, like whispered sarcasm)

Is this already possible somehow?
I saw that espeak generates emphasis markers anyway; maybe these could be set manually in some way?

Alternatively, I could probably train and use different variants of a voice. But it seems it's not possible to switch voices without introducing pauses, even with "sentence_silence" set to 0. Still, this is probably the best workaround so far.

It would still be nice if such a feature existed, preferably without the need to train new voices (if possible).
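
To illustrate the voice-switching workaround above, a minimal, untested sketch might look like the following. The model file names (a "normal" voice and a hypothetical "whisper" variant) are placeholders, and the exact flag spelling (`--sentence_silence`) may differ between Piper versions:

```python
"""Workaround sketch: render each text span with a different Piper voice,
then join the WAV files without the usual inter-sentence pause.
Model paths/names are placeholders, not real released voices."""
import subprocess
import wave

SEGMENTS = [
    ("This part is spoken normally,", "en_US-normal.onnx"),        # placeholder model
    ("and this part is whispered.", "en_US-whisper.onnx"),         # hypothetical whisper variant
]

def synthesize(text: str, model: str, out_path: str) -> None:
    # Feed the text on stdin; request zero extra silence between sentences.
    subprocess.run(
        ["piper", "--model", model, "--sentence_silence", "0",
         "--output_file", out_path],
        input=text.encode("utf-8"),
        check=True,
    )

def concatenate(paths: list[str], out_path: str) -> None:
    # Naive WAV concatenation; assumes all voices share the same sample rate/format.
    with wave.open(out_path, "wb") as out_wav:
        for i, path in enumerate(paths):
            with wave.open(path, "rb") as in_wav:
                if i == 0:
                    out_wav.setparams(in_wav.getparams())
                out_wav.writeframes(in_wav.readframes(in_wav.getnframes()))

if __name__ == "__main__":
    parts = []
    for i, (text, model) in enumerate(SEGMENTS):
        path = f"segment_{i}.wav"
        synthesize(text, model, path)
        parts.append(path)
    concatenate(parts, "combined.wav")
```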

@Daburnell112

I second this. Some sort of markup language would be nice, if such a thing exists.
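
For illustration only: Piper has no such markup today, but a hypothetical inline tag syntax could be split into styled spans with a few lines of Python and then fed to per-span synthesis. The tag names and the parser below are invented, not anything Piper supports:

```python
"""Hypothetical markup sketch: split tagged text into (style, span) pairs.
Tags like [whisper]...[/whisper] are made up for this example."""
import re

TAG_RE = re.compile(r"\[(\w+)\](.*?)\[/\1\]", re.DOTALL)

def parse_spans(text: str) -> list[tuple[str, str]]:
    """Return (style, text) pairs; untagged text gets the style 'neutral'."""
    spans = []
    pos = 0
    for match in TAG_RE.finditer(text):
        if match.start() > pos:
            spans.append(("neutral", text[pos:match.start()]))
        spans.append((match.group(1), match.group(2)))
        pos = match.end()
    if pos < len(text):
        spans.append(("neutral", text[pos:]))
    return spans

print(parse_spans("I am [whisper]very quiet here[/whisper] and loud again."))
# [('neutral', 'I am '), ('whisper', 'very quiet here'), ('neutral', ' and loud again.')]
```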

rmcpantoja (Contributor) commented May 19, 2024

Hi,
I will work on a pitch-conditioning model soon, and maybe open a PR with these additions alongside new updates on the Piper side.

@nmstoker

@porky11 - how could sarcasm or whispering be applied to an output voice without training on relevant voice recordings?

Is there some process you have in mind that could be applied to the audio to achieve this? My suspicion is that there isn't a viable way to do this (without the audio + training).

@synesthesiam (Contributor)

In my experience, you need audio data for each case (sarcasm, whispering, etc.), and then a multi-speaker model needs to be trained with each case as a different "speaker".

This is exactly what the Thorsten emotional voice does (German).
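
As a rough sketch of how that looks at synthesis time: each "tone" is just a speaker ID in the multi-speaker voice, selected with `--speaker`. The file names and the style name ("whisper") below are assumptions; the real mapping lives in the voice's `.onnx.json` config under `speaker_id_map`:

```python
"""Sketch: pick a 'tone' by choosing the matching speaker in a
multi-speaker Piper voice (e.g. the Thorsten emotional voice).
File names and style names are assumptions."""
import json
import subprocess

VOICE = "de_DE-thorsten_emotional-medium.onnx"  # assumed local model file

# speaker_id_map maps style names to integer speaker IDs in multi-speaker voices.
with open(VOICE + ".json", encoding="utf-8") as f:
    speaker_ids = json.load(f)["speaker_id_map"]

def say(text: str, style: str, out_path: str) -> None:
    # Each tone/emotion is a different speaker in the same model.
    subprocess.run(
        ["piper", "--model", VOICE, "--speaker", str(speaker_ids[style]),
         "--output_file", out_path],
        input=text.encode("utf-8"),
        check=True,
    )

say("Das ist ein Test.", "whisper", "whispered.wav")  # style name is an assumption
```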
