change alignment library from whisperx to ctc-forced-aligner #184

Merged · 5 commits from mms_align into main · May 20, 2024

Conversation

MahmoudAshraf97
Owner

Pros:

  • Around 2× faster
  • Uses a single universal multilingual model, which is better suited to deployment scenarios than switching per-language models (the old model switching still works if needed)
  • Doesn't need any Whisper timestamps (segments or words), so it can run in parallel with transcription

Cons:

  • The default model has a non-commercial license, but it can be swapped out if needed
  • whisperX is still needed for transcription, so it remains an extra dependency
  • Although the results should be very close, this path has not yet been tested against the stable implementation currently in use
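The "run in parallel with transcription" point can be sketched as below. The `transcribe` and `generate_emissions` functions are placeholder stubs, not the real whisperX or ctc-forced-aligner APIs; the point is only that the alignment emission pass needs the raw audio, not Whisper's output, so the two can be submitted concurrently:

```python
from concurrent.futures import ThreadPoolExecutor

def transcribe(audio_path):
    # Stand-in for whisperX transcription (hypothetical).
    return {"text": "hello world"}

def generate_emissions(audio_path):
    # Stand-in for the aligner's emission pass (hypothetical).
    # It consumes the raw audio only, no Whisper timestamps.
    return [[0.1, 0.9], [0.8, 0.2]]

audio = "sample.wav"
with ThreadPoolExecutor(max_workers=2) as pool:
    t_future = pool.submit(transcribe, audio)
    e_future = pool.submit(generate_emissions, audio)
    transcript = t_future.result()
    emissions = e_future.result()
```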

@transcriptionstream
Contributor

transcriptionstream commented May 7, 2024

Wondering where the license info for the universal multilingual model can be found.

@MahmoudAshraf97
Owner Author

Hi @transcriptionstream
https://huggingface.co/MahmoudAshraf/mms-300m-1130-forced-aligner
I was about to request your review btw :)

@transcriptionstream
Contributor

transcriptionstream commented May 8, 2024

I'll work on getting a build going and test it out. Intrigued by the performance increase. I've done over 30k diarizations (and counting) for a recent client using the old model; the speed increase with this model sounds wild and game-changing!

@transcriptionstream
Contributor

transcriptionstream commented May 8, 2024

Getting the following errors when trying to build this branch:

Some weights of the model checkpoint at MahmoudAshraf/mms-300m-1130-forced-aligner were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at MahmoudAshraf/mms-300m-1130-forced-aligner and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

File "diarize.py", line 134, in <module>
    emissions, stride = generate_emissions(
File "alignment_utils.py", line 129, in generate_emissions
    emissions_ = model(input_batch).logits
File "module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
File "modeling_wav2vec2.py", line 1969, in forward
    outputs = self.wav2vec2(
File "module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
File "modeling_wav2vec2.py", line 1554, in forward
    extract_features = self.feature_extractor(input_values)
File "module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
File "modeling_wav2vec2.py", line 461, in forward
    hidden_states = conv_layer(hidden_states)
File "module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
File "modeling_wav2vec2.py", line 336, in forward
    hidden_states = self.conv(hidden_states)
File "module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
File "conv.py", line 313, in forward
    return self._conv_forward(input, self.weight, self.bias)
File "conv.py", line 309, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,

RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'

@MahmoudAshraf97
Owner Author

You can ignore the first warning (see huggingface/transformers#30628).
The second error is fixed: the model was being loaded in float16, which isn't supported on CPU, so I added a device check first.
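A minimal sketch of the device check described here (the actual fix lives in the branch; dtypes are represented as strings to keep the example dependency-free, where the real code would use torch dtypes):

```python
def pick_alignment_dtype(device: str) -> str:
    # Hypothetical helper illustrating the fix: float16 conv kernels
    # are only implemented on CUDA, and loading the aligner in half
    # precision on CPU triggers
    #   RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'
    # so fall back to float32 whenever we're not on a GPU.
    return "float16" if device.startswith("cuda") else "float32"

print(pick_alignment_dtype("cuda:0"))  # float16
print(pick_alignment_dtype("cpu"))     # float32
```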

@transcriptionstream
Contributor

transcriptionstream commented May 10, 2024

Thanks! Got it built and am running it through its paces. So far so good. I'm trying to get some good benchmarks on the speed improvement; quick tests show it's definitely faster and the output is consistent with whisperX. Would love to try it in a prod env if the license can be modified.

@MahmoudAshraf97
Owner Author

Unfortunately the license is the model owners' decision; I just re-uploaded it to HF. You can mitigate that by using another English model with a suitable license, which also works for all languages other than English because the idea is the same (romanize and normalize all languages to match the model vocab).
I suggest using https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-english
By the way, commercial usage != production usage, so you might check whether your usage actually counts as commercial.
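The romanize-and-normalize idea can be illustrated with a toy normalizer. This is illustrative only and not the aligner's actual romanization pipeline; it just shows how arbitrary text can be coerced into a plain Latin character set like an English wav2vec2 vocabulary:

```python
import re
import unicodedata

def normalize_for_vocab(text: str) -> str:
    # Toy version of "normalize all languages to match model vocab":
    # decompose accented characters, lowercase, strip the combining
    # diacritic marks, and drop anything outside a-z, apostrophe, space.
    text = unicodedata.normalize("NFKD", text).lower()
    text = "".join(c for c in text if not unicodedata.combining(c))
    return re.sub(r"[^a-z' ]+", " ", text).strip()

print(normalize_for_vocab("Café!"))  # cafe
```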

@transcriptionstream
Contributor

transcriptionstream commented May 14, 2024

Unfortunately the license is the decision of the model owners, I just reuploaded it to HF, but you can mitigate that by using another english model that has a suitable license, which works for all languages other than english too because it is the same idea (romanize and normalize all languages to match model vocab) I suggest using https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-english btw commercial usage != production usage, so I guess you might revise if your usage is considered commercial or not

Any chance you can put me in contact with the model owners? I'd love to ask some questions and see what they'd need to license it for commercial use.

@MahmoudAshraf97
Owner Author

These are all the relevant links I have; I don't have direct contact information, unfortunately:
https://llama.meta.com/faq/#legal
https://arxiv.org/abs/2305.13516

@MahmoudAshraf97 MahmoudAshraf97 merged commit 00aa56c into main May 20, 2024
2 checks passed
@MahmoudAshraf97 MahmoudAshraf97 deleted the mms_align branch May 23, 2024 14:13