change alignment library from whisperx to ctc-forced-aligner #184

Merged · 5 commits from mms_align into main · May 20, 2024

Conversation

MahmoudAshraf97
Owner

Pros:

  • Around 2× faster
  • Uses a single universal multilingual model, which is better suited to deployment scenarios than switching per-language models (the old model switching still works if needed)
  • Doesn't need any Whisper timestamps (segments or words), so it can run in parallel with transcription

Cons:

  • The default model has a non-commercial license, but it can be swapped out if needed
  • whisperX is still needed for transcription, so it remains an extra dependency
  • Although the results should be very close, this path has not yet been tested against the stable implementation currently in use
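The "run in parallel with transcription" point can be sketched as below. The `transcribe` and `generate_emissions` functions are placeholder stubs, not the real whisperX or ctc-forced-aligner APIs; the point is only that the alignment emission pass needs the raw audio, not Whisper's output, so the two can be submitted concurrently:

```python
from concurrent.futures import ThreadPoolExecutor

def transcribe(audio_path):
    # Stand-in for whisperX transcription (hypothetical).
    return {"text": "hello world"}

def generate_emissions(audio_path):
    # Stand-in for the aligner's emission pass (hypothetical).
    # It consumes the raw audio only, no Whisper timestamps.
    return [[0.1, 0.9], [0.8, 0.2]]

audio = "sample.wav"
with ThreadPoolExecutor(max_workers=2) as pool:
    t_future = pool.submit(transcribe, audio)
    e_future = pool.submit(generate_emissions, audio)
    transcript = t_future.result()
    emissions = e_future.result()
```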

@transcriptionstream
Contributor

transcriptionstream commented May 7, 2024

Wondering where the license info for the universal multilingual model can be found.

@MahmoudAshraf97
Owner Author

Hi @transcriptionstream
https://huggingface.co/MahmoudAshraf/mms-300m-1130-forced-aligner
I was about to request your review btw :)

@transcriptionstream
Contributor

transcriptionstream commented May 8, 2024

I'll work on getting a build going and test it out. Intrigued by the performance increase. I've done over 30k diarizations (and counting) for a recent client using the old model; the speed increase with this model sounds wild and game-changing!

@transcriptionstream
Contributor

transcriptionstream commented May 8, 2024

Getting the following errors when trying to build this branch:

Some weights of the model checkpoint at MahmoudAshraf/mms-300m-1130-forced-aligner were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at MahmoudAshraf/mms-300m-1130-forced-aligner and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

File "diarize.py", line 134, in <module>
    emissions, stride = generate_emissions(
File "alignment_utils.py", line 129, in generate_emissions
    emissions_ = model(input_batch).logits
File "module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
File "modeling_wav2vec2.py", line 1969, in forward
    outputs = self.wav2vec2(
File "module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
File "modeling_wav2vec2.py", line 1554, in forward
    extract_features = self.feature_extractor(input_values)
File "module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
File "modeling_wav2vec2.py", line 461, in forward
    hidden_states = conv_layer(hidden_states)
File "module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
File "modeling_wav2vec2.py", line 336, in forward
    hidden_states = self.conv(hidden_states)
File "module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
File "conv.py", line 313, in forward
    return self._conv_forward(input, self.weight, self.bias)
File "conv.py", line 309, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,

RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'

@MahmoudAshraf97
Owner Author

You can ignore the first warning (see huggingface/transformers#30628).
The second error is fixed: the model was being loaded in float16, which isn't supported on CPU, so I added a device check first.
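A minimal sketch of the device check described here (the actual fix lives in the branch; dtypes are represented as strings to keep the example dependency-free, where the real code would use torch dtypes):

```python
def pick_alignment_dtype(device: str) -> str:
    # Hypothetical helper illustrating the fix: float16 conv kernels
    # are only implemented on CUDA, and loading the aligner in half
    # precision on CPU triggers
    #   RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'
    # so fall back to float32 whenever we're not on a GPU.
    return "float16" if device.startswith("cuda") else "float32"

print(pick_alignment_dtype("cuda:0"))  # float16
print(pick_alignment_dtype("cpu"))     # float32
```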

@transcriptionstream
Contributor

transcriptionstream commented May 10, 2024

Thanks! Got it built and am running it through its paces. So far so good. I'm trying to get some good benchmarks on the speed improvement; quick tests show it's definitely faster and the output is consistent with whisperX. Would love to try it in a prod env if the license can be modified.

@MahmoudAshraf97
Owner Author

Unfortunately the license is the model owners' decision; I just re-uploaded it to HF. You can mitigate that by using another English model with a suitable license, which also works for all languages other than English because the idea is the same (romanize and normalize all languages to match the model vocab).
I suggest using https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-english
By the way, commercial usage != production usage, so you might check whether your usage actually counts as commercial.
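The romanize-and-normalize idea can be illustrated with a toy normalizer. This is illustrative only and not the aligner's actual romanization pipeline; it just shows how arbitrary text can be coerced into a plain Latin character set like an English wav2vec2 vocabulary:

```python
import re
import unicodedata

def normalize_for_vocab(text: str) -> str:
    # Toy version of "normalize all languages to match model vocab":
    # decompose accented characters, lowercase, strip the combining
    # diacritic marks, and drop anything outside a-z, apostrophe, space.
    text = unicodedata.normalize("NFKD", text).lower()
    text = "".join(c for c in text if not unicodedata.combining(c))
    return re.sub(r"[^a-z' ]+", " ", text).strip()

print(normalize_for_vocab("Café!"))  # cafe
```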

@transcriptionstream
Contributor

transcriptionstream commented May 14, 2024

Unfortunately the license is the decision of the model owners, I just reuploaded it to HF, but you can mitigate that by using another english model that has a suitable license, which works for all languages other than english too because it is the same idea (romanize and normalize all languages to match model vocab) I suggest using https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-english btw commercial usage != production usage, so I guess you might revise if your usage is considered commercial or not

Any chance you can put me in contact with the model owners? I'd love to ask some questions and see what they'd need to license it for commercial use.

@MahmoudAshraf97
Owner Author

These are all the relevant links I have; I don't have direct contact information, unfortunately:
https://llama.meta.com/faq/#legal
https://arxiv.org/abs/2305.13516

@MahmoudAshraf97 MahmoudAshraf97 merged commit 00aa56c into main May 20, 2024
2 checks passed
@MahmoudAshraf97 MahmoudAshraf97 deleted the mms_align branch May 23, 2024 14:13