Diarization precision - is there way to improve it? #804
I have tried upgrading to Pyannote 3.1, and the problem persists. The alignment is pretty useless: even in a very controlled environment (i.e. a studio recording, a BBC podcast with 3 speakers), it is missing quite a bit. Has anyone had success in making this better?
OK, I figured out what I was doing wrong. I will leave the comment here in case someone has a similar problem, and will close the issue. When sending to diarization, I was using segments created by the transcription process. The segments were too long (i.e. 3-5 sentences), which meant that speakers sometimes changed in the middle and the model picked whichever speaker was most common in that segment. I have now switched to sending the segments created by the alignment process, which are much shorter, and the result is much better.
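To illustrate why segment length matters, here is a small sketch (hypothetical code, not WhisperX's actual implementation) of majority-overlap speaker assignment: a segment gets the speaker whose diarization turns cover most of it, so a long segment that spans a speaker change swallows the minority speaker's sentences.

```python
def assign_speaker(segment, turns):
    """Pick the speaker whose diarization turns overlap the segment most.

    segment: (start, end) in seconds
    turns:   list of (start, end, speaker) from diarization
    """
    s_start, s_end = segment
    overlap = {}
    for t_start, t_end, spk in turns:
        dur = min(s_end, t_end) - max(s_start, t_start)
        if dur > 0:
            overlap[spk] = overlap.get(spk, 0.0) + dur
    return max(overlap, key=overlap.get) if overlap else None

# Diarization says SPEAKER_00 talks 0-8 s, then SPEAKER_01 talks 8-11 s.
turns = [(0.0, 8.0, "SPEAKER_00"), (8.0, 11.0, "SPEAKER_01")]

# One long 0-11 s transcription segment: SPEAKER_01's sentence is swallowed.
print(assign_speaker((0.0, 11.0), turns))   # SPEAKER_00

# Two shorter, alignment-sized segments recover the speaker change.
print(assign_speaker((0.0, 8.0), turns))    # SPEAKER_00
print(assign_speaker((8.0, 11.0), turns))   # SPEAKER_01
```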
@nikola1975 I am having the same issue, but your solution (the default code example in the README) doesn't solve it. Here's my code:

```python
options = {
    "max_new_tokens": None,
    "clip_timestamps": None,
    "hallucination_silence_threshold": None,
}
model = whisperx.load_model("large-v3", device, compute_type=compute_type, download_root=model_dir, language=language, asr_options=options)
audio = whisperx.load_audio(file_path)
result = model.transcribe(audio, batch_size=batch_size, chunk_size=10, print_progress=True)

model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)

diarize_model = whisperx.DiarizationPipeline(use_auth_token=HF_TOKEN, device=device)
diarize_segments = diarize_model(audio, min_speakers=min_speakers)
result = whisperx.assign_word_speakers(diarize_segments, result)
```
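One thing worth checking after `assign_word_speakers`: each word in the result carries its own speaker label, so mixed segments can be re-split at speaker changes instead of trusting the segment-level label. A minimal sketch (the word-dict shape is assumed from WhisperX's aligned output; this helper is not part of the library):

```python
def split_on_speaker_change(segment):
    """Yield sub-segments, one per run of consecutive same-speaker words."""
    current = None
    for word in segment["words"]:
        spk = word.get("speaker")
        if current is None or spk != current["speaker"]:
            if current is not None:
                yield current
            current = {"speaker": spk, "start": word["start"],
                       "end": word["end"], "text": word["word"]}
        else:
            # Same speaker: extend the running sub-segment.
            current["end"] = word["end"]
            current["text"] += " " + word["word"]
    if current is not None:
        yield current

# Toy segment whose last word belongs to a different speaker.
segment = {"words": [
    {"word": "Hello", "start": 0.0, "end": 0.4, "speaker": "SPEAKER_01"},
    {"word": "there.", "start": 0.5, "end": 0.9, "speaker": "SPEAKER_01"},
    {"word": "Thanks.", "start": 1.1, "end": 1.6, "speaker": "SPEAKER_02"},
]}
for sub in split_on_speaker_change(segment):
    print(sub["speaker"], sub["text"])
# SPEAKER_01 Hello there.
# SPEAKER_02 Thanks.
```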
Are you getting poor results from the diarization overall, or is it wrongly recognizing speakers? My results are not 100% precise now, but they are relatively close to it. I am not sure what your expectations are :) I suppose you are using the Pyannote 3.1 model? Try running the diarization through this link and check whether you get the same results:
I am running speaker diarization with Pyannote 3.0.1 and am struggling to improve the results. The change of speakers is recognized relatively well in English, but the alignment is a bit hit and miss. Sometimes whole sentences from the next speaker are left in the previous speaker's segment, and the like.
Here is the audio file:
https://s3.eu-central-2.wasabisys.com/qira/12/2024/5/inourtime-hobsbawm_6min_1_1715693802051/inourtime-hobsbawm_6min_1_1715693802051.mp3
Here is an example from the beginning of the file: the last sentence of the first segment is spoken by SPEAKER_02, but it stays inside SPEAKER_01's segment. A new segment then starts only at the beginning of the next sentence.
Any way to improve this?
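One knob that may help when the speaker count is known in advance (it is 3 for this podcast): pin both bounds instead of only `min_speakers`. The WhisperX pipeline forwards these to pyannote; the exact values here are just this example's assumption.

```python
# Fragment: diarize_model as constructed earlier in this thread.
diarize_segments = diarize_model(audio, min_speakers=3, max_speakers=3)
```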