
More efficient batch inference resulting in large-v2 with *60-70x REAL TIME speed (now in custom v3 branch, see comment for details) #159

Open
DavidFarago opened this issue Apr 3, 2023 · 30 comments
Labels: enhancement (New feature or request), question (Further information is requested)

Comments

@DavidFarago

The README.md says "more efficient batch inference resulting in large-v2 with *60-70x REAL TIME speed (not provided in this repo)".

Will this eventually be integrated into this repo, too? That would be really awesome. If so, is there a rough time estimate for when it will be integrated?

Is this related to #57?

@m-bain
Owner

m-bain commented Apr 3, 2023

For ease of use, we decided to just import OpenAI's Whisper implementation for the transcription stage, which doesn't support batching. The one in the previous commit has some accuracy issues which I don't have time to debug right now.

The 70x real time described in the paper was using a custom implementation of Whisper with batching that I won't be open-sourcing for the time being.

Note that others have had success using faster-whisper as a drop-in replacement for whisper in this repo:
https://github.com/guillaumekln/faster-whisper
This should give a speed-up, albeit not due to batching (and it won't take full advantage of high-performance GPUs).

There are quite a lot of different use-cases and trade-offs, which are hard to support entirely in this repo (faster-whisper, real-time transcription, low GPU memory requirements, etc.).

For large-scale / business use-cases I will be providing an API soon (~1/3 of the price of OpenAI's API), and I am also available to consult.

@mezaros

mezaros commented Apr 4, 2023

David doesn't mention it, but this text is a change from what your README said earlier. You had previously announced that this code would be open-sourced and was coming soon to the repo.

Extremely disappointing that others now need to duplicate this effort.

@m-bain
Owner

m-bain commented Apr 4, 2023

@mezaros although it may be a disappointment to you, this repo is intended for research purposes, and all the algorithms and pipelines in the paper have been open-sourced. But thank you for the feedback.

@Infinitay

Infinitay commented Apr 4, 2023

The 70x real time described in the paper was using a custom implementation of Whisper with batching that I won't be open-sourcing for the time being.

I'm looking forward to when you feel comfortable open-sourcing the batch processing. I rely on whisperX for transcribing YouTube videos for (better) captions and past broadcasts on livestreaming platforms, and for later translating them. The speed-up would be nice for the past broadcasts because they can span hours in length, so transcribing them currently takes almost as long as the videos themselves.

Also, did you end up publishing an updated or final version of the paper? I'm not seeing where the number for up to a 70x speed-up comes from in 2303.00747.

Repository owner deleted a comment from arnavmehta7 Apr 4, 2023
@m-bain
Owner

m-bain commented Apr 4, 2023

Thanks, have you tried the faster-whisper drop-in mentioned above? This should give you a ~4-5x speed-up.

Also, did you end up publishing an updated or final version of the paper? I'm not seeing where the number for up to a 70x speed-up comes from in 2303.00747.

The number in the table was normalized over OpenAI's large-v2 inference speed -- which was already running at 6x real time on our V100 GPU with the VAD filter (so roughly 12x that, i.e. ~70x real time, with ours).

@Infinitay

Thanks, have you tried the faster-whisper drop-in mentioned above? This should give you a ~4-5x speed-up.

Not as of yet. I figured I'd wait for it to eventually be merged upstream from the existing PR, but I guess that won't be the case anymore given v2. I've actually been backlogging watching some videos because of this, but I suppose I'll finally give it a try now that it won't be implemented officially.

Thanks again for all your work

@m-bain
Owner

m-bain commented Apr 4, 2023

I see. I can look into adding faster-whisper as an optional import when I have some time (I just don't want to force it, since it needs specific CUDA/cuDNN versions).

@m-bain
Owner

m-bain commented Apr 5, 2023

Update: I did some speed benchmarking on GPU. faster-whisper is good, it seems, and pretty fast all things considered.


Model details

whisper_arch: large-v2
beam_size: 5

Speed benchmark

File name: DanielKahneman_2010.wav
File duration: 20 min 37 s
GPU: NVIDIA RTX 8000
Batch size: 16 (for whisperX)

| Method | Inference time (s) | Inference speed (real-time multiple) | Avg. WER (TEDLIUM test) |
| --- | --- | --- | --- |
| openai | 232.8 | 5.28x | 9.54 |
| faster-whisper | 62.4 | 19.7x | 9.94 |
| whisperX-batched (VAD+ASR) | 17.8 | 69.1x | 9.46 |
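For reference, the real-time multiples in the table are consistent with dividing the audio duration by the inference time; the quick check below (the duration value is taken from the 20 min 37 s figure above, so this is only an approximate cross-check) lands within a couple of percent of the reported numbers:

# Quick sanity check: real-time multiple ~= audio duration / inference time.
# The ~2% gap vs. the table suggests the audio is a few seconds shorter than 20:37.
duration_s = 20 * 60 + 37  # 20 min 37 s, as listed above

for method, secs in [("openai", 232.8), ("faster-whisper", 62.4), ("whisperX-batched", 17.8)]:
    print(f"{method}: {duration_s / secs:.1f}x real time")
# -> openai: 5.3x, faster-whisper: 19.8x, whisperX-batched: 69.5x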

m-bain added the question label Apr 5, 2023
@Infinitay

Infinitay commented Apr 5, 2023

The API you mentioned would be nice for those of us on personal computers that can't utilize batched whisperX (when/if it is open-sourced) due to GPU limitations. I would have expected a higher WER for faster-whisper, but the difference seems almost negligible. Just to confirm: when testing faster-whisper, did you still use VAD? They added support for Silero VAD a few days ago, IIRC.

@dustinjoe

Update: I did some speed benchmarking on GPU. faster-whisper is good, it seems, and pretty fast all things considered. […]

Hi, is this FP16 precision for faster-whisper here? Thanks.

@m-bain
Owner

m-bain commented Apr 6, 2023

FP16, without VAD

m-bain pinned this issue Apr 6, 2023
@dustinjoe

dustinjoe commented Apr 6, 2023

FP16, without VAD

Thanks. Looking forward to your batch inference in a future WhisperX. I am actually trying to combine it with pyannote diarization. The batch inference that was removed from WhisperX (due to the error-rate problem, I think) was about twice as fast as this FP16 faster-whisper in my tests.

@mrmachine

How do I go about actually dropping in the drop-in replacement faster-whisper? Do I just pip install it before or after installing whisperx?

@yigitkonur

When will you be releasing the API service you mentioned, @m-bain? I'm really looking forward to it!

@RaulKite

Will whisperX take any advantage of this?

https://twitter.com/sanchitgandhi99/status/1649046650793648128?s=46&t=ApbND8sYhhD91NQ3JEdDbA

Whisper JAX ⚡️ is a highly optimised Whisper implementation for both GPU and TPU


@m-bain
Owner

m-bain commented Apr 25, 2023

Will whisperX take any advantage of this?

I found Whisper JAX to use crazy amounts of GPU memory (48GB?), and it also led to worse transcription quality.

Anyway, I am now open-sourcing WhisperX v3 (see the prerelease branch here), which includes the 70x realtime batched inference with <16GB GPU memory, using faster-whisper as a backend. The transcription quality is just as good as the original method.

If you want to try it out, check out the v3 branch and let me know if you run into any issues (still testing):
https://github.com/m-bain/whisperX/tree/v3

I am postponing building the API because it was taking up too much time from my PhD. I will return to it once I have improved the diarization -- a lot of work is needed on that front.

@DavidFarago @RaulKite @dustinjoe @mrmachine @Infinitay @mezaros

m-bain changed the title from "More efficient batch inference resulting in large-v2 with *60-70x REAL TIME speed (not provided in this repo)" to "More efficient batch inference resulting in large-v2 with *60-70x REAL TIME speed (now in custom v3 branch, see comment for details)" Apr 25, 2023
m-bain modified the milestone: MVP with pyannote Apr 25, 2023
m-bain added the enhancement label Apr 25, 2023
@Infinitay

I'm both surprised and very thankful that you've decided to open-source your batch improvements. I look forward to using it, but I hope I won't exceed 10GB since I'm limited by my 3080. On the other hand, sorry to hear that you won't be able to monetize your work with an API due to your research. I hope it all works out well for you. Looking forward to the future of whisperX, as the short-term changes have been incredible so far.

@m-bain
Owner

m-bain commented Apr 25, 2023

I hope I won't exceed 10GB since I'm limited by my 3080

I haven't benchmarked it, but you should be able to get the memory requirements below 10GB with any of the following (see the sketch after this list):

  1. Reduce the batch size (e.g. --batch_size 4, though this will reduce transcription speed to maybe 50x).
  2. Change the faster-whisper compute type: --compute_type int8.
  3. Use a smaller Whisper model, like --model small or --model base.

2 & 3 might reduce transcription quality, though it's worth playing around to see.
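A minimal sketch of how those options combine, assuming the v3 Python API (whisperx.load_audio / load_model / transcribe) behaves as described in the v3 branch; parameter names and defaults may differ between versions:

# Hedged sketch: lower-memory settings for batched whisperX v3 on a ~10GB GPU (e.g. an RTX 3080).
# Assumes the v3 Python API; check the v3 branch README for the exact signatures in your version.
import whisperx

device = "cuda"
audio = whisperx.load_audio("audio.wav")

# Option 2: int8 compute type for the faster-whisper backend (may cost some accuracy).
# Option 3 would be swapping "large-v2" for a smaller model such as "small" or "base".
model = whisperx.load_model("large-v2", device, compute_type="int8")

# Option 1: a smaller batch size lowers peak memory at the cost of some of the 70x speed-up.
result = model.transcribe(audio, batch_size=4)
print(result["segments"])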

Looking forward to the future of whisperX, as the short-term changes have been incredible so far.

Thanks for your kind words, I am glad it has helped you. I will try to figure out over the next few months how best to keep whisperX improving sustainably.

@dustinjoe

dustinjoe commented Apr 26, 2023

Really, thank you for your efforts on this great work! I had a trial of v3 and the batch inference is working properly. Yeah, I totally agree with your opinion on the difficulty of adding diarization to this efficiently. As far as I can see, making 30-second chunks can often mix different speakers' sentences, which makes it really difficult to differentiate them later. So the GPU would not be utilized efficiently without batch inference when diarization is running. A little question: would multiprocessing help somehow, as a temporary solution, for combining ASR and diarization? Thanks

@guillaumekln

Hello @m-bain, it's great to know that you are trying to use faster-whisper for batch execution.

It should work well overall, but there is currently one limitation regarding the prompt tokens. The implementation currently requires that each prompt has the same number of "previous text tokens" (or, put differently, the "start of transcript" token must be at the same position for every item in the batch). I don't know if you have already faced this limitation or if you are able to effectively work around it.

Let me know if there are other issues.

@m-bain
Owner

m-bain commented Apr 28, 2023

@guillaumekln thanks for faster-whisper! I was previously using a custom implementation, but yours really speeds up beam_size>1 and reduces GPU memory requirements 👌🏽

Yes, there are a few limitations / assumptions when doing batched transcription, but transcription quality remained high. These assumptions are:

(i) Transcribe with without_timestamps=True. This is necessary because otherwise Whisper might do multiple forward passes within a 30s sample (delaying the whole batch), and it can also lead to repetition etc.

(ii) Identical prompt tokens, like you say. I find it's not an issue, since --condition_on_previous_text False is the more robust setting when I compare on benchmarks.

Of course, (i) can be quite limiting due to the need for timestamped transcripts, but in WhisperX timestamps are sourced from VAD & wav2vec2 alignment -- from my research findings, Whisper timestamps were just too unreliable.
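As a concrete (hedged) illustration of assumptions (i) and (ii), this is how the two settings look on the standalone faster-whisper API; it is not whisperX's internal batching, just the per-call flags being discussed here:

# Sketch only: the two settings from (i) and (ii), expressed via the standalone
# faster-whisper API (whisperX wraps this backend and handles VAD/alignment itself).
from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

segments, info = model.transcribe(
    "audio.wav",
    beam_size=5,
    without_timestamps=True,           # (i) no in-model timestamps; avoids extra forward passes per 30s window
    condition_on_previous_text=False,  # (ii) no previous-text prompt, so prompts are identical across items
)

for segment in segments:
    print(segment.text)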

@sorgfresser
Contributor

Very surprised and happy to hear you're open-sourcing batch inference. I managed to get whisper.cpp to work with whisperX v2. It was not really a drop-in, but not too many changes had to be made either. Now that you've enabled batch inference, there is no need for any kind of PR for this; instead, this might be the correct thread to share it (if you think this doesn't fit in here, please give me a hint).
For whisper.cpp there does not appear to be any padding necessary (at least my benchmark tells me so), so we can simply remove the rewritten transcribe() and use this:

# run whisper.cpp on the VAD segment's audio (1 processor)
model.context.full_parallel(model.params, seg_audio, 1)
output["segments"].append(
    {
        "start": seg_t["start"],
        "end": seg_t["end"],
        # join the text of every whisper.cpp segment produced for this chunk
        "text": "".join(
            model.context.full_get_segment_text(i)
            for i in range(model.context.full_n_segments())
        ),
    }
)

Optionally, you can add padding by replacing model.context.full_parallel(model.params, seg_audio, 1) with:

# pad/trim the segment to Whisper's fixed 30-second window (N_SAMPLES) before inference
padded_audio = pad_or_trim(seg_audio, N_SAMPLES)
model.context.full_parallel(model.params, padded_audio, 1)

I am using the Python bindings by aarnphm here. When using fine-tuned models, the whole segment part could be broken; I'd advise using single-segment mode in that case.

@DigilConfianz

DigilConfianz commented May 8, 2023

Not sure if this is the right thread, but is it possible to reduce the pyannote diarization time too, by using some logic similar to that of faster-whisper? I.e., using CTranslate2, reducing floating-point precision, some sort of batching, etc.? Currently, diarization takes more time than the transcription itself. @guillaumekln @m-bain

@sorgfresser
Contributor

The quicker fix would certainly be exporting the pyannote model to ONNX. That should speed it up too.
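A rough sketch of what that export could look like, with the caveat that the checkpoint name, input shape, and opset below are my assumptions, and that a full pyannote diarization pipeline (embedding + clustering) involves more than this single model:

# Rough sketch only (assumptions: checkpoint name, 10s mono 16kHz input, opset 17).
# Exports the pyannote segmentation model to ONNX; the embedding/clustering stages
# of a full diarization pipeline are not covered by this single export.
import torch
from pyannote.audio import Model

model = Model.from_pretrained("pyannote/segmentation")  # hypothetical checkpoint choice; may require an auth token
model.eval()

dummy_waveform = torch.randn(1, 1, 16000 * 10)  # (batch, channel, samples)

torch.onnx.export(
    model,
    dummy_waveform,
    "segmentation.onnx",
    opset_version=17,
    input_names=["waveform"],
    output_names=["frames"],
)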

@ozancaglayan

Hi, thanks. Any pointers to the minimal amount of code required to wrap faster-whisper to add support for this? Also, is this batching the VAD segments of a given audio file and disabling --condition_on_previous_text? Or is it segmenting the file with VAD, concatenating it back, and then chunking it into 30-second segments to apply batching?

@m-bain
Owner

m-bain commented May 9, 2023

Hi, thanks. Any pointers to the minimal amount of code required to wrap faster-whisper to add support for this?

@ozancaglayan The main branch does exactly this.

Not sure if this is the right thread, but is it possible to reduce the pyannote diarization time too, by using some logic similar to that of faster-whisper?

@DigilConfianz yes, pyannote is pretty slow.
For the video understanding research projects in our lab, we don't actually use pyannote, but rather https://github.com/JaesungHuh/SimpleDiarization by @JaesungHuh.

It's a lot faster, and we found it effective for dialogue in movie scenes when constraining the diarization to sentence segments. See Appendix Section A (page 13) of https://www.robots.ox.ac.uk/~vgg/publications/2023/Han23/han23.pdf

I will add support for this diarization module at some point.

@MyraBaba

@m-bain is v3 still open-sourced? The link is giving a 404.

@samuelbradshaw

@m-bain is v3 still open-sourced? The link is giving a 404.

I'm not sure, but I think the v3 branch linked above has been merged into the main branch.

@joiemoie

joiemoie commented Nov 3, 2023

Would batching be able to support multiple audio files, such as multiple user requests from Triton?

@mohith7548

I'm looking to transcribe multiple audio files at once with WhisperX - purely batch inference. Can anyone point me in the right direction?
