
More efficient batch inference resulting in large-v2 with *60-70x REAL TIME speed (now in custom v3 branch, see comment for details) #159

Open
DavidFarago opened this issue Apr 3, 2023 · 30 comments
Labels: enhancement (New feature or request), question (Further information is requested)

Comments

@DavidFarago

The README.md says "more efficient batch inference resulting in large-v2 with *60-70x REAL TIME speed (not provided in this repo)".

Will this eventually be integrated into this repo, too? That would be really awesome. If so, is there a rough time estimate for when it will be integrated?

Is this related to #57?

@m-bain
Owner

m-bain commented Apr 3, 2023

For ease of use, we decided to just import OpenAI's Whisper implementation for the transcription stage, which doesn't support batching. The one in the previous commit has some accuracy issues which I don't have time to debug right now.

The 70x real time described in the paper was using a custom implementation of Whisper with batching that I won't be open-sourcing for the time being.

Note that others have had success using faster-whisper as a drop-in replacement for whisper in this repo:
https://github.com/guillaumekln/faster-whisper
This should give a speed-up, albeit not due to batching (and it won't take full advantage of high-performance GPUs).

There are quite a lot of different use-cases and trade-offs, which are hard to support entirely in this repo (faster-whisper, real-time transcription, low GPU memory requirements, etc.).

For large-scale / business use-cases I will be providing an API soon (~1/3 of the price of OpenAI's API), and I am also available to consult.

@mezaros

mezaros commented Apr 4, 2023

David doesn't mention it, but this text is a change from what your README said earlier. You had previously announced that this code would be open-sourced and was coming soon to the repo.

Extremely disappointing that others now need to duplicate this effort.

@m-bain
Owner

m-bain commented Apr 4, 2023

@mezaros although it may be a disappointment to you, this repo is intended for research purposes, and all the algorithms and pipelines in the paper have been open-sourced. But thank you for the feedback.

@Infinitay

Infinitay commented Apr 4, 2023

The 70x real time described in the paper was using a custom implementation of Whisper with batching that I won't be open-sourcing for the time being.

I'm looking forward to when you feel comfortable open-sourcing the batch processing. I rely on whisperX for transcribing YouTube videos for (better) captions and past broadcasts on livestreaming platforms, and for later translating them. The speed-up would be nice for the past broadcasts because they can span hours in length, so transcribing them currently takes almost as long as the videos themselves.

Also, did you end up publishing an updated or final version of the paper? I'm not seeing where the number for up to a 70x speed-up comes from in 2303.00747.

Repository owner deleted a comment from arnavmehta7 Apr 4, 2023
@m-bain
Owner

m-bain commented Apr 4, 2023

Thanks, have you tried the faster-whisper drop-in mentioned above? This should give you a ~4-5x speed-up.

Also, did you end up publishing an updated or final version of the paper? I'm not seeing where the number for up to a 70x speed-up comes from in 2303.00747.

The number in the table was normalized over OpenAI's large-v2 inference speed -- which was already running at 6x real time on our V100 GPU with the VAD filter (so roughly 12x that, i.e. ~70x real time, with ours).

@Infinitay

Thanks, have you tried the faster-whisper drop-in mentioned above? This should give you a ~4-5x speed-up.

Not as of yet. I figured I'd wait for it to eventually be merged upstream from the existing PR, but I guess that won't be the case anymore given v2. I've actually been backlogging watching some videos because of this, but I suppose I'll finally give it a try now that it won't be implemented officially.

Thanks again for all your work

@m-bain
Owner

m-bain commented Apr 4, 2023

I see. I can look into adding faster-whisper as an optional import when I have some time (I just don't want to force it, since it needs specific CUDA/cuDNN versions).

@m-bain
Owner

m-bain commented Apr 5, 2023

Update: I did some speed benchmarking on GPU. faster-whisper is good, it seems, and pretty fast all things considered.


Model details

whisper_arch: large-v2
beam_size: 5

Speed benchmark

File name: DanielKahneman_2010.wav
File duration: 20 min 37 s
GPU: NVIDIA RTX 8000
Batch size: 16 (for whisperX)

| Method | Inference time (s) | Inference speed (real-time multiple) | Avg. WER (TEDLIUM test) |
| --- | --- | --- | --- |
| openai | 232.8 | 5.28x | 9.54 |
| faster-whisper | 62.4 | 19.7x | 9.94 |
| whisperX-batched (VAD+ASR) | 17.8 | 69.1x | 9.46 |
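For reference, the real-time multiples in the table are consistent with dividing the audio duration by the inference time; the quick check below (the duration value is taken from the 20 min 37 s figure above, so this is only an approximate cross-check) lands within a couple of percent of the reported numbers:

# Quick sanity check: real-time multiple ~= audio duration / inference time.
# The ~2% gap vs. the table suggests the audio is a few seconds shorter than 20:37.
duration_s = 20 * 60 + 37  # 20 min 37 s, as listed above

for method, secs in [("openai", 232.8), ("faster-whisper", 62.4), ("whisperX-batched", 17.8)]:
    print(f"{method}: {duration_s / secs:.1f}x real time")
# -> openai: 5.3x, faster-whisper: 19.8x, whisperX-batched: 69.5x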

m-bain added the question label Apr 5, 2023
@Infinitay

Infinitay commented Apr 5, 2023

The API you mentioned would be nice for those of us on personal computers that can't utilize batched whisperX (when/if it is open-sourced) due to GPU limitations. I would have expected a higher WER for faster-whisper, but the difference seems almost negligible. Just to confirm: when testing faster-whisper, did you still use VAD? They added support for Silero VAD a few days ago, IIRC.

@dustinjoe

Update: I did some speed benchmarking on GPU. faster-whisper is good, it seems, and pretty fast all things considered. […]

Hi, is this FP16 precision for faster-whisper here? Thanks.

@m-bain
Owner

m-bain commented Apr 6, 2023

FP16, without VAD

m-bain pinned this issue Apr 6, 2023
@dustinjoe

dustinjoe commented Apr 6, 2023

FP16, without VAD

Thanks. Looking forward to your batch inference in a future WhisperX. I am actually trying to combine it with pyannote diarization. The batch inference that was removed from WhisperX (due to the error-rate problem, I think) was about twice as fast as this FP16 faster-whisper in my tests.

@mrmachine

How do I go about actually dropping in the drop-in replacement faster-whisper? Do I just pip install it before or after installing whisperx?

@yigitkonur

When will you be releasing the API service you mentioned, @m-bain? I'm really looking forward to it!

@RaulKite

Will whisperX take any advantage of this?

https://twitter.com/sanchitgandhi99/status/1649046650793648128?s=46&t=ApbND8sYhhD91NQ3JEdDbA

Whisper JAX ⚡️ is a highly optimised Whisper implementation for both GPU and TPU


@m-bain
Owner

m-bain commented Apr 25, 2023

Will whisperX take any advantage of this?

I found Whisper JAX to use crazy amounts of GPU memory (48GB?), and it also led to worse transcription quality.

Anyway, I am now open-sourcing WhisperX v3 (see the prerelease branch here), which includes the 70x realtime batched inference with <16GB GPU memory, using faster-whisper as a backend. The transcription quality is just as good as the original method.

If you want to try it out, check out the v3 branch and let me know if you run into any issues (still testing):
https://github.com/m-bain/whisperX/tree/v3

I am postponing building the API because it was taking up too much time from my PhD. I will return to it once I have improved the diarization -- a lot of work is needed on that front.

@DavidFarago @RaulKite @dustinjoe @mrmachine @Infinitay @mezaros

m-bain changed the title from "More efficient batch inference resulting in large-v2 with *60-70x REAL TIME speed (not provided in this repo)" to "More efficient batch inference resulting in large-v2 with *60-70x REAL TIME speed (now in custom v3 branch, see comment for details)" Apr 25, 2023
m-bain modified the milestone: MVP with pyannote Apr 25, 2023
m-bain added the enhancement label Apr 25, 2023
@Infinitay

I'm both surprised and very thankful that you've decided to open-source your batch improvements. I look forward to using it, but I hope I won't exceed 10GB since I'm limited by my 3080. On the other hand, sorry to hear that you won't be able to monetize your work with an API due to your research. I hope it all works out well for you. Looking forward to the future of whisperX, as the short-term changes have been incredible so far.

@m-bain
Owner

m-bain commented Apr 25, 2023

I hope I won't exceed 10GB since I'm limited by my 3080

I haven't benchmarked it, but you should be able to get the memory requirements below 10GB with any of the following (see the sketch after this list):

  1. Reduce the batch size (e.g. --batch_size 4, though this will reduce transcription speed to maybe 50x).
  2. Change the faster-whisper compute type: --compute_type int8.
  3. Use a smaller Whisper model, like --model small or --model base.

2 & 3 might reduce transcription quality, though it's worth playing around to see.
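A minimal sketch of how those options combine, assuming the v3 Python API (whisperx.load_audio / load_model / transcribe) behaves as described in the v3 branch; parameter names and defaults may differ between versions:

# Hedged sketch: lower-memory settings for batched whisperX v3 on a ~10GB GPU (e.g. an RTX 3080).
# Assumes the v3 Python API; check the v3 branch README for the exact signatures in your version.
import whisperx

device = "cuda"
audio = whisperx.load_audio("audio.wav")

# Option 2: int8 compute type for the faster-whisper backend (may cost some accuracy).
# Option 3 would be swapping "large-v2" for a smaller model such as "small" or "base".
model = whisperx.load_model("large-v2", device, compute_type="int8")

# Option 1: a smaller batch size lowers peak memory at the cost of some of the 70x speed-up.
result = model.transcribe(audio, batch_size=4)
print(result["segments"])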

Looking forward to the future of whisperX, as the short-term changes have been incredible so far.

Thanks for your kind words, I am glad it has helped you. I will try to figure out over the next few months how best to keep whisperX improving sustainably.

@dustinjoe

dustinjoe commented Apr 26, 2023

Really, thank you for your efforts on this great work! I had a trial of v3 and the batch inference is working properly. Yeah, I totally agree with your opinion on the difficulty of adding diarization to this efficiently. As far as I can see, making 30-second chunks can often mix different speakers' sentences, which makes it really difficult to differentiate them later. So the GPU would not be utilized efficiently without batch inference when diarization is running. A little question: would multiprocessing help somehow, as a temporary solution, for combining ASR and diarization? Thanks

@guillaumekln

Hello @m-bain, it's great to know that you are trying to use faster-whisper for batch execution.

It should work well overall, but there is currently one limitation regarding the prompt tokens. The implementation currently requires that each prompt has the same number of "previous text tokens" (or, put differently, the "start of transcript" token must be at the same position for every item in the batch). I don't know if you have already faced this limitation or if you are able to effectively work around it.

Let me know if there are other issues.

@m-bain
Owner

m-bain commented Apr 28, 2023

@guillaumekln thanks for faster-whisper! I was previously using a custom implementation, but yours really speeds up beam_size>1 and reduces GPU memory requirements 👌🏽

Yes, there are a few limitations / assumptions when doing batched transcription, but transcription quality remained high. These assumptions are:

(i) Transcribe with without_timestamps=True. This is necessary because otherwise Whisper might do multiple forward passes within a 30s sample (delaying the whole batch), and it can also lead to repetition etc.

(ii) Identical prompt tokens, like you say. I find it's not an issue, since --condition_on_previous_text False is the more robust setting when I compare on benchmarks.

Of course, (i) can be quite limiting due to the need for timestamped transcripts, but in WhisperX timestamps are sourced from VAD & wav2vec2 alignment -- from my research findings, Whisper timestamps were just too unreliable.
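As a concrete (hedged) illustration of assumptions (i) and (ii), this is how the two settings look on the standalone faster-whisper API; it is not whisperX's internal batching, just the per-call flags being discussed here:

# Sketch only: the two settings from (i) and (ii), expressed via the standalone
# faster-whisper API (whisperX wraps this backend and handles VAD/alignment itself).
from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

segments, info = model.transcribe(
    "audio.wav",
    beam_size=5,
    without_timestamps=True,           # (i) no in-model timestamps; avoids extra forward passes per 30s window
    condition_on_previous_text=False,  # (ii) no previous-text prompt, so prompts are identical across items
)

for segment in segments:
    print(segment.text)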

@sorgfresser
Contributor

Very surprised and happy to hear you're open-sourcing batch inference. I managed to get whisper.cpp to work with whisperX v2. It was not really a drop-in, but not too many changes had to be made either. Now that you've enabled batch inference, there is no need for any kind of PR for this; instead, this might be the correct thread to share it (if you think this doesn't fit in here, please give me a hint).
For whisper.cpp there does not appear to be any padding necessary (at least my benchmark tells me so), so we can simply remove the rewritten transcribe() and use this:

# run whisper.cpp on the VAD segment's audio (1 processor)
model.context.full_parallel(model.params, seg_audio, 1)
output["segments"].append(
    {
        "start": seg_t["start"],
        "end": seg_t["end"],
        # join the text of every whisper.cpp segment produced for this chunk
        "text": "".join(
            model.context.full_get_segment_text(i)
            for i in range(model.context.full_n_segments())
        ),
    }
)

Optionally, you can add padding by replacing model.context.full_parallel(model.params, seg_audio, 1) with:

# pad/trim the segment to Whisper's fixed 30-second window (N_SAMPLES) before inference
padded_audio = pad_or_trim(seg_audio, N_SAMPLES)
model.context.full_parallel(model.params, padded_audio, 1)

I am using the Python bindings by aarnphm here. When using fine-tuned models, the whole segment part could be broken; I'd advise using single-segment mode in that case.

@DigilConfianz

DigilConfianz commented May 8, 2023

Not sure if this is the right thread, but is it possible to reduce the pyannote diarization time too, by using some logic similar to that of faster-whisper? I.e., using CTranslate2, reducing floating-point precision, some sort of batching, etc.? Currently, diarization takes more time than the transcription itself. @guillaumekln @m-bain

@sorgfresser
Contributor

The quicker fix would certainly be exporting the pyannote model to ONNX. That should speed it up too.
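A rough sketch of what that export could look like, with the caveat that the checkpoint name, input shape, and opset below are my assumptions, and that a full pyannote diarization pipeline (embedding + clustering) involves more than this single model:

# Rough sketch only (assumptions: checkpoint name, 10s mono 16kHz input, opset 17).
# Exports the pyannote segmentation model to ONNX; the embedding/clustering stages
# of a full diarization pipeline are not covered by this single export.
import torch
from pyannote.audio import Model

model = Model.from_pretrained("pyannote/segmentation")  # hypothetical checkpoint choice; may require an auth token
model.eval()

dummy_waveform = torch.randn(1, 1, 16000 * 10)  # (batch, channel, samples)

torch.onnx.export(
    model,
    dummy_waveform,
    "segmentation.onnx",
    opset_version=17,
    input_names=["waveform"],
    output_names=["frames"],
)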

@ozancaglayan

Hi, thanks. Any pointers to the minimal amount of code required to wrap faster-whisper to add support for this? Also, is this batching the VAD segments of a given audio file and disabling --condition_on_previous_text? Or is it segmenting the file with VAD, concatenating it back, and then chunking it into 30-second segments to apply batching?

@m-bain
Owner

m-bain commented May 9, 2023

Hi, thanks. Any pointers to the minimal amount of code required to wrap faster-whisper to add support for this?

@ozancaglayan The main branch does exactly this.

Not sure if this is the right thread, but is it possible to reduce the pyannote diarization time too, by using some logic similar to that of faster-whisper?

@DigilConfianz yes, pyannote is pretty slow.
For the video understanding research projects in our lab, we don't actually use pyannote, but rather https://github.com/JaesungHuh/SimpleDiarization by @JaesungHuh.

It's a lot faster, and we found it effective for dialogue in movie scenes when constraining the diarization to sentence segments. See Appendix Section A (page 13) of https://www.robots.ox.ac.uk/~vgg/publications/2023/Han23/han23.pdf

I will add support for this diarization module at some point.

@MyraBaba

@m-bain is v3 still open-sourced? The link is giving a 404.

@samuelbradshaw

@m-bain is v3 still open-sourced? The link is giving a 404.

I'm not sure, but I think the v3 branch linked above has been merged into the main branch.

@joiemoie

joiemoie commented Nov 3, 2023

Would batching be able to support multiple audio files, such as multiple user requests from Triton?

@mohith7548

I'm looking to transcribe multiple audio files at once with WhisperX - purely batch inference. Can anyone point me in the right direction?
