Suppression of sequences of tokens as opposed to Single Tokens #127

filmo · 2023-11-12T01:01:03Z

I see that it's possible to suppress single tokens using the 'suppress_tokens' option to whisper.transcribe(), as is already done with the output of find_numeral_symbol_tokens():

          suppress_tokens: List of token IDs to suppress. -1 will suppress a default set
            of symbols as defined in the model config.json file.

Is it possible to define particular sequences of tokens to suppress? In other words, the tokens would be suppressed only when they appear together in that exact order; a token from the sequence that appears on its own, outside the sequence, would still be allowed.

Example: Let's suppress the speech mannerism ', you know,', a common filler phrase in spoken American English that generally carries no semantic content and is typically omitted in human transcription (unless specifically requested).

I, you know, want to eat dinner.

becomes:

I want to eat dinner.

In the above example, the sequence of tokens to suppress would be [',', 'you', 'know', ','] => [11, 5616, 15869, 11], but obviously I would not want to suppress those tokens individually, only when they appear together as the filler phrase.

This could also be used for suppressing vulgarities that are composed of words that are individually innocuous.
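To make the request concrete, here is a minimal sketch of what sequence-level suppression could look like at a single decoding step. It is not part of Whisper or of this repository: `tokens_banned_next` is a hypothetical helper, and the IDs 100 and 200 are arbitrary placeholders for surrounding tokens. The idea resembles the `bad_words_ids` option in Hugging Face transformers generation: a token is banned only when the tokens generated so far already end with the rest of the sequence.

```python
from typing import Iterable, Sequence, Set


def tokens_banned_next(
    generated: Sequence[int],
    banned_sequences: Iterable[Sequence[int]],
) -> Set[int]:
    """Return the token IDs that would complete a banned sequence
    if emitted at the next decoding step (hypothetical helper)."""
    banned: Set[int] = set()
    for seq in banned_sequences:
        if not seq:
            continue  # ignore empty sequences
        prefix, last = list(seq[:-1]), seq[-1]
        # A one-token sequence is always banned; otherwise the tail of
        # the generated tokens must equal the sequence minus its last token.
        if not prefix or list(generated[-len(prefix):]) == prefix:
            banned.add(last)
    return banned


# Token IDs for ', you know,' as given in the post above.
FILLER = [11, 5616, 15869, 11]

# 100 is a placeholder ID standing in for the token before the filler.
print(tokens_banned_next([100, 11, 5616, 15869], [FILLER]))  # {11}
print(tokens_banned_next([100, 11, 5616], [FILLER]))         # set()
```

Note that this only blocks the final token of the sequence, so the already-emitted ', you know' prefix would remain in the output. Removing the whole phrase would require either backtracking during decoding or filtering afterwards.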

MahmoudAshraf97 · 2023-11-24T00:47:44Z

Maintainer

That should be possible, but with a huge computational penalty: Whisper decodes one token after another, and suppression is applied during the decoding of each single token. My idea is to implement a function that keeps checking the transcript for the suppressed sequence and, if it is detected, rewinds the decoding to right before the sequence and resumes with a temporarily modified suppression list. Another, simpler solution is to filter these sequences out of the final token sequence, which IMO works better than the first one.
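The simpler option, filtering the final token sequence, could be sketched roughly as follows. This is an illustration under assumptions rather than code from the repository: `remove_sequences` is a hypothetical helper, the token IDs for ', you know,' are the ones quoted in the question, and 100 and 200 are arbitrary placeholder IDs.

```python
from typing import List, Sequence


def remove_sequences(
    tokens: Sequence[int],
    banned_sequences: Sequence[Sequence[int]],
) -> List[int]:
    """Return a copy of `tokens` with every occurrence of a banned
    sequence removed (hypothetical post-processing helper)."""
    # Try longer sequences first so a short one cannot pre-empt a long one.
    ordered = sorted(
        (list(s) for s in banned_sequences if s), key=len, reverse=True
    )
    out: List[int] = []
    i = 0
    while i < len(tokens):
        for seq in ordered:
            if list(tokens[i:i + len(seq)]) == seq:
                i += len(seq)  # skip the whole matched sequence
                break
        else:
            # No banned sequence starts here; keep the token.
            out.append(tokens[i])
            i += 1
    return out


FILLER = [11, 5616, 15869, 11]  # ', you know,' per the question above

# 100 and 200 are placeholder IDs; the trailing 5616 ('you') is kept
# because it appears outside the filler phrase.
print(remove_sequences([100, 11, 5616, 15869, 11, 200, 5616], [FILLER]))
# [100, 200, 5616]
```

Because this runs once after decoding, it adds no cost to the decoding loop itself. Anything derived from the original token positions, such as timestamps, would presumably need to be adjusted separately.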
