Converting tokenizers tokenizers into tiktoken tokenizers #1530

Open
umarbutler opened this issue May 13, 2024 · 4 comments

@umarbutler

umarbutler commented May 13, 2024

tiktoken is supposed to be much faster than tokenizers for BPE tokenizers. It would be convenient if we could take advantage of the interoperability and ease of training of tokenizers tokenizers like, say, RobertaTokenizerFast, and transform them into tiktoken tokenizers for faster tokenization.

Is this something the team is open to? It seems technically possible but I do not yet know enough about tiktoken to implement a converter myself.
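To make the idea concrete, here is a rough sketch of my own (not an existing converter, and only covering the GPT-2-style byte-level BPE case) of how a tokenizers tokenizer from transformers could be turned into a tiktoken.Encoding by inverting GPT-2's byte-to-unicode table and reusing token ids as merge ranks:

import tiktoken
from transformers import GPT2TokenizerFast

def bytes_to_unicode() -> dict[int, str]:
    """The reversible byte <-> unicode table used by GPT-2's byte-level BPE."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

hf_tokeniser = GPT2TokenizerFast.from_pretrained('gpt2')
unicode_to_byte = {c: b for b, c in bytes_to_unicode().items()}

# Map each vocabulary entry back to raw bytes; its token id becomes its merge rank.
mergeable_ranks = {
    bytes(unicode_to_byte[char] for char in token): rank
    for token, rank in hf_tokeniser.get_vocab().items()
    if token not in hf_tokeniser.all_special_tokens
}

converted = tiktoken.Encoding(
    name='gpt2-from-transformers',  # arbitrary name for the converted encoding
    pat_str=r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""",
    mergeable_ranks=mergeable_ranks,
    special_tokens={'<|endoftext|>': hf_tokeniser.get_vocab()['<|endoftext|>']},
)

text = "OpenAI's `tiktoken` tokeniser benchmarks as faster than Hugging Face's tokeniser."
print(converted.encode(text) == hf_tokeniser.encode(text))  # should print True for GPT-2

A general-purpose converter would also have to handle non-GPT-2 pre-tokenisation regexes, added tokens and normalisers, which is where I suspect the real work lies.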

@ArthurZucker
Collaborator

Mmm, not sure, no. I think working on a more efficient version of our BPE tokenizer that does not support word ids etc. would be more worthwhile TBH

@Narsil
Collaborator

Narsil commented May 15, 2024

tiktoken is supposed to be much faster than tokenizers for BPE tokenizers.

Proof please.
Also proof that the difference in speed is actually relevant in real world use cases.
If tokenization takes 0.1% of the time in real workloads (which it kind of is in a lot of current LLM scenarios, but I may not be understanding all use cases), then even infinitely faster speedups are kind of irrelevant.

Here is an example of a good case where the problem was explained properly, which enabled us to solve it.
#1413 (comment)

Fwiw, there are a lot of libraries claiming to be faster than X, some even much faster than tiktoken.

@umarbutler
Author

umarbutler commented May 17, 2024

Proof please.

[Benchmark chart from the tiktoken GitHub repository comparing tiktoken's and Hugging Face's throughput across thread counts.]
Source: tiktoken GitHub repository.

See also this quick benchmark I just ran myself:

import tiktoken

from transformers import GPT2TokenizerFast

tt_tokeniser = tiktoken.encoding_for_model('gpt2')
tok_tokeniser = GPT2TokenizerFast.from_pretrained('gpt2')

text = "OpenAI's `tiktoken` tokeniser benchmarks as faster than Hugging Face's tokeniser."

%timeit tt_tokeniser.encode(text) # 14 µs ± 139 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
%timeit tok_tokeniser.encode(text) # 56.5 µs ± 625 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Also proof that the difference in speed is actually relevant in real world use cases.
If tokenization takes 0.1% of the time in real workloads (which it kind of is in a lot of current LLM scenarios, but I may not be understanding all use cases), then even infinitely faster speedups are kind of irrelevant.

I created a semantic chunking library called semchunk and the biggest bottleneck right now is the tokeniser, because chunking requires repeatedly counting the number of tokens in texts. This is just one use case, and when I am chunking extremely large text datasets (10GB+), it adds up very quickly.
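To illustrate the shape of that hot path, here is a simplified sketch (not semchunk's actual implementation): the token counter is called once per candidate split, so its speed dominates the whole pipeline.

import tiktoken

enc = tiktoken.encoding_for_model('gpt2')

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

def chunk(text: str, chunk_size: int) -> list[str]:
    """Greedily pack sentences into chunks of at most `chunk_size` tokens."""
    chunks, current = [], ''
    for sentence in text.split('. '):
        candidate = f'{current}. {sentence}' if current else sentence
        if count_tokens(candidate) <= chunk_size:  # tokens are re-counted for every sentence
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks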

The fact that OpenAI created tiktoken in the first place suggests that Hugging Face's tokenisers were missing something they felt was necessary. The fact that they published benchmarks of the tokeniser's speed further suggests that tokenisation speed is meaningful to them.

I am sure I am not alone in this.

Mmm, not sure, no. I think working on a more efficient version of our BPE tokenizer that does not support word ids etc. would be more worthwhile TBH

This would actually be my preference. @Narsil @ArthurZucker If, however, you are not interested in pursuing that at this time, I'm happy to close this issue and work on publishing a faster implementation myself that relies on tiktoken.

@ArthurZucker
Collaborator

I am actually interested in taking a deep dive into the potential reasons why we are slower and updating our implementation accordingly, as long as we don't break anything, and otherwise shipping a new, faster BPE. As @Narsil mentioned, benchmarks are tricky to get right, and I am surprised that the thread count does not help tokenizers much in the provided bench.

I'll see what I can do once I have a bit less on my plate!
FYI @itazap
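For what it's worth, as far as I understand, tokenizers only parallelises across a batch (via encode_batch), so a per-string %timeit won't show any benefit from extra threads. A batch-level comparison along these lines might be more telling (a quick sketch, assuming both libraries' batch APIs and a repeated toy input):

import os
os.environ['TOKENIZERS_PARALLELISM'] = 'true'  # let tokenizers use its thread pool

import time
import tiktoken
from transformers import GPT2TokenizerFast

tt_tokeniser = tiktoken.encoding_for_model('gpt2')
tok_tokeniser = GPT2TokenizerFast.from_pretrained('gpt2')

texts = ["OpenAI's `tiktoken` tokeniser benchmarks as faster than Hugging Face's tokeniser."] * 10_000

start = time.perf_counter()
tt_tokeniser.encode_ordinary_batch(texts, num_threads=8)  # tiktoken's threaded batch encoder
print(f'tiktoken batch:   {time.perf_counter() - start:.3f}s')

start = time.perf_counter()
tok_tokeniser.backend_tokenizer.encode_batch(texts)  # tokenizers' parallel batch encoder
print(f'tokenizers batch: {time.perf_counter() - start:.3f}s')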
