Converting `tokenizers` tokenizers into `tiktoken` tokenizers #1530
Mmm, not sure, no. I think working on a more efficient version of our BPE tokenizer that does not support word ids etc. would be more worthwhile, TBH.
Proof, please. Here is an example of a good case where the problem was explained properly, which enabled us to solve it. FWIW, there are a lot of libraries claiming to be faster than X, even much faster than tiktoken.
See also this quick benchmark I just ran myself:

```python
import tiktoken
from transformers import GPT2TokenizerFast

tt_tokeniser = tiktoken.encoding_for_model('gpt2')
tok_tokeniser = GPT2TokenizerFast.from_pretrained('gpt2')

text = "OpenAI's `tiktoken` tokeniser benchmarks as faster than Hugging Face's tokeniser."

%timeit tt_tokeniser.encode(text)   # 14 µs ± 139 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
%timeit tok_tokeniser.encode(text)  # 56.5 µs ± 625 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
```
I created a semantic chunking library called […]. The fact that OpenAI created […] … I am sure I am not alone in this.
This would actually be my preference. @Narsil @ArthurZucker If, however, you are not interested in pursuing that at this time, I'm happy to close this issue and work on publishing a faster implementation myself that relies on `tiktoken`.
I am actually interested in diving a bit deeper into the potential reasons why we are slower, and updating our implementation accordingly as long as we don't break anything, or otherwise providing a new, faster BPE. As @Narsil mentioned, benchmarks are tricky to get right, and I am surprised that the thread count does not help tokenizers much in the provided bench. I'll see what I can do once I have a bit less on my plate!
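For what it's worth, a batch-level comparison might say more about the thread-count question than the single-string timing above, which mostly measures per-call overhead. Here is a minimal sketch of such a benchmark (mine, not from the thread; the batch size and thread count are arbitrary):

```python
import tiktoken
from transformers import GPT2TokenizerFast

tt_tokeniser = tiktoken.encoding_for_model('gpt2')
tok_tokeniser = GPT2TokenizerFast.from_pretrained('gpt2')

# A batch of identical short strings, purely to give both libraries something to parallelise over.
texts = ["OpenAI's `tiktoken` tokeniser benchmarks as faster than Hugging Face's tokeniser."] * 10_000

%timeit tt_tokeniser.encode_ordinary_batch(texts, num_threads=8)  # tiktoken threads over the batch
%timeit tok_tokeniser(texts)  # the Rust backend can parallelise batches unless TOKENIZERS_PARALLELISM=false
```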
`tiktoken` is supposed to be much faster than `tokenizers` for BPE tokenizers. It would be convenient if we could take advantage of the interoperability and ease of training of `tokenizers` tokenizers like, say, `RobertaTokenizerFast`, and transform them into `tiktoken` tokenizers for faster tokenization. Is this something the team is open to? It seems technically possible, but I do not yet know enough about `tiktoken` to implement a converter myself.