PreTrainedTokenizerFast._batch_encode_plus() got an unexpected keyword argument 'split_special_tokens' #30685

Closed
fahadh4ilyas opened this issue May 7, 2024 · 3 comments · Fixed by #30772
Labels
Core: Tokenization Internals of the library; Tokenization.

Comments

fahadh4ilyas commented May 7, 2024

System Info

Transformers version: 4.38.1
Platform: Ubuntu
Python version: 3.10.13

Who can help?

@ArthurZucker @younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoTokenizer

access_token = 'YOUR_ACCESS_TOKEN'

tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3-8B', token=access_token)

# Raises: TypeError: PreTrainedTokenizerFast._batch_encode_plus() got an unexpected keyword argument 'split_special_tokens'
print(tokenizer('Here is an example of bos token: <|begin_of_text|>', split_special_tokens=True))

Expected behavior

The call should return this:

{'input_ids': [128000, 8586, 374, 459, 3187, 315, 43746, 4037, 25, 83739, 7413, 3659, 4424, 91, 29], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
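
A quick sanity check (a sketch, not part of the original report): decoding the expected ids should reproduce the prompt, with the prepended BOS, showing that the in-text `<|begin_of_text|>` was split into ordinary sub-tokens rather than mapped to its single special-token id 128000.

# Sketch: reuses the `tokenizer` loaded in the reproduction above.
expected_ids = [128000, 8586, 374, 459, 3187, 315, 43746, 4037, 25, 83739,
                7413, 3659, 4424, 91, 29]
print(tokenizer.decode(expected_ids))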
amyeroberts added the Core: Tokenization label May 7, 2024
NielsRogge (Contributor) commented:

Hi,

This flag is only supported for slow tokenizers. See:

split_special_tokens (`bool`, *optional*, defaults to `False`):
    Whether or not the special tokens should be split during the tokenization process. The default behavior is
    to not split special tokens. This means that if `<s>` is the `bos_token`, then `tokenizer.tokenize("<s>")` =
    `['<s>']`. Otherwise, if `split_special_tokens=True`, then `tokenizer.tokenize("<s>")` will give `['<',
    's', '>']`. This argument is only supported for `slow` tokenizers for the moment.
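
Until the fast path supports it, a minimal sketch of the slow-tokenizer route (assuming a checkpoint that ships a pure-Python tokenizer; `bert-base-uncased` is used purely as an illustration, and the exact sub-tokens may differ):

from transformers import AutoTokenizer

# Sketch: the slow (pure-Python) tokenizer accepts split_special_tokens.
# 'bert-base-uncased' is only an illustrative checkpoint that ships one.
slow_tok = AutoTokenizer.from_pretrained('bert-base-uncased', use_fast=False)

# Default: the special token is kept as a single piece, e.g. ['[SEP]'].
print(slow_tok.tokenize('[SEP]'))

# With split_special_tokens=True it is treated as ordinary text,
# e.g. something like ['[', 'sep', ']'].
print(slow_tok.tokenize('[SEP]', split_special_tokens=True))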

ArthurZucker (Collaborator) commented:

#28648 should fix this.

ArthurZucker (Collaborator) commented:

cc @itazap if you can take it over! 🤗
