PreTrainedTokenizerFast._batch_encode_plus() got an unexpected keyword argument 'split_special_tokens' #30685

Closed
fahadh4ilyas opened this issue May 7, 2024 · 3 comments · Fixed by #30772
Labels
Core: Tokenization Internals of the library; Tokenization.

Comments

fahadh4ilyas commented May 7, 2024

System Info

Transformers version: 4.38.1
Platform: Ubuntu
Python version: 3.10.13

Who can help?

@ArthurZucker @younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoTokenizer

access_token = 'YOUR_ACCESS_TOKEN'

tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3-8B', token=access_token)

# Raises: TypeError: PreTrainedTokenizerFast._batch_encode_plus() got an unexpected keyword argument 'split_special_tokens'
print(tokenizer('Here is an example of bos token: <|begin_of_text|>', split_special_tokens=True))

Expected behavior

The call should return this:

{'input_ids': [128000, 8586, 374, 459, 3187, 315, 43746, 4037, 25, 83739, 7413, 3659, 4424, 91, 29], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
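
A quick sanity check (a sketch, not part of the original report): decoding the expected ids should reproduce the prompt, with the prepended BOS, showing that the in-text `<|begin_of_text|>` was split into ordinary sub-tokens rather than mapped to its single special-token id 128000.

# Sketch: reuses the `tokenizer` loaded in the reproduction above.
expected_ids = [128000, 8586, 374, 459, 3187, 315, 43746, 4037, 25, 83739,
                7413, 3659, 4424, 91, 29]
print(tokenizer.decode(expected_ids))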
amyeroberts added the Core: Tokenization label May 7, 2024
NielsRogge (Contributor) commented:

Hi,

This flag is only supported for slow tokenizers. See:

split_special_tokens (`bool`, *optional*, defaults to `False`):
    Whether or not the special tokens should be split during the tokenization process. The default behavior is
    to not split special tokens. This means that if `<s>` is the `bos_token`, then `tokenizer.tokenize("<s>")` =
    `['<s>']`. Otherwise, if `split_special_tokens=True`, then `tokenizer.tokenize("<s>")` will give `['<',
    's', '>']`. This argument is only supported for `slow` tokenizers for the moment.
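
Until the fast path supports it, a minimal sketch of the slow-tokenizer route (assuming a checkpoint that ships a pure-Python tokenizer; `bert-base-uncased` is used purely as an illustration, and the exact sub-tokens may differ):

from transformers import AutoTokenizer

# Sketch: the slow (pure-Python) tokenizer accepts split_special_tokens.
# 'bert-base-uncased' is only an illustrative checkpoint that ships one.
slow_tok = AutoTokenizer.from_pretrained('bert-base-uncased', use_fast=False)

# Default: the special token is kept as a single piece, e.g. ['[SEP]'].
print(slow_tok.tokenize('[SEP]'))

# With split_special_tokens=True it is treated as ordinary text,
# e.g. something like ['[', 'sep', ']'].
print(slow_tok.tokenize('[SEP]', split_special_tokens=True))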

ArthurZucker (Collaborator) commented:

#28648 should fix this.

ArthurZucker (Collaborator) commented:

cc @itazap if you can take it over! 🤗
