Llama3 tokenizer with Incorrect offset_mapping #1517

justin-shao · 2024-04-27T01:33:56Z

When tokenizing with the Llama 3 tokenizer and return_offsets_mapping=True, the resulting offset_mapping does not match the behavior described in the docs.

Example:

from transformers import AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, padding_side="left")
print(tokenizer(["Sample input"], return_offsets_mapping=True))

will yield:

{'input_ids': [[128000, 18031, 1988]], 'attention_mask': [[1, 1, 1]], 'offset_mapping': [[(0, 0), (0, 0), (6, 6)]]}

offset_mapping should contain a (char_start, char_end) tuple for each token.
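To make the mismatch concrete, here is a minimal sketch in plain Python (no tokenizer needed) that slices the input text with each offset pair and flags non-special tokens whose span is empty. The helper names `spans` and `empty_token_spans` are hypothetical, and the `hoped_for` offsets are an assumption about what character spans would look like if the leading space were counted as part of the second token; they are not output from the library.

```python
from typing import List, Sequence, Tuple

Offset = Tuple[int, int]


def spans(text: str, offsets: Sequence[Offset]) -> List[str]:
    """Return the substring of `text` covered by each (char_start, char_end) pair."""
    return [text[start:end] for start, end in offsets]


def empty_token_spans(offsets: Sequence[Offset], special_mask: Sequence[int]) -> List[int]:
    """Return indices of non-special tokens whose span has zero length."""
    return [
        i
        for i, ((start, end), special) in enumerate(zip(offsets, special_mask))
        if not special and start == end
    ]


text = "Sample input"
# 1 marks a special token; assumed here to be only the leading token (id 128000).
special_mask = [1, 0, 0]

# Offsets copied from the output reported in this issue.
reported = [(0, 0), (0, 0), (6, 6)]
# Hypothetical offsets (an assumption, not library output): the leading
# space is counted as part of the second token.
hoped_for = [(0, 0), (0, 6), (6, 12)]

print(spans(text, reported))                       # every slice is empty
print(empty_token_spans(reported, special_mask))   # tokens 1 and 2 are flagged
print(spans(text, hoped_for))                      # slices cover the whole input
print(empty_token_spans(hoped_for, special_mask))  # nothing is flagged
```

A check like this can be dropped into a test to catch zero-length spans before they reach code that relies on offsets, such as span labeling.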


ArthurZucker · 2024-04-30T15:15:04Z

Hey! This seems to be expected, no? The documentation might be wrong, but there are no offsets here (I think trim_offsets is set to False):
['Sample', 'Ġinput'] are the two tokens
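As a rough illustration only (not the library's algorithm), the character span of each of those two tokens can be derived by accumulating token lengths. The sketch assumes ASCII input, where every symbol of a byte-level token corresponds to one character and `Ġ` stands for a leading space; `offsets_from_tokens` is a hypothetical helper.

```python
from typing import List, Tuple


def offsets_from_tokens(tokens: List[str]) -> List[Tuple[int, int]]:
    """Accumulate token lengths into (char_start, char_end) pairs.

    Assumes ASCII text, so each symbol in a byte-level token maps to
    exactly one character of the original string.
    """
    offsets = []
    cursor = 0
    for token in tokens:
        end = cursor + len(token)
        offsets.append((cursor, end))
        cursor = end
    return offsets


tokens = ["Sample", "Ġinput"]
# "Ġ" is the byte-level marker for a space, so this restores the input text.
text = "".join(tokens).replace("Ġ", " ")
offsets = offsets_from_tokens(tokens)

print(text)
print(offsets)
```

Under these assumptions the second span begins at the space. A tokenizer configured to trim offsets would instead begin it at the first letter of "input".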

github-actions · 2024-05-31T01:50:48Z

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label on May 31, 2024

github-actions bot closed this as not planned on Jun 6, 2024
