Llama3 tokenizer with Incorrect offset_mapping #1517

Closed
justin-shao opened this issue Apr 27, 2024 · 2 comments

@justin-shao

When tokenizing with the Llama 3 tokenizer and return_offsets_mapping=True, the resulting offset_mapping does not match the behavior described in the docs.

Example:

from transformers import AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, padding_side="left")
print(tokenizer(["Sample input"], return_offsets_mapping=True))

will yield:

{'input_ids': [[128000, 18031, 1988]], 'attention_mask': [[1, 1, 1]], 'offset_mapping': [[(0, 0), (0, 0), (6, 6)]]}

offset_mapping should contain a (char_start, char_end) tuple for each token.
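
One quick way to spot the problem in the output above is to look for non-special tokens whose span is empty. The following is a minimal sketch, not part of transformers: `zero_width_tokens` is a hypothetical helper, and it assumes ID 128000 is the Llama 3 beginning-of-text special token, for which a (0, 0) span is normal.

```python
def zero_width_tokens(input_ids, offsets, special_ids):
    """Return positions of non-special tokens whose span is empty (start == end).

    Hypothetical helper for illustration only; not a transformers API.
    """
    return [
        i
        for i, (tok_id, (start, end)) in enumerate(zip(input_ids, offsets))
        if tok_id not in special_ids and start == end
    ]


# Values copied from the output reported in this issue.
input_ids = [128000, 18031, 1988]
offsets = [(0, 0), (0, 0), (6, 6)]

# Assumption: 128000 is the beginning-of-text special token.
bad_positions = zero_width_tokens(input_ids, offsets, special_ids={128000})
print(bad_positions)  # [1, 2]
```

Both text tokens are flagged, which is the mismatch described here: neither span covers any characters of "Sample input".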

@ArthurZucker
Collaborator

Hey! This seems to be expected, no? The documentation might be wrong, but there are no offsets here (trim_offsets is set to False, I think):
['Sample', 'Ġinput'] are the two tokens.
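
For reference, the character spans one might expect for these two tokens can be worked out by hand. The sketch below is an illustration under stated assumptions, not the tokenizer's actual implementation: `expected_offsets` is a hypothetical helper, the input is assumed to be plain ASCII, and 'Ġ' is assumed to stand for a leading space, as in byte-level BPE vocabularies. With trimming off, the leading space counts as part of the token's span; with trimming on, it is excluded.

```python
def expected_offsets(text, tokens, trim_offsets=False):
    """Derive (char_start, char_end) spans for byte-level BPE token strings.

    Hypothetical helper for illustration; assumes ASCII text and that
    a leading 'Ġ' in a token represents a single space.
    """
    offsets = []
    cursor = 0
    for tok in tokens:
        piece = tok.replace("Ġ", " ")
        start = text.index(piece, cursor)  # locate the piece at or after the cursor
        end = start + len(piece)
        cursor = end
        if trim_offsets:
            # Exclude leading spaces from the reported span.
            start += len(piece) - len(piece.lstrip(" "))
        offsets.append((start, end))
    return offsets


text = "Sample input"
tokens = ["Sample", "Ġinput"]

print(expected_offsets(text, tokens))                     # [(0, 6), (6, 12)]
print(expected_offsets(text, tokens, trim_offsets=True))  # [(0, 6), (7, 12)]
```

In either case the spans are non-empty, unlike the (0, 0) and (6, 6) pairs in the reported output.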

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label May 31, 2024
@github-actions github-actions bot closed this as not planned Jun 6, 2024