Treatment of hyphenated words #1507

Closed
rattle99 opened this issue Apr 19, 2024 · 2 comments
@rattle99

It seems that Hugging Face tokenizers treat the parts of a hyphenated word as separate words, and count the hyphen itself as a word too, as far as the word_ids() function is concerned.

For example, in the sentence

'To win the money , SpaceShipOne had to blast off into space twice in a two-week period and fly at least 100 kilometers above Earth .'

Using

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

text = "To win the money , SpaceShipOne had to blast off into space twice in a two-week period and fly at least 100 kilometers above Earth ."

# Tokenize the raw string and print the word index assigned to each token
text_tokenized = tokenizer(text, padding='longest', truncation=True, return_tensors="pt", is_split_into_words=False)
print(text_tokenized.word_ids())

returns
[None, 0, 1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, None]

but changing two-week to twoweek changes the output to
[None, 0, 1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, None]

The same behavior may occur for symbols other than the hyphen, which is worth keeping in mind for NER tasks, where labels are usually given per whitespace-separated word. Or perhaps this is by design?
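
For anyone who needs whitespace-level word indices for NER label alignment, here is a minimal sketch of one way to recover them from character offsets. It uses only the standard library. The helper name `whitespace_word_ids` is hypothetical, and the offsets below are written by hand as a stand-in for what a fast tokenizer would return with `return_offsets_mapping=True`, where special tokens get the empty span (0, 0). Another option that should work is to pass `text.split()` together with `is_split_into_words=True`, so that word_ids() indexes into the supplied list.

```python
from typing import List, Optional, Tuple


def whitespace_word_ids(text: str, offsets: List[Tuple[int, int]]) -> List[Optional[int]]:
    """Map each token's (start, end) character span to the index of the
    whitespace-delimited word containing it. Empty spans map to None."""
    # Build a lookup table: character position -> whitespace word index.
    char_to_word: List[Optional[int]] = [None] * len(text)
    word_index = -1
    in_word = False
    for pos, ch in enumerate(text):
        if ch.isspace():
            in_word = False
        else:
            if not in_word:
                word_index += 1
                in_word = True
            char_to_word[pos] = word_index
    # A token belongs to the word that owns its first character.
    return [char_to_word[start] if end > start else None for start, end in offsets]


text = "a two-week period"
# Hand-written spans (an assumption, not real tokenizer output) for:
# [CLS], "a", "two", "-", "week", "period", [SEP]
offsets = [(0, 0), (0, 1), (2, 5), (5, 6), (6, 10), (11, 17), (0, 0)]
print(whitespace_word_ids(text, offsets))  # [None, 0, 1, 1, 1, 2, None]
```

With this mapping, "two", "-" and "week" all share index 1, so a single label for "two-week" can be applied to all three tokens.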


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label May 20, 2024
github-actions bot closed this as not planned May 26, 2024
@ArthurZucker
Collaborator

Hey, I think this is heavily related to the design of BERT.
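
To make the design point concrete: BERT-style pre-tokenization splits on whitespace and then treats every punctuation character as its own piece, and word_ids() on a raw string numbers those pieces rather than whitespace-separated words. The sketch below is a plain-Python approximation of that rule, not the library's actual implementation. The function names are made up for illustration, and lowercasing, accent stripping and WordPiece splitting are deliberately left out.

```python
import unicodedata
from typing import List


def is_punctuation(ch: str) -> bool:
    """BERT-style test: ASCII symbol ranges plus Unicode 'P*' categories."""
    cp = ord(ch)
    if 33 <= cp <= 47 or 58 <= cp <= 64 or 91 <= cp <= 96 or 123 <= cp <= 126:
        return True
    return unicodedata.category(ch).startswith("P")


def bert_like_pre_tokenize(text: str) -> List[str]:
    """Split on whitespace, then isolate each punctuation character."""
    pieces: List[str] = []
    for chunk in text.split():
        current = ""
        for ch in chunk:
            if is_punctuation(ch):
                # Flush the letters collected so far, then emit the symbol alone.
                if current:
                    pieces.append(current)
                    current = ""
                pieces.append(ch)
            else:
                current += ch
        if current:
            pieces.append(current)
    return pieces


print(bert_like_pre_tokenize("in a two-week period"))
# ['in', 'a', 'two', '-', 'week', 'period']
print(bert_like_pre_tokenize("in a twoweek period"))
# ['in', 'a', 'twoweek', 'period']
```

Applied to the full sentence from the report, this rule yields 28 pieces with "two-week" and 26 with "twoweek", which matches the highest word indices (27 and 25) in the two outputs above.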
