It seems Hugging Face tokenizers treat hyphenated words as separate words, including the hyphen itself, as reflected in the `word_ids()` function.
For example, in the sentence
'To win the money , SpaceShipOne had to blast off into space twice in a two-week period and fly at least 100 kilometers above Earth .'
Using

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
text = "To win the money , SpaceShipOne had to blast off into space twice in a two-week period and fly at least 100 kilometers above Earth ."
text_tokenized = tokenizer(text, padding='longest', truncation=True, return_tensors="pt", is_split_into_words=False)
print(text_tokenized.word_ids())
```
returns

```
[None, 0, 1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, None]
```

but changing `two-week` to `twoweek` changes it to

```
[None, 0, 1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, None]
```
This behavior may occur for symbols other than the hyphen as well, which is worth keeping in mind when doing NER tasks. Or perhaps this is by design?
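One practical consequence for NER: labels are usually assigned per word, so if the tokenizer counts `two`, `-`, and `week` as three separate words, the per-word label list must match that word count before aligning labels to tokens via `word_ids()`. A minimal sketch of the usual alignment pattern (the helper name and label values here are illustrative, not from the original report):

```python
def align_labels_with_tokens(word_ids, word_labels):
    """Map per-word NER labels onto tokenizer tokens.

    word_ids: one entry per token, as returned by BatchEncoding.word_ids()
        -- None for special tokens ([CLS]/[SEP]), otherwise the index of
        the word the token came from.
    word_labels: one label per *word*, where "word" means the tokenizer's
        notion of a word (so `two`, `-`, `week` are three words).
    Returns one label per token; special tokens get -100 so a token
    classification loss ignores them.
    """
    aligned = []
    for wid in word_ids:
        if wid is None:
            aligned.append(-100)       # special token, no label
        else:
            aligned.append(word_labels[wid])
    return aligned

# Hypothetical word_ids for "[CLS] a two - week period [SEP]", where the
# hyphenated word was split into three separate words (ids 1, 2, 3):
word_ids = [None, 0, 1, 2, 3, 4, None]
word_labels = ["O", "B-DUR", "I-DUR", "I-DUR", "O"]  # one per tokenizer word
print(align_labels_with_tokens(word_ids, word_labels))
# -> [-100, 'O', 'B-DUR', 'I-DUR', 'I-DUR', 'O', -100]
```

If the labels had instead been written assuming `two-week` is a single word (three labels: `a`, `two-week`, `period`), indexing by these word ids would either misalign or raise an `IndexError`, which is exactly the pitfall the hyphen behavior creates.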