It seems Hugging Face tokenizers treat hyphenated words as separate words, including the hyphen itself, as reflected in the `word_ids()` function.
For example, in the sentence
'To win the money , SpaceShipOne had to blast off into space twice in a two-week period and fly at least 100 kilometers above Earth .'
Using

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
text = "To win the money , SpaceShipOne had to blast off into space twice in a two-week period and fly at least 100 kilometers above Earth ."
text_tokenized = tokenizer(text, padding='longest', truncation=True, return_tensors="pt", is_split_into_words=False)
print(text_tokenized.word_ids())
```
returns

```
[None, 0, 1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, None]
```

but changing `two-week` to `twoweek` changes it to

```
[None, 0, 1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, None]
```
This behavior may occur for symbols other than the hyphen as well, which is worth keeping in mind when doing NER tasks. Or perhaps this is by design?
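One practical consequence for NER: labels are usually assigned per word, so if the tokenizer counts `two`, `-`, and `week` as three separate words, the per-word label list must match that word count before aligning labels to tokens via `word_ids()`. A minimal sketch of the usual alignment pattern (the helper name and label values here are illustrative, not from the original report):

```python
def align_labels_with_tokens(word_ids, word_labels):
    """Map per-word NER labels onto tokenizer tokens.

    word_ids: one entry per token, as returned by BatchEncoding.word_ids()
        -- None for special tokens ([CLS]/[SEP]), otherwise the index of
        the word the token came from.
    word_labels: one label per *word*, where "word" means the tokenizer's
        notion of a word (so `two`, `-`, `week` are three words).
    Returns one label per token; special tokens get -100 so a token
    classification loss ignores them.
    """
    aligned = []
    for wid in word_ids:
        if wid is None:
            aligned.append(-100)       # special token, no label
        else:
            aligned.append(word_labels[wid])
    return aligned

# Hypothetical word_ids for "[CLS] a two - week period [SEP]", where the
# hyphenated word was split into three separate words (ids 1, 2, 3):
word_ids = [None, 0, 1, 2, 3, 4, None]
word_labels = ["O", "B-DUR", "I-DUR", "I-DUR", "O"]  # one per tokenizer word
print(align_labels_with_tokens(word_ids, word_labels))
# -> [-100, 'O', 'B-DUR', 'I-DUR', 'I-DUR', 'O', -100]
```

If the labels had instead been written assuming `two-week` is a single word (three labels: `a`, `two-week`, `period`), indexing by these word ids would either misalign or raise an `IndexError`, which is exactly the pitfall the hyphen behavior creates.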