anntoconll : fix word with accents being split by tokenization #1307

Open
wants to merge 1 commit into
base: master


@Marny30 Marny30 commented Feb 8, 2019

In the current version, the anntoconll tool splits a word containing accented characters into several tokens, isolating each accented character as if it were a separate word. This is a problem when working with European languages such as French or Spanish, or with German and its ß.

For example, the text "déjà fait" is split into the tokens "d", "é", "j", "à", "fait" instead of "déjà", "fait".

Adding the accented-character range À-ÿ to the tokenization regex fixes this tokenization issue.
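
To illustrate the behaviour described above, here is a minimal sketch. It assumes the tokenizer splits on a pattern of the form "a run of alphanumerics, or any single other character"; the two pattern names and the `tokenize` helper are hypothetical and only mirror the idea, not necessarily the exact code in anntoconll. The range À-ÿ is written as `\u00C0-\u00FF` to avoid encoding ambiguity. Note that this range also contains the two non-letter characters × (U+00D7) and ÷ (U+00F7), which would therefore be treated as word characters too.

```python
import re

# Hypothetical "before" pattern: only ASCII letters and digits form a word.
ASCII_ONLY = re.compile(r'([0-9a-zA-Z]+|[^0-9a-zA-Z])')

# Hypothetical "after" pattern: the range À-ÿ (U+00C0 to U+00FF) is added
# to both character classes, so accented letters stay inside the word.
WITH_ACCENTS = re.compile(r'([0-9a-zA-Z\u00C0-\u00FF]+|[^0-9a-zA-Z\u00C0-\u00FF])')


def tokenize(pattern, text):
    """Split text on the pattern, keeping matches and dropping blank pieces."""
    return [piece for piece in pattern.split(text) if piece.strip()]


if __name__ == "__main__":
    sample = "d\u00e9j\u00e0 fait"  # "déjà fait"
    print(tokenize(ASCII_ONLY, sample))
    print(tokenize(WITH_ACCENTS, sample))
```

With the first pattern each accented character becomes its own token; with the second, "déjà" stays whole. Characters outside U+00C0 to U+00FF (for example œ or ő) are still split under this sketch.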

@Marny30 Marny30 changed the title fix word with accents being split by tokenization anntoconnl : fix word with accents being split by tokenization Feb 8, 2019
@Marny30 Marny30 changed the title anntoconnl : fix word with accents being split by tokenization anntoconll : fix word with accents being split by tokenization Feb 8, 2019