
nersuite_tokenizer produces wrong results on wide character types (Unicode, UTFs) #25

Open
fnl opened this issue Jan 20, 2014 · 2 comments

Comments

@fnl

fnl commented Jan 20, 2014

There is a Unicode bug in nersuite_common/tokenizer.cpp, in Tokenizer::find_token_end: if the isalnum(int) test inside that method fails, the token created is always exactly one byte wide (because the method then returns beg + 1, where beg is a size_t byte offset).
This means that no multibyte-encoded text, such as any of the UTFs, can be tokenized correctly by this tool (with the logical exception of UTF-8 that contains only ASCII), because multibyte characters are split into two or more tokens.
This is even nastier with UTF-8-encoded text, because the bug is non-obvious and only surfaces when special characters such as non-ASCII dashes or Greek letters appear in the text.
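A minimal sketch of what a fix could look like (the helper utf8_char_len and the rewritten return statement below are my own illustration, not NERsuite code): instead of returning beg + 1 for a non-alphanumeric character, advance past the whole UTF-8 byte sequence that starts at beg.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of NERsuite): number of bytes in the
// UTF-8 sequence whose lead byte sits at position `beg` of `str`.
// Falls back to 1 for ASCII and for invalid lead bytes, so the scan
// always makes progress.
static std::size_t utf8_char_len(const std::string& str, std::size_t beg) {
    unsigned char lead = static_cast<unsigned char>(str[beg]);
    if (lead < 0x80) return 1;          // 0xxxxxxx: ASCII
    if ((lead >> 5) == 0x06) return 2;  // 110xxxxx: 2-byte sequence
    if ((lead >> 4) == 0x0E) return 3;  // 1110xxxx: 3-byte sequence
    if ((lead >> 3) == 0x1E) return 4;  // 11110xxx: 4-byte sequence
    return 1;                           // invalid lead byte: skip one byte
}

// Sketch of the fixed fallback in find_token_end: for a character that
// fails the alphanumeric test, end the token after the full code point
// rather than after a single byte.
static std::size_t find_token_end_utf8(const std::string& str, std::size_t beg) {
    return beg + utf8_char_len(str, beg);
}
```

For example, for the Greek letter alpha (encoded in UTF-8 as the two bytes 0xCE 0xB1), find_token_end_utf8 returns beg + 2, so the character stays in one token instead of being split into two one-byte tokens.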

@priancho
Member

Thank you for your bug report.

When we began developing this application, we used a pre-processing program that converts Unicode characters to ASCII characters.
It is not exactly the same program, but you can find a similar one at https://github.com/spyysalo/unicode2ascii
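To illustrate the kind of pre-processing meant here (the mapping table below is a toy illustration, not the actual data used by unicode2ascii): common UTF-8 sequences are rewritten to ASCII equivalents before the text ever reaches the tokenizer, so the byte-oriented tokenizer only sees single-byte characters.

```cpp
#include <string>
#include <utility>

// Toy stand-in for a Unicode-to-ASCII pre-processing pass. Each entry
// maps one UTF-8 byte sequence to an ASCII replacement; real tools use
// much larger tables.
static std::string unicode_to_ascii(const std::string& in) {
    static const std::pair<std::string, std::string> table[] = {
        {"\xE2\x80\x93", "-"},    // en dash
        {"\xE2\x80\x94", "-"},    // em dash
        {"\xCE\xB1", "alpha"},    // Greek small letter alpha
        {"\xCE\xB2", "beta"},     // Greek small letter beta
    };
    std::string out = in;
    for (const auto& p : table) {
        std::string::size_type pos = 0;
        // Replace every occurrence of the UTF-8 sequence.
        while ((pos = out.find(p.first, pos)) != std::string::npos) {
            out.replace(pos, p.first.size(), p.second);
            pos += p.second.size();
        }
    }
    return out;
}
```

After this pass, a string like "TNF-α" becomes "TNF-alpha", which the current single-byte tokenizer handles correctly.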

I would also like to make NERsuite handle multibyte input, since non-ASCII characters appear in virtually all biomedical texts.
Unfortunately, it will take some time, at least a few months, to find time for this improvement, because I am currently preparing my thesis defense presentation, which takes place at the beginning of February.

Best wishes,

@fnl
Author

fnl commented Jan 24, 2014

Thanks for the reply and the link to a (hopefully temporary) workaround.
Good luck with the defense!
