
nersuite_tokenizer produces wrong results on wide character types (Unicode, UTFs) #25

Open
fnl opened this issue Jan 20, 2014 · 2 comments

Comments

@fnl

fnl commented Jan 20, 2014

There is a Unicode bug in nersuite_common/tokenizer.cpp, in Tokenizer::find_token_end: if the isalnum(int) test inside that method fails, the token created is always exactly one byte wide (because the method then returns beg + 1, where beg is a size_t byte offset).
This means that no multibyte-encoded text, such as any of the UTFs, can be tokenized correctly by this tool (with the logical exception of UTF-8 that contains only ASCII), because multibyte characters are split into two or more tokens.
This is even nastier with UTF-8-encoded text, because the bug is non-obvious and only surfaces when special characters such as non-ASCII dashes or Greek letters appear in the text.
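A minimal sketch of what a fix could look like (the helper utf8_char_len and the rewritten return statement below are my own illustration, not NERsuite code): instead of returning beg + 1 for a non-alphanumeric character, advance past the whole UTF-8 byte sequence that starts at beg.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of NERsuite): number of bytes in the
// UTF-8 sequence whose lead byte sits at position `beg` of `str`.
// Falls back to 1 for ASCII and for invalid lead bytes, so the scan
// always makes progress.
static std::size_t utf8_char_len(const std::string& str, std::size_t beg) {
    unsigned char lead = static_cast<unsigned char>(str[beg]);
    if (lead < 0x80) return 1;          // 0xxxxxxx: ASCII
    if ((lead >> 5) == 0x06) return 2;  // 110xxxxx: 2-byte sequence
    if ((lead >> 4) == 0x0E) return 3;  // 1110xxxx: 3-byte sequence
    if ((lead >> 3) == 0x1E) return 4;  // 11110xxx: 4-byte sequence
    return 1;                           // invalid lead byte: skip one byte
}

// Sketch of the fixed fallback in find_token_end: for a character that
// fails the alphanumeric test, end the token after the full code point
// rather than after a single byte.
static std::size_t find_token_end_utf8(const std::string& str, std::size_t beg) {
    return beg + utf8_char_len(str, beg);
}
```

For example, for the Greek letter alpha (encoded in UTF-8 as the two bytes 0xCE 0xB1), find_token_end_utf8 returns beg + 2, so the character stays in one token instead of being split into two one-byte tokens.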

@priancho
Member

Thank you for your bug report.

When we began developing this application, we used a pre-processing program that converts Unicode characters to ASCII characters.
It is not exactly the same program, but you can find a similar one at https://github.com/spyysalo/unicode2ascii
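To illustrate the kind of pre-processing meant here (the mapping table below is a toy illustration, not the actual data used by unicode2ascii): common UTF-8 sequences are rewritten to ASCII equivalents before the text ever reaches the tokenizer, so the byte-oriented tokenizer only sees single-byte characters.

```cpp
#include <string>
#include <utility>

// Toy stand-in for a Unicode-to-ASCII pre-processing pass. Each entry
// maps one UTF-8 byte sequence to an ASCII replacement; real tools use
// much larger tables.
static std::string unicode_to_ascii(const std::string& in) {
    static const std::pair<std::string, std::string> table[] = {
        {"\xE2\x80\x93", "-"},    // en dash
        {"\xE2\x80\x94", "-"},    // em dash
        {"\xCE\xB1", "alpha"},    // Greek small letter alpha
        {"\xCE\xB2", "beta"},     // Greek small letter beta
    };
    std::string out = in;
    for (const auto& p : table) {
        std::string::size_type pos = 0;
        // Replace every occurrence of the UTF-8 sequence.
        while ((pos = out.find(p.first, pos)) != std::string::npos) {
            out.replace(pos, p.first.size(), p.second);
            pos += p.second.size();
        }
    }
    return out;
}
```

After this pass, a string like "TNF-α" becomes "TNF-alpha", which the current single-byte tokenizer handles correctly.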

I would also like to make NERsuite handle multibyte input, since non-ASCII characters appear in virtually all biomedical texts.
Unfortunately, it will take some time, at least a few months, to find time for this improvement, because I am currently preparing my thesis defense presentation, which takes place at the beginning of February.

Best wishes,

@fnl
Author

fnl commented Jan 24, 2014

Thanks for the reply and the link to a (hopefully temporary) workaround.
Good luck with the defense!
