hunspell_stem error: basic_string::_M_construct null not valid #49

Open
zeuner opened this issue Feb 28, 2021 · 4 comments

zeuner commented Feb 28, 2021

I've been facing an issue where hunspell_stem will crash with a C++ string constructor error:

basic_string::_M_construct null not valid

After that, the stemmed output is unusable. When using it in a tokenizer, I just get the words from the error message as tokens.

I have yet to come up with a minimal failing example. That requires first finding out exactly how hunspell_stem was called, because it is currently invoked from CreateDtm (textmineR package) on large text corpora. The crash happened while I was building a document-term matrix with stemming from an Italian-language text corpus, using code that already works reliably on corpora in other languages. If you have any suggestions for tracking down the issue, I'd be happy to hear them.

For now, I have worked around it with a small change to your library, which you can find at https://github.com/zeuner/hunspell/tree/string_error . The change checks for the NULL pointer that causes the invalid constructor call and returns early. I didn't open a pull request because I don't know the library well enough to decide whether this is the correct way to solve the issue, but it works for me. If you think it is, feel free to pull it in. In that case, you might want to check whether it's permissible to just return a default-constructed Rcpp::CharacterVector. This might be the case because R_hunspell_stem seems to even allow out[i] not to be set at all.
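To illustrate the kind of guard described above, here is a minimal, self-contained C++ sketch. It is not the code from the linked branch or from the hunspell package: `safe_string` and `collect_stems` are hypothetical names, and the C-style array of stems is only an assumed stand-in for whatever the library hands back. Constructing a `std::string` from a null `const char*` violates the constructor's precondition, and libstdc++ typically reports it with exactly the message `basic_string::_M_construct null not valid`.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of hunspell): turn a possibly-null C string
// into a std::string without ever passing a null pointer to the constructor.
std::string safe_string(const char* s) {
    return s ? std::string(s) : std::string();
}

// Hypothetical stand-in for collecting stems from a C-style array.
// Assumption: either the array itself or individual entries may be null.
std::vector<std::string> collect_stems(const char* const* stems, int n) {
    std::vector<std::string> out;
    if (stems == nullptr) {
        return out;  // early return with an empty result instead of crashing
    }
    for (int i = 0; i < n; ++i) {
        if (stems[i] == nullptr) {
            continue;  // skip null entries rather than constructing from them
        }
        out.emplace_back(stems[i]);
    }
    return out;
}
```

With this guard, an input for which no stem is produced yields an empty result rather than an exception. Whether an empty result is the right behaviour for the R-level API is a separate question that the maintainers would need to settle.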

jeroen (Member) commented Feb 28, 2021

Can you please include example code so that I can reproduce the crash?

zeuner (Author) commented Mar 2, 2021

I was able to reduce the offending stemming operation to this fairly minimal example code that crashes for me:

library(hunspell)

stem <- function (x) {
    hunspell_stem(x, "it_IT")
}

stem(c("altresı"))

I'm not sure whether the word being stemmed is correct Italian. But I think hunspell should at least avoid crashing with a cryptic error message, even if it encounters low-quality input.

jeroen (Member) commented Mar 3, 2021

Thanks. I don't have an Italian dictionary myself. Can you please tell me which OS you run and where you got the dictionary, so that I can test it?

zeuner (Author) commented Mar 7, 2021

I encountered the error on both Debian Buster and Ubuntu Focal Fossa, using the hunspell-it dictionary packages provided by the OS. In particular, this was hunspell-it 6.4.3 ( https://packages.ubuntu.com/focal/hunspell-it ) and 6.2.0 ( https://packages.debian.org/buster/hunspell-it ).
