
Some non-breaking spaces are not split but cause exceptions in wordstem functions #2144

Open
kbenoit opened this issue Oct 26, 2021 · 2 comments
kbenoit commented Oct 26, 2021

I encountered this character when working with some Twitter data: NARROW NO-BREAK SPACE (U+202F).

There may be better ways to work around this, but a simple option would be to let users turn off the error.

library("quanteda")
## Package version: 3.1.0
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

# contains a NARROW NO-BREAK SPACE
# https://www.fileformat.info/info/unicode/char/202f/index.htm
txt <- "सच्चे और निष्ठावान"

toks <- tokens(txt)

# not split on this space
toks
## Tokens consisting of 1 document.
## text1 :
## [1] "सच्चे"         "और निष्ठावान"

# but is a whitespace
as.character(toks) %>%
  stringi::stri_detect_regex("\\p{Z}")
## [1] FALSE  TRUE

# fails
tokens_wordstem(toks)
## Error in char_wordstem.character(types(x), language = language): whitespace detected: you can only stem tokenized texts

# fails
as.character(toks) %>%
  char_wordstem()
## Error in char_wordstem.character(.): whitespace detected: you can only stem tokenized texts

# fails
dfm(toks) %>%
  dfm_wordstem()
## Error in char_wordstem.character(featnames(x), language = language): whitespace detected: you can only stem tokenized texts
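One workaround, until the tokeniser handles this character, is to normalise NARROW NO-BREAK SPACE to an ordinary space before calling `tokens()`. This is a sketch using only base R; it is not an official quanteda fix, and the quanteda calls mentioned in the comments are illustrative.

```r
# Sketch of a pre-tokenisation workaround: replace NARROW NO-BREAK SPACE
# (U+202F) with a regular space so the default tokeniser splits on it.
# The string is constructed explicitly so the U+202F is visible in the code.
txt <- "सच्चे और\u202fनिष्ठावान"   # U+202F sits before the last word
txt_clean <- gsub("\u202f", " ", txt, fixed = TRUE)

# tokens(txt_clean) should now yield three tokens, and tokens_wordstem()
# should no longer raise the "whitespace detected" error.
```

After this substitution, `as.character(tokens(txt_clean))` should no longer contain any `\p{Z}` characters, so the check inside `char_wordstem()` passes.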
kbenoit commented Mar 30, 2023

@koheiw @odelmarcelle I noticed that the "word4" tokeniser does not handle this correctly either.

> txt <- "सच्चे और निष्ठावान"
> toks <- tokens(txt, what = "word4")
> toks
Tokens consisting of 1 document.
text1 :
[1] "सच्चे"         "और निष्ठावान"

odelmarcelle commented
Narrow NBSP appears to have a weird history. I found an ICU issue that advocates not treating it as a word separator for the Mongolian language: https://unicode-org.atlassian.net/browse/ICU-10212?jql=text%20~%20%22202F%22.

I imagine this is why the default word-break rules do not take care of it.
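To see the mismatch concretely: U+202F has Unicode general category Zs (space separator), which is why quanteda's `\p{Z}` check fires, yet ICU's default word-break rules keep it inside a word. A minimal base-R sketch to confirm the code point inside the unsplit token:

```r
# Sketch: identify the offending character without any extra packages.
# utf8ToInt() returns the Unicode code points of the string, which we
# format as U+XXXX labels to spot the NARROW NO-BREAK SPACE.
tok <- "और\u202fनिष्ठावान"              # the unsplit token from the examples above
codes <- sprintf("U+%04X", utf8ToInt(tok))
"U+202F" %in% codes                     # the token does contain the NNBSP
```

So the character satisfies the `\p{Z}` regex used by the wordstem functions even though the ICU tokeniser never split on it, which explains the inconsistency reported here.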
