
Some non-breaking spaces are not split but cause exceptions in wordstem functions #2144

Open
kbenoit opened this issue Oct 26, 2021 · 2 comments
kbenoit commented Oct 26, 2021

I encountered this character when working with some Twitter data: NARROW NO-BREAK SPACE (U+202F).

There may be better ways to work around this, but a simple option would be to let users turn off the error.

library("quanteda")
## Package version: 3.1.0
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

# contains a NARROW NO-BREAK SPACE
# https://www.fileformat.info/info/unicode/char/202f/index.htm
txt <- "सच्चे और निष्ठावान"

toks <- tokens(txt)

# not split on this space
toks
## Tokens consisting of 1 document.
## text1 :
## [1] "सच्चे"         "और निष्ठावान"

# but is a whitespace
as.character(toks) %>%
  stringi::stri_detect_regex("\\p{Z}")
## [1] FALSE  TRUE

# fails
tokens_wordstem(toks)
## Error in char_wordstem.character(types(x), language = language): whitespace detected: you can only stem tokenized texts

# fails
as.character(toks) %>%
  char_wordstem()
## Error in char_wordstem.character(.): whitespace detected: you can only stem tokenized texts

# fails
dfm(toks) %>%
  dfm_wordstem()
## Error in char_wordstem.character(featnames(x), language = language): whitespace detected: you can only stem tokenized texts
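One workaround, until the tokeniser handles this character, is to normalise NARROW NO-BREAK SPACE to an ordinary space before calling `tokens()`. This is a sketch using only base R; it is not an official quanteda fix, and the quanteda calls mentioned in the comments are illustrative.

```r
# Sketch of a pre-tokenisation workaround: replace NARROW NO-BREAK SPACE
# (U+202F) with a regular space so the default tokeniser splits on it.
# The string is constructed explicitly so the U+202F is visible in the code.
txt <- "सच्चे और\u202fनिष्ठावान"   # U+202F sits before the last word
txt_clean <- gsub("\u202f", " ", txt, fixed = TRUE)

# tokens(txt_clean) should now yield three tokens, and tokens_wordstem()
# should no longer raise the "whitespace detected" error.
```

After this substitution, `as.character(tokens(txt_clean))` should no longer contain any `\p{Z}` characters, so the check inside `char_wordstem()` passes.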
kbenoit commented Mar 30, 2023

@koheiw @odelmarcelle I noticed that the "word4" tokeniser does not handle this correctly either.

> txt <- "सच्चे और निष्ठावान"
> toks <- tokens(txt, what = "word4")
> toks
Tokens consisting of 1 document.
text1 :
[1] "सच्चे"         "और निष्ठावान"

odelmarcelle commented
Narrow NBSP appears to have a weird history. I found an ICU issue that advocates not treating it as a word separator for the Mongolian language: https://unicode-org.atlassian.net/browse/ICU-10212?jql=text%20~%20%22202F%22.

I imagine this is why the default word-break rules do not take care of it.
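To see the mismatch concretely: U+202F has Unicode general category Zs (space separator), which is why quanteda's `\p{Z}` check fires, yet ICU's default word-break rules keep it inside a word. A minimal base-R sketch to confirm the code point inside the unsplit token:

```r
# Sketch: identify the offending character without any extra packages.
# utf8ToInt() returns the Unicode code points of the string, which we
# format as U+XXXX labels to spot the NARROW NO-BREAK SPACE.
tok <- "और\u202fनिष्ठावान"              # the unsplit token from the examples above
codes <- sprintf("U+%04X", utf8ToInt(tok))
"U+202F" %in% codes                     # the token does contain the NNBSP
```

So the character satisfies the `\p{Z}` regex used by the wordstem functions even though the ICU tokeniser never split on it, which explains the inconsistency reported here.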
