I encountered this character when working with some Twitter data: NARROW NO-BREAK SPACE (U+202F).
There may be better ways to work around this, but this provides a simple fix for users who want to turn off the error.
library("quanteda")
## Package version: 3.1.0
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

# contains a NARROW NO-BREAK SPACE between the last two words
# (written as the escape \u202f so the character stays visible)
# https://www.fileformat.info/info/unicode/char/202f/index.htm
txt <- "सच्चे और\u202fनिष्ठावान"
toks <- tokens(txt)

# not split on this space
toks
## Tokens consisting of 1 document.
## text1 :
## [1] "सच्चे" "और निष्ठावान"

# but is a whitespace
as.character(toks) %>%
  stringi::stri_detect_regex("\\p{Z}")
## [1] FALSE  TRUE

# fails
tokens_wordstem(toks)
## Error in char_wordstem.character(types(x), language = language): whitespace detected: you can only stem tokenized texts

# fails
as.character(toks) %>%
  char_wordstem()
## Error in char_wordstem.character(.): whitespace detected: you can only stem tokenized texts

# fails
dfm(toks) %>%
  dfm_wordstem()
## Error in char_wordstem.character(featnames(x), language = language): whitespace detected: you can only stem tokenized texts
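
For comparison, one possible user-side workaround (not necessarily what this change implements) is to map Unicode separators to ordinary spaces before tokenizing, so that no token contains whitespace by the time it reaches the stemmer. The following is a minimal sketch, assuming stringi is installed; `normalize_spaces()` is a hypothetical helper and not part of quanteda.

```r
library("quanteda")

# Hypothetical helper (not part of quanteda): replace every Unicode
# separator character (\p{Z}, which includes U+202F) with an ordinary
# space so that tokens() can split on it.
normalize_spaces <- function(x) {
  stringi::stri_replace_all_regex(x, "\\p{Z}", " ")
}

# Same example text, with the NARROW NO-BREAK SPACE written as an escape
txt <- "सच्चे और\u202fनिष्ठावान"
toks <- tokens(normalize_spaces(txt))

# Check that no token still contains a separator character
any(stringi::stri_detect_regex(as.character(toks), "\\p{Z}"))

# Stemming should no longer hit the whitespace check
tokens_wordstem(toks)
```

This changes the input text rather than the stemming functions, so it may not suit cases where the narrow no-break space is meant to keep two words together as one token.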