
implicit conversion of character input to UTF-8 #87

Open

ablaette opened this issue Feb 15, 2024 · 0 comments

Comments

@ablaette

tokenize_words() implicitly converts non-UTF-8 input to UTF-8. See the following example (latin1 in, UTF-8 out). As I was not aware of this behavior, it caused me some headaches (see PolMine/cwbtools#8 (comment)).

library(tokenizers)
library(magrittr) # for the %>% pipe and the . placeholder

c("Smørrebrød tastes great!") %>% 
  iconv(from = "UTF-8", to = "latin1") %>%
  tokenize_words(lowercase = FALSE, strip_punct = FALSE) %>%
  .[[1]] %>%
  Encoding()

[1] "UTF-8" "unknown" "unknown" "unknown"

Obviously, the days of 'latin1' are almost entirely over. But the package documentation is silent on this; the only reference to encoding matters is in the 'Description' field of the DESCRIPTION file: "The tokenizers have a consistent interface, and the package is built on the 'stringi' and 'Rcpp' packages for fast yet correct tokenization in 'UTF-8'."

Maybe include a sentence like this in the 'basic-tokenizers' documentation object? "Non-UTF-8 input is converted to UTF-8."
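Until the documentation is updated, a possible workaround is to make the conversion explicit in user code rather than relying on the implicit behavior. A minimal sketch using base R's iconv() (no tokenizers-internal functions assumed):

```r
library(tokenizers)

# Simulate latin1 input, as in the example above
x <- iconv("Smørrebrød tastes great!", from = "UTF-8", to = "latin1")

# Convert explicitly before tokenizing, so the encoding change is
# visible in user code instead of happening silently inside
# tokenize_words()
x_utf8 <- iconv(x, from = "latin1", to = "UTF-8")

tokens <- tokenize_words(x_utf8, lowercase = FALSE, strip_punct = FALSE)[[1]]
Encoding(tokens)
```

This makes the round trip explicit, so downstream code that cares about encodings (as in the cwbtools case linked above) is not surprised by tokens coming back in a different encoding than their input.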
