readtext sometimes produces invalid UTF-8 #108

Open
patperry opened this issue Jul 13, 2017 · 5 comments

@patperry

With current development version of readtext:

> library("readtext")
> tmp <- paste0(tempfile(), ".txt")
> url <- "http://www.gutenberg.org/files/141/141.txt"
> download.file(url, tmp)
trying URL 'http://www.gutenberg.org/files/141/141.txt'
Content type 'text/plain; charset=ISO-8859-1' length 918376 bytes (896 KB)
==================================================
downloaded 896 KB

> data <- readtext(tmp)

> substr(data$text, 875935, 875940)
Error in substr(data$text, 875935, 875940) : 
  invalid multibyte string at '<a3>20,<30>00, any one who could satisfy the

See https://github.com/patperry/r-corpus/blob/master/vignettes/unicode.Rmd for more context.


Session information

> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin16.5.0 (64-bit)
Running under: macOS Sierra 10.12.5

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] readtext_0.51

loaded via a namespace (and not attached):
 [1] httr_1.2.1        compiler_3.4.0    Matrix_1.2-9      R6_2.2.2         
 [5] tools_3.4.0       tibble_1.3.3      Rcpp_0.12.11      stringi_1.1.5    
 [9] grid_3.4.0        data.table_1.10.4 rlang_0.1.1       lattice_0.20-35  
[13] corpus_0.8.0   
@kbenoit
Collaborator

kbenoit commented Jul 13, 2017

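This is an encoding issue: the file is Latin-1 (ISO-8859-1), so you need to specify the encoding at input. The culprit is probably the "£" symbol, which is the single byte 0xA3 in Latin-1 and matches the <a3> in the error message.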

system2("file", tmp)
## /var/folders/46/zfn6gwj15d3_n6dhyy1cvwc00000gp/T//RtmpL5BugH/filee0d4589b4674.txt: ISO-8859 text, with CRLF line terminators

data <- readtext(tmp, encoding = "ISO-8859-1")

cat(substr(data$text, 875900, 875960))
## command of her beauty, and her £20,000, any one who could sat
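
To illustrate why that byte trips up substr(), here is a minimal base-R sketch. The sample bytes are made up for the illustration (they spell "£20,000" in Latin-1) and are not read from the Gutenberg file.

```r
# Bytes of "£20,000" as they would appear in a Latin-1 file (illustrative sample).
x <- rawToChar(as.raw(c(0xA3, 0x32, 0x30, 0x2C, 0x30, 0x30, 0x30)))

# 0xA3 is a continuation byte in UTF-8, so it cannot start a character.
validUTF8(x)   # FALSE

# Declaring the source encoding on conversion yields valid UTF-8.
y <- iconv(x, from = "latin1", to = "UTF-8")
validUTF8(y)   # TRUE
```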

Nice vignette!

@patperry
Author

I learned from the best!

Would it be possible for readtext to detect that the text is invalid? Something like this:

If the user does not specify the encoding, do the following:

  1. Try reading in the data as UTF-8, then validate it.
  2. If the text is valid UTF-8, succeed.
  3. Otherwise, re-read the data as Latin-1 and display a warning to the user.

In step 3, instead of assuming Latin-1, you could try automatic encoding detection (still with a warning to the user). My prior puts a high probability on Latin-1, so automatic detection might not be worth it.
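
The three steps above could be sketched in base R roughly as follows. This is only an illustration of the proposal, not readtext's implementation: read_text_fallback() and its fallback argument are hypothetical names, and the sketch assumes the file contains no embedded NUL bytes.

```r
# Hypothetical helper, not part of readtext: read a file as UTF-8 and fall
# back to Latin-1 (with a warning) when the bytes are not valid UTF-8.
read_text_fallback <- function(path, fallback = "latin1") {
  # Read raw bytes so that nothing is re-encoded on input.
  bytes <- readBin(path, what = "raw", n = file.size(path))
  txt <- rawToChar(bytes)

  # Steps 1 and 2: validate as UTF-8 and succeed if valid.
  if (validUTF8(txt)) {
    Encoding(txt) <- "UTF-8"
    return(txt)
  }

  # Step 3: warn, then reinterpret the bytes in the fallback encoding.
  warning("input is not valid UTF-8; re-reading as ", fallback)
  iconv(txt, from = fallback, to = "UTF-8")
}

# Usage on a small Latin-1 sample file (bytes of "£20").
sample_file <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0xA3, 0x32, 0x30)), sample_file)
out <- suppressWarnings(read_text_fallback(sample_file))
validUTF8(out)
```

Reading the bytes once and converting in memory avoids a second pass over the file, at the cost of holding the whole file in memory.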

@kbenoit
Collaborator

kbenoit commented Jul 13, 2017

Very doable, since we already have the function encoding().

data <- readtext(tmp)
encoding(data)
## readtext object consisting of 1 document and 0 docvars.
## # data.frame [1 x 2]
##                     doc_id                text
##                      <chr>               <chr>
## 1 filee0d4589b4674.txt "\"The Projec\"..."
## Probable encoding: ISO-8859-1
## (but note: detector often reports ISO-8859-1 when encoding is actually UTF-8.)

@kbenoit
Collaborator

kbenoit commented Nov 8, 2017

Does your new package utf8 offer a way to solve this?

@patperry
Author

patperry commented Nov 8, 2017

You could check for validity with utf8::utf8_valid. You already import stringi, though, so there's no need to add another dependency. You can check for validity with stringi::stri_enc_isutf8 and then guess the encoding for the invalid strings with stringi::stri_enc_detect.
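
A rough sketch of that stringi-based approach might look like the following. fix_encoding() and its default argument are hypothetical names invented for this example, and the fallback to ISO-8859-1 when the detector returns no candidate is an assumption rather than anything readtext does. Detection on short strings is unreliable, so the guessed encoding may differ from the true one.

```r
# Hypothetical helper, not part of readtext or stringi: convert every element
# that is not valid UTF-8, using stringi's detector to guess its encoding.
fix_encoding <- function(x, default = "ISO-8859-1") {
  bad <- !stringi::stri_enc_isutf8(x)
  for (i in which(bad)) {
    # The detector returns candidates ordered by confidence; take the first.
    guess <- stringi::stri_enc_detect(x[i])[[1]]$Encoding[1]
    if (is.null(guess) || is.na(guess)) guess <- default
    warning("element ", i, " is not valid UTF-8; converting from ", guess)
    x[i] <- stringi::stri_encode(x[i], from = guess, to = "UTF-8")
  }
  x
}

# Illustrative input: one ASCII string and one Latin-1 string ("her £20,000").
latin1 <- rawToChar(as.raw(c(0x68, 0x65, 0x72, 0x20, 0xA3,
                             0x32, 0x30, 0x2C, 0x30, 0x30, 0x30)))
fixed <- suppressWarnings(fix_encoding(c("plain ASCII", latin1)))
stringi::stri_enc_isutf8(fixed)
```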
