readtext sometimes produces invalid UTF-8 #108

patperry · 2017-07-13T17:24:34Z

With the current development version of readtext:

> library("readtext")
> tmp <- paste0(tempfile(), ".txt")
> url <- "http://www.gutenberg.org/files/141/141.txt"
> download.file(url, tmp)
trying URL 'http://www.gutenberg.org/files/141/141.txt'
Content type 'text/plain; charset=ISO-8859-1' length 918376 bytes (896 KB)
==================================================
downloaded 896 KB

> text <- readtext(tmp)

> substr(text$text, 875935, 875940)
Error in substr(text$text, 875935, 875940) : 
  invalid multibyte string at '<a3>20,<30>00, any one who could satisfy the

See https://github.com/patperry/r-corpus/blob/master/vignettes/unicode.Rmd for more context.

Session information

> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin16.5.0 (64-bit)
Running under: macOS Sierra 10.12.5

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] readtext_0.51

loaded via a namespace (and not attached):
 [1] httr_1.2.1        compiler_3.4.0    Matrix_1.2-9      R6_2.2.2         
 [5] tools_3.4.0       tibble_1.3.3      Rcpp_0.12.11      stringi_1.1.5    
 [9] grid_3.4.0        data.table_1.10.4 rlang_0.1.1       lattice_0.20-35  
[13] corpus_0.8.0

kbenoit · 2017-07-13T17:45:31Z

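This is an encoding issue: the file is Latin-1 (ISO-8859-1), so you need to specify the encoding at input. The offending byte is probably the "£" symbol, which is 0xA3 in Latin-1 and matches the <a3> in your error message.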

system2("file", tmp)
## /var/folders/46/zfn6gwj15d3_n6dhyy1cvwc00000gp/T//RtmpL5BugH/filee0d4589b4674.txt: ISO-8859 text, with CRLF line terminators

data <- readtext(tmp, encoding = "ISO-8859-1")

cat(substr(data$text, 875900, 875960))
## command of her beauty, and her £20,000, any one who could sat

Nice vignette!

patperry · 2017-07-13T18:18:35Z

I learned from the best!

Would it be possible for readtext to detect that the text is invalid? Something like this:

If the user does not specify the encoding, do the following:

1. Try reading in the data as UTF-8, then validate it.
2. If the text is valid UTF-8, succeed.
3. Otherwise, if the data is not valid UTF-8, re-read the data as Latin-1 and display a warning to the user.

On step 3, instead of assuming Latin-1, you could try some automatic encoding detection (still with a warning to the user). My prior is a high probability of Latin-1, so it might not be worth it to detect the encoding automatically.
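
The three steps above could be sketched in base R roughly as follows. This is only an illustration under stated assumptions, not how readtext works internally: `read_text_with_fallback` is a hypothetical helper name, the file is assumed to contain no embedded NUL bytes, and Latin-1 is assumed as the fallback, per the prior described above. It relies on `validUTF8()`, which base R provides from version 3.3.0.

```r
# Hypothetical helper (not part of readtext): read a file as UTF-8 and fall
# back to another encoding, with a warning, when the bytes are not valid UTF-8.
read_text_with_fallback <- function(path, fallback = "latin1") {
  # Read the raw bytes so that no implicit conversion takes place.
  bytes <- readBin(path, what = "raw", n = file.size(path))
  txt <- rawToChar(bytes)

  if (validUTF8(txt)) {
    # Steps 1 and 2: the bytes are valid UTF-8, so only mark them as such.
    Encoding(txt) <- "UTF-8"
    return(txt)
  }

  # Step 3: reinterpret the bytes in the fallback encoding and tell the user.
  warning("'", path, "' is not valid UTF-8; re-reading as ", fallback,
          call. = FALSE)
  iconv(txt, from = fallback, to = "UTF-8")
}

# Small demonstration: the bytes of "her £20" as they appear in a Latin-1 file.
demo_file <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x68, 0x65, 0x72, 0x20, 0xa3, 0x32, 0x30)), demo_file)
demo_text <- read_text_with_fallback(demo_file)  # emits the warning
validUTF8(demo_text)
```

Validating before converting means files that are already UTF-8 pass through untouched, and the warning only appears for the files that actually needed the fallback.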

kbenoit · 2017-07-13T18:40:56Z

Very doable, since we already have the function encoding().

data <- readtext(tmp)
encoding(data)
## readtext object consisting of 1 document and 0 docvars.
## # data.frame [1 x 2]
##                     doc_id                text
##                      <chr>               <chr>
## 1 filee0d4589b4674.txt "\"The Projec\"..."
## Probable encoding: ISO-8859-1
## (but note: detector often reports ISO-8859-1 when encoding is actually UTF-8.)

kbenoit · 2017-11-08T13:59:49Z

Does your new package utf8 offer a way to solve this?

patperry · 2017-11-08T16:49:13Z

You could check for validity with utf8::utf8_valid. You already import stringi, though, so there's no need to add another dependency. You can check for validity with stringi::stri_enc_isutf8 and then guess the encoding for the invalid strings with stringi::stri_enc_detect.
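
A minimal sketch of that stringi-based approach, applied to a character vector that has already been read in. `fix_invalid_utf8` is a hypothetical helper name and the Latin-1 fallback when detection yields nothing is an assumption; only `stri_enc_isutf8`, `stri_enc_detect`, and `stri_encode` come from stringi. Detection is heuristic, so the guessed encoding for a short string may not be the true one.

```r
library(stringi)

# Hypothetical helper (not part of readtext): convert the elements of a
# character vector that are not valid UTF-8, leaving valid ones untouched.
fix_invalid_utf8 <- function(x) {
  bad <- !stri_enc_isutf8(x)
  for (i in which(bad)) {
    # stri_enc_detect() returns one set of guesses per string, ordered from
    # most to least confident; take the top guess.
    guess <- stri_enc_detect(x[i])[[1]]$Encoding[1]
    # Assumption: fall back to Latin-1 when no guess is available.
    if (is.null(guess) || is.na(guess)) guess <- "ISO-8859-1"
    warning("element ", i, " is not valid UTF-8; converting from ", guess,
            call. = FALSE)
    x[i] <- stri_encode(x[i], from = guess, to = "UTF-8")
  }
  x
}

# Demonstration: one clean string and one holding the Latin-1 byte 0xA3.
latin1_bytes <- rawToChar(as.raw(c(0x68, 0x65, 0x72, 0x20, 0xa3, 0x32, 0x30)))
fixed <- fix_invalid_utf8(c("plain ascii", latin1_bytes))  # emits a warning
stri_enc_isutf8(fixed)
```

Running the detector only on the strings that fail validation sidesteps the caveat noted earlier in the thread, where the detector reports ISO-8859-1 for text that is actually UTF-8.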
