
Performance gains using readr::read_file() #74

Open
lmullen opened this issue Apr 22, 2017 · 3 comments

@lmullen

lmullen commented Apr 22, 2017

readtext is great. My students will thank you.

For reading in a directory of plain text files, you can get substantial time savings (roughly 30x on my machine) by using readr::read_file() instead of reading each file line-by-line and pasting the lines back together.

Benchmarks for smallish corpus:

library(readtext)
library(microbenchmark)

files <- Sys.glob("~/dev/ats-corpus/corpus/*")
length(files)
#> [1] 641

get_texts_readr <- function(files) {
  # read each file in one go, then build a readtext-like data frame
  texts <- vapply(files, readr::read_file, character(1))
  out <- data.frame(text = texts, stringsAsFactors = FALSE)
  class(out) <- c("readtext", "data.frame")
  out
}

microbenchmark(
  readtext_corpus <- readtext(files),
  readr_corpus <- get_texts_readr(files),
  times = 5
)
#> Unit: milliseconds
#>                                    expr        min         lq      mean
#>      readtext_corpus <- readtext(files) 36156.4825 37310.8008 38114.868
#>  readr_corpus <- get_texts_readr(files)   903.6408   906.0704  1041.474
#>      median        uq       max neval cld
#>  38350.7976 38865.321 39890.937     5   b
#>    912.5825  1153.461  1331.615     5  a

str(readtext_corpus)
#> Classes 'readtext' and 'data.frame': 641 obs. of  1 variable:
#>  $ text: chr  "HISTORICAL SKETCH OF THE AMERICAN TRACT SOCIETY. \nThis institution was organized in the year 18U, four years later than the Am"| __truncated__ "P VffDLff^ \n\n\n\n\nLETTERS 5 \n\n\\\\\\ \\ -Si fi;om \n\n) A SENIOR \n\nTO \n\n(4{\\ A JUNIOR PHYSICIAN, \n\n\n\nf \n\n\n\nTH"| __truncated__ "M<g*j§Ylft'gj. \n\n\n\n\n••^aA*)^ \n\n\n\nWHAT \n\n\n\nSHALL I DRINK? \n\n\n\nREUBEN D. MUSSEY, M.D., LLJ). \n\n\n\n\n\n\n\n\n\"| __truncated__ "1854 \n\n\n\n\n\n\n>< \n\n\n\n7^ \n\n\n\n* \n• \n* \n* \n\n\n\n\nSTEPHEN J. W. TABOR. \n\n\n\nOTIUM mm: UTERIS MORS ESI . \n\n\"| __truncated__ ...

str(readr_corpus)
#> Classes 'readtext' and 'data.frame': 641 obs. of  1 variable:
#>  $ text: chr  "HISTORICAL SKETCH OF THE AMERICAN TRACT SOCIETY. \nThis institution was organized in the year 18U, four years later than the Am"| __truncated__ "P VffDLff^ \n\n\n\n\nLETTERS 5 \n\n\\\\\\ \\ -Si fi;om \n\n) A SENIOR \n\nTO \n\n(4{\\ A JUNIOR PHYSICIAN, \n\n\n\nf \n\n\n\nTH"| __truncated__ "M<g*j§Ylft'gj. \n\n\n\n\n••^aA*)^ \n\n\n\nWHAT \n\n\n\nSHALL I DRINK? \n\n\n\nREUBEN D. MUSSEY, M.D., LLJ). \n\n\n\n\n\n\n\n\n\"| __truncated__ "1854 \n\n\n\n\n\n\n>< \n\n\n\n7^ \n\n\n\n* \n• \n* \n* \n\n\n\n\nSTEPHEN J. W. TABOR. \n\n\n\nOTIUM mm: UTERIS MORS ESI . \n\n\"| __truncated__ ...

If you're willing to take a dependency on readr, then I would be happy to send a PR. What do you think?

@kbenoit
Collaborator

kbenoit commented May 15, 2017

Hi @lmullen, just getting back to this now that I have time. We're also preparing a CRAN release.

I'd love to gain 30x more performance on the most commonly read type of file (text). I have no problem with adding a readr import. If you want to issue a PR with this change, by all means go ahead!

I wonder, however, how much of the performance difference comes from extra readtext() processing versus the slower readLines() call itself. Above, you are really comparing a low-level reader to a high-level wrapper around (among other things) the readLines() reader. The only way to tell is to write a parallel function and compare them head-to-head before killing the slower one off. (There can be only one ⚔️)
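The low-level readers can be compared head-to-head like this (a sketch; a single generated temporary file stands in for a corpus document, and the timings will of course vary by machine):

```r
library(microbenchmark)

# A throwaway file standing in for one corpus document.
f <- tempfile(fileext = ".txt")
writeLines(rep("some text", 1000L), f)

# Compare just the readers, with readtext()'s other processing stripped away:
# base readLines() + paste() versus readr::read_file().
microbenchmark(
  base  = paste(readLines(f), collapse = "\n"),
  readr = readr::read_file(f),
  times = 10
)
```

Both approaches yield the same string, up to the trailing newline that read_file() preserves and readLines() drops.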

@kbenoit
Collaborator

kbenoit commented May 16, 2017

I experimented with this in a branch, and it's trickier than it looks. Yes, readr::read_file() is faster, but handling encoding file-by-file erodes much of the speed gain (though it is still about 2x faster). The more difficult problem is that we are then caught between the base R encodings (from file()) and the stringi encodings, which are neither the same set nor the same names. Solving this will mean rebasing the code in a more significant way, which also addresses #37.
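For what it's worth, readr can take a per-file encoding through locale(); a minimal sketch of the idea (the read_texts() helper and the demo file are hypothetical, not part of readtext):

```r
library(readr)

# Hypothetical helper: read each file whole, honouring a per-file encoding
# passed to readr::read_file() via locale().
read_texts <- function(files, encodings = rep("UTF-8", length(files))) {
  texts <- mapply(
    function(f, enc) read_file(f, locale = locale(encoding = enc)),
    files, encodings
  )
  data.frame(doc_id = basename(files), text = unname(texts),
             stringsAsFactors = FALSE)
}

# Demo with a temporary file.
tmp <- tempfile(fileext = ".txt")
writeLines("hello world", tmp)
df <- read_texts(tmp)
```

The catch described above remains: the encoding names accepted here are not guaranteed to map one-to-one onto the stringi encoding names.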

I'm putting this on the back burner for now, but it's definitely something to address in the next revision. I also think we can remove the encoding argument and use readr::guess_encoding() instead. (Both are based on the same underlying stringi function.)
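For reference, readr::guess_encoding() returns a small data frame of candidate encodings with confidence scores (a sketch; the sample file is made up):

```r
library(readr)

# A temporary file with some non-ASCII content for the guesser to work with.
f <- tempfile(fileext = ".txt")
writeLines(enc2utf8("café résumé"), f, useBytes = TRUE)

g <- guess_encoding(f)
g  # columns: encoding, confidence
```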

@kbenoit kbenoit removed this from the CRAN release milestone May 16, 2017
@lmullen
Author

lmullen commented May 16, 2017

Thanks for the update, @kbenoit. I was just about to start work on this. Sounds like I should hold off for now, but happy to help out when you say the time is right. Looking forward to your first CRAN release.
