
Performance gains using readr::read_file() #74

Open
lmullen opened this issue Apr 22, 2017 · 3 comments

@lmullen

lmullen commented Apr 22, 2017

readtext is great. My students will thank you.

For reading in a directory of plain text files, you can get substantial time savings (roughly 30x on my machine) by using readr::read_file() instead of reading each file line-by-line and pasting the lines back together.

Benchmarks for smallish corpus:

library(readtext)
library(microbenchmark)

files <- Sys.glob("~/dev/ats-corpus/corpus/*")
length(files)
#> [1] 641

get_texts_readr <- function(files) {
  # read each file in one go, then build a readtext-like data frame
  texts <- vapply(files, readr::read_file, character(1))
  out <- data.frame(text = texts, stringsAsFactors = FALSE)
  class(out) <- c("readtext", "data.frame")
  out
}

microbenchmark(
  readtext_corpus <- readtext(files),
  readr_corpus <- get_texts_readr(files),
  times = 5
)
#> Unit: milliseconds
#>                                    expr        min         lq      mean
#>      readtext_corpus <- readtext(files) 36156.4825 37310.8008 38114.868
#>  readr_corpus <- get_texts_readr(files)   903.6408   906.0704  1041.474
#>      median        uq       max neval cld
#>  38350.7976 38865.321 39890.937     5   b
#>    912.5825  1153.461  1331.615     5  a

str(readtext_corpus)
#> Classes 'readtext' and 'data.frame': 641 obs. of  1 variable:
#>  $ text: chr  "HISTORICAL SKETCH OF THE AMERICAN TRACT SOCIETY. \nThis institution was organized in the year 18U, four years later than the Am"| __truncated__ "P VffDLff^ \n\n\n\n\nLETTERS 5 \n\n\\\\\\ \\ -Si fi;om \n\n) A SENIOR \n\nTO \n\n(4{\\ A JUNIOR PHYSICIAN, \n\n\n\nf \n\n\n\nTH"| __truncated__ "M<g*j§Ylft'gj. \n\n\n\n\n••^aA*)^ \n\n\n\nWHAT \n\n\n\nSHALL I DRINK? \n\n\n\nREUBEN D. MUSSEY, M.D., LLJ). \n\n\n\n\n\n\n\n\n\"| __truncated__ "1854 \n\n\n\n\n\n\n>< \n\n\n\n7^ \n\n\n\n* \n• \n* \n* \n\n\n\n\nSTEPHEN J. W. TABOR. \n\n\n\nOTIUM mm: UTERIS MORS ESI . \n\n\"| __truncated__ ...

str(readr_corpus)
#> Classes 'readtext' and 'data.frame': 641 obs. of  1 variable:
#>  $ text: chr  "HISTORICAL SKETCH OF THE AMERICAN TRACT SOCIETY. \nThis institution was organized in the year 18U, four years later than the Am"| __truncated__ "P VffDLff^ \n\n\n\n\nLETTERS 5 \n\n\\\\\\ \\ -Si fi;om \n\n) A SENIOR \n\nTO \n\n(4{\\ A JUNIOR PHYSICIAN, \n\n\n\nf \n\n\n\nTH"| __truncated__ "M<g*j§Ylft'gj. \n\n\n\n\n••^aA*)^ \n\n\n\nWHAT \n\n\n\nSHALL I DRINK? \n\n\n\nREUBEN D. MUSSEY, M.D., LLJ). \n\n\n\n\n\n\n\n\n\"| __truncated__ "1854 \n\n\n\n\n\n\n>< \n\n\n\n7^ \n\n\n\n* \n• \n* \n* \n\n\n\n\nSTEPHEN J. W. TABOR. \n\n\n\nOTIUM mm: UTERIS MORS ESI . \n\n\"| __truncated__ ...

If you're willing to take a dependency on readr, then I would be happy to send a PR. What do you think?

@kbenoit
Collaborator

kbenoit commented May 15, 2017

Hi @lmullen, just getting back to this now that I have time. We're also preparing a CRAN release.

I'd love to gain 30x more performance on the most commonly read type of file (text). I have no problem with adding a readr import. If you want to issue a PR with this change, by all means go ahead!

I wonder, however, how much of the performance difference comes from extra readtext() processing versus the slower readLines() call itself. Above, you are really comparing a low-level reader to a high-level wrapper around (among other things) the readLines() reader. The only way to tell is to write a parallel function and compare them head-to-head before killing the slower one off. (There can be only one ⚔️)
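The low-level readers can be compared head-to-head like this (a sketch; a single generated temporary file stands in for a corpus document, and the timings will of course vary by machine):

```r
library(microbenchmark)

# A throwaway file standing in for one corpus document.
f <- tempfile(fileext = ".txt")
writeLines(rep("some text", 1000L), f)

# Compare just the readers, with readtext()'s other processing stripped away:
# base readLines() + paste() versus readr::read_file().
microbenchmark(
  base  = paste(readLines(f), collapse = "\n"),
  readr = readr::read_file(f),
  times = 10
)
```

Both approaches yield the same string, up to the trailing newline that read_file() preserves and readLines() drops.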

@kbenoit
Collaborator

kbenoit commented May 16, 2017

I experimented with this in a branch, and it's trickier than it looks. Yes, readr::read_file() is faster, but handling encoding file-by-file erodes much of the speed gain (though it is still about 2x faster). The more difficult problem is that we are then caught between the base R encodings (from file()) and the stringi encodings, which are neither the same set nor the same names. Solving this will mean rebasing the code in a more significant way, which also addresses #37.
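For what it's worth, readr can take a per-file encoding through locale(); a minimal sketch of the idea (the read_texts() helper and the demo file are hypothetical, not part of readtext):

```r
library(readr)

# Hypothetical helper: read each file whole, honouring a per-file encoding
# passed to readr::read_file() via locale().
read_texts <- function(files, encodings = rep("UTF-8", length(files))) {
  texts <- mapply(
    function(f, enc) read_file(f, locale = locale(encoding = enc)),
    files, encodings
  )
  data.frame(doc_id = basename(files), text = unname(texts),
             stringsAsFactors = FALSE)
}

# Demo with a temporary file.
tmp <- tempfile(fileext = ".txt")
writeLines("hello world", tmp)
df <- read_texts(tmp)
```

The catch described above remains: the encoding names accepted here are not guaranteed to map one-to-one onto the stringi encoding names.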

I'm putting this on the back burner for now, but it's definitely something to address in the next revision. I also think we can remove the encoding argument and use readr::guess_encoding() instead. (Both are based on the same underlying stringi function.)
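For reference, readr::guess_encoding() returns a small data frame of candidate encodings with confidence scores (a sketch; the sample file is made up):

```r
library(readr)

# A temporary file with some non-ASCII content for the guesser to work with.
f <- tempfile(fileext = ".txt")
writeLines(enc2utf8("café résumé"), f, useBytes = TRUE)

g <- guess_encoding(f)
g  # columns: encoding, confidence
```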

@kbenoit kbenoit removed this from the CRAN release milestone May 16, 2017
@lmullen
Author

lmullen commented May 16, 2017

Thanks for the update, @kbenoit. I was just about to start work on this. Sounds like I should hold off for now, but happy to help out when you say the time is right. Looking forward to your first CRAN release.
