
Encoding handling not handled by stringi and possibly inconsistent #37

Open
kbenoit opened this issue Nov 19, 2016 · 3 comments

@kbenoit (Collaborator) commented Nov 19, 2016

Our README states:

(All encoding functions are handled by the stringi package.)

But this is hardly true: the conversion is actually done by base iconv(), via the encoding argument of file() in get-functions.R, not by stringi.

We should go through the code carefully to ensure that encoding handling is consistent, and correct the README so that our claims are accurate.
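
For concreteness, a minimal sketch of the two code paths being compared (the file name and source encoding are hypothetical, and the readtext internals differ in detail):

    # current approach (roughly): a base connection, whose encoding=
    # argument triggers conversion through R's iconv machinery on read
    con <- file("example.txt", encoding = "ISO-8859-1")  # hypothetical file/encoding
    txt_base <- readLines(con)
    close(con)

    # stringi alternative: read and re-encode in a single call
    txt_stri <- stringi::stri_read_lines("example.txt", encoding = "ISO-8859-1")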

kbenoit modified the milestone: CRAN release (Dec 30, 2016)
@adamobeng (Collaborator) commented:

By my reckoning,

  • get_txt, get_csv, get_json_tweets, and get_json_lines use readLines()
  • get_json_object uses jsonlite::fromJSON()
  • get_XML uses XML::xmlTreeParse() or XML::xmlToDataFrame()
  • get_html and get_docx use XML::htmlTreeParse()
  • get_pdf uses the pdftotext command-line utility (set to output UTF-8)
  • get_doc uses the antiword command-line utility

We could replace all of the readLines() calls with stri_read_lines() (although that function is labelled experimental). Presumably jsonlite and XML know how to handle their own encodings, which leaves HTML and .doc. XML::htmlTreeParse() has an encoding option, but I don't think stringi is designed to autodetect the encoding of marked-up text. I'm not sure what to do about antiword: it doesn't look like you can specify an output encoding, so its output may be platform-dependent.
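
A rough sketch of what those substitutions might look like, wrapped in small helpers (the helper names are invented; stri_read_lines(), XML::htmlTreeParse(), stri_enc_detect(), stri_flatten(), and stri_encode() are existing functions):

    library(stringi)

    # hypothetical drop-in for the readLines() calls in get_txt(), get_csv(), etc.
    read_lines_stri <- function(path, encoding = "UTF-8") {
      # stri_read_lines() is still flagged as experimental in stringi
      stri_read_lines(path, encoding = encoding)
    }

    # HTML: pass the (user-supplied) encoding straight through to the parser
    parse_html <- function(path, encoding = "UTF-8") {
      XML::htmlTreeParse(path, encoding = encoding, useInternalNodes = TRUE)
    }

    # antiword output in an unknown, possibly platform-dependent encoding:
    # guess it with stri_enc_detect() and convert to UTF-8 (heuristic, can misfire)
    normalise_doc_text <- function(x) {
      guess <- stri_enc_detect(stri_flatten(x, collapse = "\n"))[[1]]$Encoding[1]
      stri_encode(x, from = guess, to = "UTF-8")
    }

Whether autodetection is reliable enough for antiword output is an open question; falling back to a user-supplied encoding might be safer.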

adamobeng self-assigned this (Jan 2, 2017)
@adamobeng (Collaborator) commented:

I should also note that we don't currently "include functions for diagnosing encodings on a file-by-file basis", because the encoding-detection functionality in stringi is not currently exposed.
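
If we wanted to expose it, a file-by-file diagnostic could be little more than a thin wrapper around stri_enc_detect(); a sketch, with an invented function name:

    # hypothetical diagnostic: best encoding guess per file, with confidence
    encoding_report <- function(files) {
      guesses <- lapply(files, function(f) {
        stringi::stri_enc_detect(stringi::stri_read_raw(f))[[1]][1, ]
      })
      data.frame(file = files, do.call(rbind, guesses), row.names = NULL)
    }

    # e.g. encoding_report(list.files("texts/", full.names = TRUE))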

@kbenoit (Collaborator, Author) commented May 16, 2017

I'm putting this on the long list for the next release.

kbenoit removed this from the CRAN release milestone (May 16, 2017)