
Encoding handling not handled by stringi and possibly inconsistent #37

Open
kbenoit opened this issue Nov 19, 2016 · 3 comments

@kbenoit (Collaborator) commented Nov 19, 2016

Our README states:

(All encoding functions are handled by the stringi package.)

But this is hardly true: the conversion is actually done by base iconv(), via the encoding argument of file() in get-functions.R, not by stringi.

We should go through the code carefully to ensure that encoding handling is consistent, and correct the README so that our claims are accurate.
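
For concreteness, a minimal sketch of the two code paths being compared (the file name and source encoding are hypothetical, and the readtext internals differ in detail):

    # current approach (roughly): a base connection, whose encoding=
    # argument triggers conversion through R's iconv machinery on read
    con <- file("example.txt", encoding = "ISO-8859-1")  # hypothetical file/encoding
    txt_base <- readLines(con)
    close(con)

    # stringi alternative: read and re-encode in a single call
    txt_stri <- stringi::stri_read_lines("example.txt", encoding = "ISO-8859-1")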

kbenoit modified the milestone: CRAN release (Dec 30, 2016)
@adamobeng (Collaborator) commented:

By my reckoning,

  • get_txt, get_csv, get_json_tweets, and get_json_lines use readLines()
  • get_json_object uses jsonlite::fromJSON()
  • get_XML uses XML::xmlTreeParse() or XML::xmlToDataFrame()
  • get_html and get_docx use XML::htmlTreeParse()
  • get_pdf uses the pdftotext command-line utility (set to output UTF-8)
  • get_doc uses the antiword command-line utility

We could replace all of the readLines() calls with stri_read_lines() (although that function is labelled experimental). Presumably jsonlite and XML know how to handle their own encodings, which leaves HTML and .doc. XML::htmlTreeParse() has an encoding option, but I don't think stringi is designed to autodetect the encoding of marked-up text. I'm not sure what to do about antiword: it doesn't look like you can specify an output encoding, so its output may be platform-dependent.
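
A rough sketch of what those substitutions might look like, wrapped in small helpers (the helper names are invented; stri_read_lines(), XML::htmlTreeParse(), stri_enc_detect(), stri_flatten(), and stri_encode() are existing functions):

    library(stringi)

    # hypothetical drop-in for the readLines() calls in get_txt(), get_csv(), etc.
    read_lines_stri <- function(path, encoding = "UTF-8") {
      # stri_read_lines() is still flagged as experimental in stringi
      stri_read_lines(path, encoding = encoding)
    }

    # HTML: pass the (user-supplied) encoding straight through to the parser
    parse_html <- function(path, encoding = "UTF-8") {
      XML::htmlTreeParse(path, encoding = encoding, useInternalNodes = TRUE)
    }

    # antiword output in an unknown, possibly platform-dependent encoding:
    # guess it with stri_enc_detect() and convert to UTF-8 (heuristic, can misfire)
    normalise_doc_text <- function(x) {
      guess <- stri_enc_detect(stri_flatten(x, collapse = "\n"))[[1]]$Encoding[1]
      stri_encode(x, from = guess, to = "UTF-8")
    }

Whether autodetection is reliable enough for antiword output is an open question; falling back to a user-supplied encoding might be safer.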

adamobeng self-assigned this (Jan 2, 2017)
@adamobeng (Collaborator) commented:

I should also note that we don't currently "include functions for diagnosing encodings on a file-by-file basis", because the encoding-detection functionality in stringi is not currently exposed.
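
If we wanted to expose it, a file-by-file diagnostic could be little more than a thin wrapper around stri_enc_detect(); a sketch, with an invented function name:

    # hypothetical diagnostic: best encoding guess per file, with confidence
    encoding_report <- function(files) {
      guesses <- lapply(files, function(f) {
        stringi::stri_enc_detect(stringi::stri_read_raw(f))[[1]][1, ]
      })
      data.frame(file = files, do.call(rbind, guesses), row.names = NULL)
    }

    # e.g. encoding_report(list.files("texts/", full.names = TRUE))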

@kbenoit (Collaborator, Author) commented May 16, 2017

I'm putting this on the long list for the next release.

kbenoit removed this from the CRAN release milestone (May 16, 2017)