get_txt, get_csv, get_json_tweets, get_json_lines use readLines.
get_json_object uses jsonlite::fromJSON
get_XML uses XML::xmlTreeParse or XML::xmlToDataFrame
get_html, get_docx use XML::htmlTreeParse
get_pdf uses the pdftotext command line utility (set to output UTF-8)
get_doc uses the antiword command line utility
We could replace all of the readLines calls with stri_read_lines (although that function is labelled experimental). Presumably jsonlite and XML know how to deal with their encodings, which leaves html and doc. XML::htmlTreeParse has an encoding option, but I don't think stringi is designed to autodetect the encoding of marked-up text. I'm not sure what to do with antiword: it doesn't look like you can specify an output encoding, which means the result might be platform-dependent.
I should also note that we don't currently "include functions for diagnosing encodings on a file-by-file basis", because the stringi encoding detection stuff is not currently exposed.
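As a rough sketch of what a file-by-file diagnosis could look like, the helper below reads a file as raw bytes, asks stringi to guess the encoding, and converts the result to UTF-8. The name read_lines_detect and its markup argument are hypothetical and not part of the package. stri_enc_detect is heuristic, so the top guess can be wrong, especially on short files; its filter_angle_brackets argument may help with marked-up input, but that is untested here.

```r
library(stringi)

# Hypothetical helper (not part of the package): guess a file's encoding
# from its raw bytes and return its lines converted to UTF-8.
read_lines_detect <- function(path, markup = FALSE) {
  # Read the whole file as a raw vector, with no encoding conversion
  bytes <- stri_read_raw(path)

  # stri_enc_detect() returns a list with one data frame of candidate
  # encodings per input; take the first (highest-confidence) candidate
  guess <- stri_enc_detect(bytes, filter_angle_brackets = markup)[[1]]
  enc <- guess$Encoding[1]

  # Assumption: fall back to UTF-8 when detection yields nothing usable
  if (is.na(enc)) enc <- "UTF-8"

  # Convert the bytes from the guessed encoding to UTF-8, then split lines
  text <- stri_encode(bytes, from = enc, to = "UTF-8")
  stri_split_lines1(text)
}
```

Something like read_lines_detect("some_file.txt") could then stand in for a readLines call, and the guess data frame could be surfaced to users who want to inspect the candidate encodings and their confidence values themselves.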
Our README states:
But this is hardly true, since we use the base iconv() that happens through file() in get-functions.R, not stringi. We should go through carefully to ensure consistency, and also change our claims to be accurate.