Add encoding inference function #157

Open
koheiw opened this issue Jul 26, 2019 · 0 comments

koheiw commented Jul 26, 2019

The EU manifesto example is incorrect: the Hungarian text, for example, is not in ISO-8859-1.
https://readtext.quanteda.io/articles/readtext_vignette.html#reading-one-or-more-text-files

However, specifying the encoding manually is tedious. Why not do something like this? stri_enc_detect() makes good guesses.

path_data <- system.file("extdata/", package = "readtext")

for (f in list.files(paste0(path_data, "/txt/EU_manifestos/"), full.names = TRUE)) {
  print(f)
  enc <- stringi::stri_enc_detect(readBin(file(f, 'rb'), character()))
  print(enc[[1]][1:2,])
}
[1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_de_PSE.txt"
    Encoding Language Confidence
1 ISO-8859-1       de       0.80
2 ISO-8859-9       tr       0.24
[1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_de_V.txt"
    Encoding Language Confidence
1 ISO-8859-1       de       0.83
2 ISO-8859-9       tr       0.26
[1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_en_PSE.txt"
    Encoding Language Confidence
1 ISO-8859-1       en       0.75
2 ISO-8859-2       ro       0.21
[1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_en_V.txt"
    Encoding Language Confidence
1 ISO-8859-1       en       0.75
2 ISO-8859-2       ro       0.21
[1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_es_PSE.txt"
    Encoding Language Confidence
1 ISO-8859-1       es       0.91
2 ISO-8859-2       ro       0.35
[1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_es_V.txt"
    Encoding Language Confidence
1 ISO-8859-1       es       0.88
2 ISO-8859-2       ro       0.36
[1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_fi_V.txt"
    Encoding Language Confidence
1 ISO-8859-1       sv       0.20
2 ISO-8859-9       tr       0.17
[1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_fr_PSE.txt"
    Encoding Language Confidence
1 ISO-8859-1       fr       0.94
2 ISO-8859-2       ro       0.35
[1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_fr_V.txt"
    Encoding Language Confidence
1 ISO-8859-1       fr       0.92
2 ISO-8859-2       ro       0.37
[1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_gr_V.txt"
    Encoding Language Confidence
1 ISO-8859-7       el       0.74
2   UTF-16BE                0.10
[1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_hu_V.txt"
    Encoding Language Confidence
1 ISO-8859-2       hu       0.53
2 ISO-8859-1       en       0.16
[1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_it_PSE.txt"
    Encoding Language Confidence
1 ISO-8859-1       it       0.83
2 ISO-8859-2       ro       0.43
[1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_lv_V.txt"
Error in enc[[1]] : subscript out of bounds
In addition: There were 13 warnings (use warnings() to see them)
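
One possible shape for such an inference function is sketched below. It is only an illustration, not part of readtext: the name `infer_encoding` and the `min_confidence` threshold are hypothetical. The sketch reads each file as raw bytes (so no connection is left open), passes them to `stringi::stri_enc_detect()`, and returns `NA` instead of failing when the detector offers no usable guess, which is the situation the Latvian file appears to trigger above.

```r
# Hypothetical helper (not part of readtext): guess the encoding of one file.
# `min_confidence` is an assumed cut-off, not a value taken from stringi.
infer_encoding <- function(path, min_confidence = 0.2) {
  # Read the whole file as raw bytes; readBin() opens and closes the file
  # itself when given a file name.
  bytes <- readBin(path, what = "raw", n = file.size(path))

  # stri_enc_detect() returns a list with one element per input.
  res <- stringi::stri_enc_detect(bytes)
  if (length(res) == 0) {
    return(NA_character_)
  }
  guess <- res[[1]]

  # Fall back to NA when there is no candidate or only a weak one.
  if (is.null(guess) || length(guess$Encoding) == 0 ||
      is.na(guess$Confidence[1]) || guess$Confidence[1] < min_confidence) {
    return(NA_character_)
  }
  guess$Encoding[1]
}
```

Applied to the files above, for example with `vapply(files, infer_encoding, character(1))`, this would give one guessed encoding per file, which could then be passed to the `encoding` argument of `readtext()`. Files that come back as `NA` would still need the encoding specified manually.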
