readtext issue: ligature artifacts remain, even after specifying UTF-8 encoding. #146

Open
MSU2580 opened this issue Feb 9, 2019 · 9 comments

Comments

@MSU2580

MSU2580 commented Feb 9, 2019

After reading in a corpus of scientific work in the following manner, we eventually get an object back out (FinalSciCorp) which contains all unique words within the corpus, along with some other information:

tempsci <- readtext("*.txt", encoding = "UTF-8")
sciCorp <- corpus(tempsci)
doc_term_matrix <- dfm(sciCorp, remove = stopwords("english"), remove_punct = TRUE, remove_numbers = TRUE, stem = FALSE)
FinalSciCorp <- textstat_frequency(doc_term_matrix)

However, FinalSciCorp still contains words with ligature characters such as "ﬀ" (U+FB00) and "ﬁ" (U+FB01), among others. For example, FinalSciCorp contains both "field" and "<U+FB01>eld", and in another case only "signi<U+FB01>cant". The 'encoding' and 'stri_enc_detect' functions both indicate that the files are likely "UTF-8", although we have also tried many other encoding options, including "latin1".
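
To check which entries are affected, a minimal sketch along these lines may help. It assumes the stringi package is installed; the `words` vector is made-up sample data standing in for the feature column of FinalSciCorp, and the pattern covers only the Latin ligature code points U+FB00 to U+FB06.

```r
library(stringi)

# Hypothetical sample words standing in for the features in FinalSciCorp.
# "\uFB01" is the "fi" ligature and "\uFB00" is the "ff" ligature.
words <- c("field", "\uFB01eld", "signi\uFB01cant", "e\uFB00ect")

# U+FB00..U+FB06 are the Latin ligatures in the Alphabetic
# Presentation Forms block; flag every word containing one of them.
has_ligature <- stri_detect_regex(words, "[\\uFB00-\\uFB06]")

# Keep only the words that still contain a ligature character.
words[has_ligature]
```

Words flagged this way are valid UTF-8, so changing the `encoding` argument would not be expected to alter them.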

@kbenoit
Collaborator

kbenoit commented Feb 10, 2019

Can you send me a link to one of the documents containing a ligature, as well as your sessionInfo() output, so I can test it?

@MSU2580
Author

MSU2580 commented Feb 11, 2019 via email

@kbenoit
Collaborator

kbenoit commented Feb 11, 2019

Thanks - that will of course read the ligatures because they are part of your text file. readtext does not convert ligatures if the .txt file contains them.

But if you are converting from a pdf file, then we can help. My own tests showed that ligatures are correctly processed. Did you read the text above using readtext() where the source file was a pdf?

@MSU2580
Author

MSU2580 commented Feb 11, 2019 via email

@kbenoit
Collaborator

kbenoit commented Feb 11, 2019

One option for you is "Unicode normalization", which will convert the ligatures to their component letters. (But I am not sure why there is a space after each ligature in your text above; this would need to be dealt with separately, if it is actually part of your file.)

txt <- "substantial investments in human and ﬁnancial 
resources for the eﬀect of earlier diagnosis of waﬄes."

cat(stringi::stri_trans_nfkc(txt))
## substantial investments in human and financial 
## resources for the effect of earlier diagnosis of waffles.
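
As a small sketch of what this normalization does and does not change (assuming only that stringi is installed; the sample strings are made up for illustration): NFKC replaces compatibility characters such as ligatures with their plain equivalents, whereas the canonical form NFC leaves them alone. NFKC also rewrites other compatibility characters, for example a superscript two, which may matter for scientific text.

```r
library(stringi)

# Made-up sample strings written with escapes so the code points are explicit:
# U+FB01 = "fi" ligature, U+FB04 = "ffl" ligature, U+00B2 = superscript two.
x <- c("\uFB01eld", "signi\uFB01cant", "wa\uFB04es", "m\u00B2")

# Compatibility normalization (NFKC) expands ligatures to plain letters,
# and also turns the superscript two into an ordinary "2".
nfkc <- stri_trans_nfkc(x)
nfkc
## [1] "field"       "significant" "waffles"     "m2"

# Canonical normalization (NFC) keeps the ligature character as it is.
nfc <- stri_trans_nfc("\uFB01eld")
identical(nfc, "\uFB01eld")
## [1] TRUE
```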

@MSU2580
Author

MSU2580 commented Feb 25, 2019 via email

@kbenoit
Collaborator

kbenoit commented Feb 25, 2019

Thanks but please either upload them by dragging into the GitHub browser, or send them to me by email. (They did not show up above.)

@MSU2580
Author

MSU2580 commented Feb 25, 2019

hep-th9910196.pdf

@kbenoit
Collaborator

kbenoit commented Feb 25, 2019

Thanks. That definitely contains the ligatures, and readtext::readtext() definitely does not normalize them. We can think about adding an option to readtext to do this automatically, but in the meantime, you can solve this "manually" using stringi:

library("quanteda", warn.conflicts = FALSE)
## Package version: 1.4.0
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
rtxt <- readtext::readtext("~/Downloads/hep-th9910196.pdf")

texts(rtxt) %>%
  kwic(c("in*nity", "*erential", "coe*cients", "*uctuation", "cuto*"), 1)
##                                                        
##  [hep-th9910196.pdf, 1523] the |  infinity   | so      
##  [hep-th9910196.pdf, 1614] the | ﬂuctuation  | (       
##  [hep-th9910196.pdf, 1812] the | diﬀerential | equation
##  [hep-th9910196.pdf, 1826] the | coeﬃcients  | of      
##  [hep-th9910196.pdf, 1987] the |   inﬁnity   | ,       
##  [hep-th9910196.pdf, 2928] the |   inﬁnity   | r       
##  [hep-th9910196.pdf, 3070] the |   inﬁnity   | ,       
##  [hep-th9910196.pdf, 3760] are |    cutoﬀ    | for

texts(rtxt) %>%
  stringi::stri_trans_nfkc() %>%
  kwic(c("in*nity", "*erential", "coe*cients", "*uctuation", "cuto*"), 1)
##                                             
##  [text1, 1523] the |   infinity   | so      
##  [text1, 1614] the | fluctuation  | (       
##  [text1, 1812] the | differential | equation
##  [text1, 1826] the | coefficients | of      
##  [text1, 1987] the |   infinity   | ,       
##  [text1, 2928] the |   infinity   | r       
##  [text1, 3070] the |   infinity   | ,       
##  [text1, 3760] are |    cutoff    | for
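
For the workflow in the original question, one possible way to apply the same normalization is to overwrite the text column before building the corpus. This is only a sketch under stated assumptions: the data.frame below is a hand-made stand-in for what readtext() returns (a `doc_id` column and a `text` column), and its contents are invented. Because the column is changed in place, the document identifiers stay attached to the texts.

```r
library(stringi)

# Hand-made stand-in for the object returned by readtext():
# one row per document, with doc_id and text columns.
tempsci <- data.frame(
  doc_id = c("a.txt", "b.txt"),
  text = c("a signi\uFB01cant e\uFB00ect", "the \uFB01eld equations"),
  stringsAsFactors = FALSE
)

# Normalize the text column in place so ligatures become plain letters
# while doc_id is left untouched.
tempsci$text <- stri_trans_nfkc(tempsci$text)

tempsci$text
## [1] "a significant effect" "the field equations"

# From here the original pipeline would continue unchanged, e.g.
# sciCorp <- quanteda::corpus(tempsci)
```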
