Added to the issue: a script contributed by Arthur Stenzel (thanks Arthur!).
# TIKA Script
# Andreas Niekler <aniekler [at] informatik.uni-leipzig.de>
# Gregor Wiedemann <gregor.wiedemann [at] uni-leipzig.de>
# ===========

# Define function to extract text with Tika
# Tika Java Archive has to be copied to working directory, current version: tika-app-1.15.jar
tikaExtractTextFromFile <- function(file, sourceFolder, targetFolder) {
  # shQuote() protects input paths that contain spaces or shell metacharacters
  command <- paste0("java -jar tika-app-1.15.jar --text ", shQuote(paste0(sourceFolder, file)))
  output <- system(command, intern = TRUE) # execute Tika via shell
  output <- iconv(output, to = "UTF-8")
  fileConn <- file(paste0(targetFolder, file, ".txt"), encoding = "UTF-8")
  writeLines(output, fileConn)
  close(fileConn)
}

# define input folder
sourceFolder <- "./data_X/"
myFiles <- list.files(
  path = sourceFolder,
  pattern = NULL, # use pattern argument for specific file types, e.g. PDF: "pdf$"
  full.names = FALSE,
  recursive = FALSE,
  include.dirs = FALSE
)

# define output folder
targetFolder <- "./data_X_txt/"

# iterate over files in input folder, extract text and save to output folder
for (filename in myFiles) {
  cat("Extracting from ", filename, "...\n")
  tikaExtractTextFromFile(filename, sourceFolder = sourceFolder, targetFolder = targetFolder)
}
From quanteda issue #380:
Thanks @BobMuenchen.