Add support for Apache Tika types #43

Open
kbenoit opened this issue Dec 29, 2016 · 1 comment

kbenoit commented Dec 29, 2016

From quanteda issue #380:

Apache Tika (https://tika.apache.org/) might be useful.
The KNIME folks just added that to their text mining nodes.

Thanks @BobMuenchen.

kbenoit changed the title from "Add support for Apache Tike types." to "Add support for Apache Tike types" on Dec 29, 2016
kbenoit changed the title from "Add support for Apache Tike types" to "Add support for Apache Tika types" on Jan 2, 2017

kbenoit commented Jun 13, 2017

Added to the issue: a script contributed by Arthur Stenzel (thanks Arthur!).

# TIKA Script
# Andreas Niekler <aniekler [at] informatik.uni-leipzig.de>
# Gregor Wiedemann <gregor.wiedemann [at] uni-leipzig.de>
# ===========

# Define function to extract text with Tika
# The Tika app JAR must be in the working directory (version used here: tika-app-1.15.jar)
tikaExtractTextFromFile <- function(file, sourceFolder, targetFolder){
  
  # quote the input path so file names containing spaces survive the shell
  command <- paste("java -jar tika-app-1.15.jar --text", shQuote(paste0(sourceFolder, file)))
  output <- system(command, intern = TRUE) # execute Tika via shell
  output <- iconv(output, to = "UTF-8")
  # write the extracted text to <targetFolder><file>.txt as UTF-8
  fileConn <- file(paste0(targetFolder, file, ".txt"), open = "w", encoding = "UTF-8")
  writeLines(output, fileConn)
  close(fileConn)
  
}

# define input folder
sourceFolder <- "./data_X/"
myFiles <- list.files(path = sourceFolder, pattern = NULL, # use pattern argument for specific file types, e.g. PDF: "pdf$"
                      full.names = FALSE, recursive = FALSE,
                      include.dirs = FALSE)
# define output folder (created if it does not exist yet)
targetFolder <- "./data_X_txt/"
dir.create(targetFolder, showWarnings = FALSE)

# iterate over files in input folder, extract text and save to output folder
for (filename in myFiles) {
  cat("Extracting from", filename, "...\n")
  tikaExtractTextFromFile(filename, sourceFolder = sourceFolder, targetFolder = targetFolder)
}
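
One way the path handling in the script above could be made more robust is sketched below. The helper names `buildTikaCommand` and `buildTargetPath` are hypothetical and not part of the contributed script; the sketch only assembles the command string and the output path, so it can be tried without Java or the Tika JAR being present.

```r
# Hypothetical helper (not part of the original script): build the shell
# command for Tika's --text extraction, quoting the JAR and input paths
# so that names containing spaces are passed to the shell intact.
buildTikaCommand <- function(file, sourceFolder, jar = "tika-app-1.15.jar") {
  inputPath <- file.path(sourceFolder, file)
  paste("java -jar", shQuote(jar), "--text", shQuote(inputPath))
}

# Hypothetical helper: derive the path of the .txt file that will hold
# the extracted text for a given input file.
buildTargetPath <- function(file, targetFolder) {
  file.path(targetFolder, paste0(file, ".txt"))
}

# Example with an assumed file name that contains a space
cmd <- buildTikaCommand("annual report.pdf", "data_X")
targetPath <- buildTargetPath("annual report.pdf", "data_X_txt")
print(cmd)
print(targetPath)
```

The resulting string could then be passed to `system(cmd, intern = TRUE)` as in the original function, ideally after checking `file.exists("tika-app-1.15.jar")` so that a missing JAR fails with a clear message rather than an empty output file.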
