Tika Page Extractor

This application sets up a server that extracts content from PDFs and returns it as a JSON string, one entry per page. The file's metadata is returned as well.

-> Download zipped Binary

Start server using:

java -jar TikaPageExtractor.jar

Adding -h to the command above prints the available command-line options, e.g. the server's port and IP address, or options to disable metadata extraction.

By default, the server listens on port 9090.

You can PUT the contents of a PDF file to the server, similar to Solr. Using curl, for example:

curl -X PUT -T file.pdf http://localhost:9090/

This uploads file.pdf to the server, which responds with JSON containing the document metadata and an array of strings, one entry per page.
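The same upload can also be done from Java. The following is a minimal sketch, not part of this project: it assumes Java 11 or newer (for java.net.http) and a server running on the default address used above. The class name TikaPageClient and the helper buildRequest are hypothetical names chosen for illustration. Since the exact JSON layout is not described here, the sketch prints the raw response body instead of parsing it.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

// Hypothetical client class for illustration; not shipped with the server.
class TikaPageClient {

    // Builds a PUT request whose body is the raw content of the given PDF file.
    // Throws FileNotFoundException (an IOException) if the file does not exist.
    static HttpRequest buildRequest(String serverUrl, Path pdf) throws IOException {
        return HttpRequest.newBuilder(URI.create(serverUrl))
                .PUT(HttpRequest.BodyPublishers.ofFile(pdf))
                .build();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // Defaults mirror the curl example: file.pdf and port 9090 on localhost.
        Path pdf = Path.of(args.length > 0 ? args[0] : "file.pdf");
        String serverUrl = args.length > 1 ? args[1] : "http://localhost:9090/";

        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(buildRequest(serverUrl, pdf), HttpResponse.BodyHandlers.ofString());

        // Print the HTTP status and the raw JSON body returned by the server.
        System.out.println("HTTP status: " + response.statusCode());
        System.out.println(response.body());
    }
}
```

With the server running, this can be started directly from the source file, e.g. java TikaPageClient.java file.pdf http://localhost:9090/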

The server was inspired by http://vteams.com/blog/apache-tika-per-page-content-extraction/. Thanks!

