This repository has been archived by the owner on Feb 1, 2022. It is now read-only.
Tika per page PDF extractor server returning content as JSON.

mkalus/tika-page-extractor

Tika Page Extractor

This application sets up a server that extracts content from PDFs and returns it as a JSON string, one entry per page. The file's metadata is returned as well.

-> Download zipped Binary

Start server using:

java -jar TikaPageExtractor.jar

Appending -h to the command above prints the available command-line options, such as the server's port and IP address, or options to exclude metadata extraction.

By default, the server listens on port 9090.

You can PUT the contents of a PDF file to the server, similar to Solr. For example, using curl:

curl -X PUT -T file.pdf http://localhost:9090/

This will upload file.pdf to the server and return its content as JSON. The server will return document metadata and an array of strings, one entry per page.
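The same upload can also be done from a script. The following is a minimal Python sketch, assuming the server runs on the default port on localhost. The function names `build_put_request` and `extract` are made up for this illustration, and the `Content-Type` header is an assumption (the curl command above does not set one). The key names in the returned JSON are not documented here, so the sketch only decodes the reply and does not pick out individual fields.

```python
import json
import urllib.request
from pathlib import Path


def build_put_request(pdf_path, url="http://localhost:9090/"):
    """Build a PUT request whose body is the raw bytes of the PDF file."""
    data = Path(pdf_path).read_bytes()
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        # Assumption: the server may well ignore this header.
        headers={"Content-Type": "application/pdf"},
    )


def extract(pdf_path, url="http://localhost:9090/"):
    """Send the PDF to a running server and return the decoded JSON reply."""
    request = build_put_request(pdf_path, url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `extract("file.pdf")` should return the decoded reply containing the metadata and the per-page strings. Inspect the result once before relying on particular key names.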

The server was inspired by http://vteams.com/blog/apache-tika-per-page-content-extraction/. Thanks!
