Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats
Updated May 28, 2024 - PHP
A self-hosted search engine for documents.
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
Case study using dotfurther's Open Discover Platform with the RavenDB document store to rapidly build a demonstration application capable of full-text search, eDiscovery, and information governance.
A very simple news crawler with a funny name
Get text content from any file
Extract embedded metadata from HTML markup
Translate visual novels in real time
Module for automatic summarization of text documents and HTML pages.
Notebooks and tools developed as part of a thesis to automate the extraction, processing, and analysis of data from the MICCAI 2023 conference, supporting a systematic review and providing a structured foundation for further research.
A TYPO3 CMS extension that provides Apache Tika functionality
OCR with Tesseract and OpenCV: Extract text from images effortlessly. Preprocess with OpenCV for accuracy. Display results and save output. Easy integration for document digitization and data entry automation.
Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative)
Heuristic-based boilerplate removal tool
Golang PDF library for creating and processing PDF files (pure Go)
Dataiku DSS plugin to perform optical character recognition (OCR) using the Tesseract engine.
Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.