
ocrd_butler

Processing tasks in the ecosystem of the OCR-D project.

Features

REST API to define workflows of OCR-D processors and to run tasks with them in the OCR-D ecosystem.

Hints

  • Free software: MIT License
  • This software is still in an alpha state, so don't expect it to work properly. Support is currently not guaranteed.

Development installation

We rely on the excellent ocrd_all repository for installing the OCR-D processors. Please check it out before proceeding.

Installation is currently tested on Debian 10 and Ubuntu 18.04. Be aware that on more recent systems with Python >= 3.8 there is currently a problem installing tensorflow==1.15.x, so you have to use Python 3.7 at most.
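
A quick sanity check for the interpreter that will be used (Debian 10 ships Python 3.7 by default):

user@server:/ > python3 --version   # should report 3.7.x or lower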

Installation for development:

Install the Redis server (needed as broker/backend for Celery and Flower):

user@server:/ > sudo apt install redis
user@server:/ > sudo service redis start
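
To verify that Redis is actually up, it can be pinged with the redis-cli tool shipped with the package:

user@server:/ > redis-cli ping
PONG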

Follow the installation instructions for ocrd_all:

/home/ocrd > git clone --recurse-submodules https://github.com/OCR-D/ocrd_all.git && cd ocrd_all
/home/ocrd/ocrd_all > make all
... -> download appropriate modules...

Install the German language files for Tesseract OCR:

user@server:/ > sudo apt install tesseract-ocr-deu

Install ocrd-butler in the virtual environment created by ocrd_all:

/home/ocrd > git clone https://github.com/StaatsbibliothekBerlin/ocrd_butler.git && cd ocrd_butler
/home/ocrd/ocrd_butler > source ../ocrd_all/venv/bin/activate
(venv) /home/ocrd/ocrd_butler > pip install -e .[dev]

Some modules in ocrd_all need further files, e.g. trained models for the OCR itself. The default folders on the server can be overridden in every single task.

  • sbb_textline_detector (i.e. make textline-detector-model):
> mkdir -p /data && cd /data; \
> ocrd resmgr download ocrd-sbb-textline-detector default -al cwd
  • ocrd_calamari (i.e. make calamari-model):
> mkdir -p /data && cd /data; \
> ocrd resmgr download ocrd-calamari-recognize qurator-gt4histocr-1.0 -al cwd
  • ocrd_tesserocr (i.e. make tesseract-model):
> mkdir -p /data/tesseract_models && cd /data/tesseract_models
> wget https://qurator-data.de/tesseract-models/GT4HistOCR/models.tar
> tar xf models.tar
> cp GT4HistOCR_2000000.traineddata /usr/share/tesseract-ocr/4.00/tessdata/
  • ocrd-sbb-binarize (i.e. make sbb-binarize-model):
> mkdir -p /data && cd /data; \
> ocrd resmgr download ocrd-sbb-binarize default -al cwd

Start the Celery worker (i.e. make run-celery):

╰─$ TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata celery worker -A ocrd_butler.celery_worker.celery -E -l info

Start the Flower monitor (i.e. make run-flower):

╰─$ flower --broker redis://localhost:6379 --persistent=True --db=flower [--log=debug --url_prefix=flower]

Flower monitor: http://localhost:5555

Run the app (i.e. make run-flask):

╰─$ FLASK_APP=ocrd_butler/app.py flask run

Flask frontend: http://localhost:5000
Swagger interface: http://localhost:5000/api

Run the tests:

╰─$ make test

Usage

For API documentation, open the Swagger API user interface at /api/. A complete list of all routes mapped by the OCRD Butler application is available under the /api/_util/routes endpoint.
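
For example, using HTTPie (as in all examples below) against a locally running instance, the route list can be fetched like this:

╰─$ http :/api/_util/routes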

Creating a workflow

A Butler workflow consists of a name and one or more OCRD processor invocations. Use the /api/workflows POST endpoint to create a new workflow (all examples given using HTTPie):

╰─$ http POST :/api/workflows < workflow.json

...where the content of workflow.json looks something like this:

{
  "name": "binarize && segment to regions",
  "processors": [
    {
      "name": "ocrd-olena-binarize",
      "input_file_grp": "DEFAULT",
      "output_file_grp": "OCR-D-IMG-BIN"
    },
    {
      "name": "ocrd-tesserocr-segment-region",
      "input_file_grp": "OCR-D-IMG-BIN",
      "output_file_grp": "OCR-D-SEG-REGION"
    }
  ]
}
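
Many processors also need runtime parameters, for example the model paths set up in the development installation above. The exact schema is documented in the Swagger UI at /api/; as a sketch, assuming each processor entry accepts a parameters object (please verify this key against the API documentation), a Tesseract recognition step appended to the list above could look like this (the model parameter itself belongs to ocrd-tesserocr-recognize and refers to the traineddata file installed earlier):

    {
      "name": "ocrd-tesserocr-recognize",
      "input_file_grp": "OCR-D-SEG-REGION",
      "output_file_grp": "OCR-D-OCR-TESS",
      "parameters": {
        "model": "GT4HistOCR_2000000"
      }
    }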

The response body will contain the ID of the newly created workflow. Use this ID to retrieve it:

╰─$ http :/api/workflows/1  # or whatever ID obtained in previous step

Creating a task

A Butler task is an invocation of a workflow with a specific METS file as its input. A task consists of at least the location of such a METS source file and a workflow ID. Use the /api/tasks POST endpoint to create a new task using an existing workflow:

╰─$ http POST :/api/tasks src=https://content.staatsbibliothek-berlin.de/dc/PPN718448162.mets.xml workflow_id=1

The response body will contain the ID of the newly created task.
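
As with workflows, the created task can presumably be retrieved by its ID (if in doubt, the route list at /api/_util/routes tells for sure):

╰─$ http :/api/tasks/1  # or whatever ID obtained in previous step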

Running a task

In order to execute an existing Butler task, call the /api/tasks/{id}/run endpoint, with the placeholder replaced by the actual task ID obtained in the previous step:

╰─$ http POST :/api/tasks/1/run
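
The run is executed asynchronously by the Celery worker, so the HTTP call returns before the processors have finished; progress can be watched in the Flower monitor at http://localhost:5555. Assuming the API also exposes a status route per task (again, /api/_util/routes lists the routes that actually exist), it would be queried like this:

╰─$ http :/api/tasks/1/status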

Known problems

ModuleNotFoundError: No module named 'tensorflow.contrib'

This typically means a TensorFlow 2.x build ended up in the virtual environment (tensorflow.contrib was removed in 2.x). Reinstall the 1.15 build inside the venv:

. venv/bin/activate
pip install --upgrade pip
pip uninstall tensorflow
pip install tensorflow-gpu==1.15.*

TODOs

  • input and output file groups are not always taken from the previous processor
    • more complicated input/output group scenarios
    • check the information we get from ocrd-tool.json
  • dinglehopper:

    - If Ground Truth data exists, it could be placed in a configured folder on the server, with the data as PAGE XML files inside a folder named after the work ID. Then we show a button to start a run against this data. Otherwise we can search for all other tasks with the same work_id and present a UI to run against the chosen one.

  • Use processor groups so that forms can be built and presented from them.
  • Check whether ocrd-olena-binarize fails when the METS file in a workspace is named something other than mets.xml.
  • Refactor the ocrd_tool information collection to use https://ocr-d.de/en/spec/cli#-j---dump-json

This package was created with Cookiecutter and the elgertam/cookiecutter-pipenv project template, based on audreyr/cookiecutter-pypackage.

About

A butler is a domestic worker in a large household. The butler, as the senior servant, has the highest servant status. He can also sometimes function as a chauffeur.
