Implement new catalog indexer-worker #4147
Labels: 💻 aspect: code · ✨ goal: improvement · 🟨 priority: medium · 🧱 stack: catalog
Problem
This issue tracks the creation of a new catalog-indexer-worker Docker image. It does not include adding any orchestration steps to the DAG, or any infrastructure work to actually create ASGs.
Description
First we will create a new indexer_worker directory under catalog/dags/data_refresh, which will contain the code for the new indexer worker. This implementation already exists in the ingestion server; the relevant pieces can be pulled out and refactored slightly to fit the new, much smaller image. Broadly, this is the mapping of existing files to the new files needed:
- api.py — defines the API for the worker, refactored from the existing indexer_worker.py. It must be extended to add task state and a task_status endpoint, which takes a task_id and returns the status and progress of the given task (see the sketch after this list).
- indexer.py — contains the logic for the actual indexing task, refactored from the existing indexer.py; specifically, all we need is the replicate function.
- elasticsearch_models.py — pulled from the file of the same name in the ingestion server; defines the mapping from a database record to an Elasticsearch document.

The Dockerfile can be copied from the existing ingestion server. It should be updated to reference the new file structure and to expose only a single port, distinct from the ports currently used by the ingestion server (8001 and 8002). Other necessary files, including env.docker, .dockerignore, Pipfile, and gunicorn.conf.py, can all be copied in from the existing ingestion server as well.
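As a rough illustration of the task state and task_status requirements, here is a minimal sketch of what the refactored api.py could look like. It assumes Flask and an in-memory task registry purely for brevity; the framework, route names, payload fields, and threading approach are assumptions, and the real worker should mirror the structure of the existing indexer_worker.py.

```python
# Hypothetical sketch only: Flask, route names, and payload fields are
# assumptions for illustration, not the final design.
import threading
import uuid

from flask import Flask, jsonify, request

app = Flask(__name__)

# In-memory task state, keyed by task_id.
tasks: dict[str, dict] = {}


def run_indexing(task_id: str, model: str, target_index: str) -> None:
    """Placeholder for the refactored `replicate` logic from indexer.py."""
    tasks[task_id]["status"] = "running"
    # ... copy records for `model` from the database into `target_index`,
    # updating tasks[task_id]["progress"] as batches complete ...
    tasks[task_id]["progress"] = 100
    tasks[task_id]["status"] = "success"


@app.route("/task", methods=["POST"])
def create_task():
    """Kick off an indexing task and return its task_id immediately."""
    payload = request.get_json()
    task_id = str(uuid.uuid4())
    tasks[task_id] = {"status": "pending", "progress": 0}
    threading.Thread(
        target=run_indexing,
        args=(task_id, payload["model"], payload["target_index"]),
        daemon=True,
    ).start()
    return jsonify({"task_id": task_id}), 202


@app.route("/task/<task_id>", methods=["GET"])
def task_status(task_id: str):
    """Return the status and progress of the given task."""
    task = tasks.get(task_id)
    if task is None:
        return jsonify({"error": "unknown task_id"}), 404
    return jsonify(task)
```

The important piece is that starting a task returns a task_id, and the task_status endpoint can later be polled with that id by the catalog.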
Finally we will update the monorepo’s root docker-compose.yml to add a new catalog-indexer-worker service. Its build context should point to the nested data_refresh/indexer_worker directory, and it should map the exposed port so that the API can be reached by the catalog; a rough sketch follows.
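A minimal sketch of the compose addition, assuming the worker ends up exposed on port 8003 (any port other than the ingestion server's 8001/8002 would do); the service name and context path follow the description above but are not final:

```yaml
# Sketch only: the port and exact context path are assumptions.
services:
  catalog-indexer-worker:
    build: catalog/dags/data_refresh/indexer_worker
    ports:
      - "8003:8003"
```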
When this work is complete, it should be possible to run
just catalog/shell
and curl the new indexer worker. The existing ingestion-server and indexer-worker services should be unaffected (it must still be possible to run legacy data refreshes locally and in production).
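For example, something like `curl http://catalog-indexer-worker:8003/task/<task_id>` from within the catalog container should return the task's status and progress; the hostname, port, and route here are the hypothetical values from the sketches above, not the final API.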
Additional context
See this section of the implementation plan (IP).