Implement new catalog indexer-worker #4147
Labels: 💻 aspect: code · ✨ goal: improvement · 🟨 priority: medium · 🧱 stack: catalog
Problem
This issue tracks the creation of a new catalog-indexer-worker Docker image. It does not include adding any orchestration steps to the DAG, or any infrastructure work to actually create ASGs.
Description
First we will create a new indexer_worker directory under catalog/dags/data_refresh, which will contain the code for the new indexer worker. This implementation already exists in the ingestion server; the relevant pieces can be pulled out and refactored slightly to fit the new, much smaller image. Broadly, this is the mapping of existing files to the new files needed:
- api.py — defines the API for the worker, refactored from the existing indexer_worker.py. It must be extended to add task state and a task_status endpoint, which takes a task_id and returns the status and progress of the given task (see the sketch after this list).
- indexer.py — contains the logic for the actual indexing task, refactored from the existing indexer.py; specifically, all we need is the replicate function.
- elasticsearch_models.py — pulled from the file of the same name in the ingestion server; defines the mapping from a database record to an Elasticsearch document.

The Dockerfile can be copied from the existing ingestion server. It should be updated to reference the new file structure and to expose only a single port, distinct from the ports currently used by the ingestion server (8001 and 8002). Other necessary files, including env.docker, .dockerignore, Pipfile, and gunicorn.conf.py, can all be copied in from the existing ingestion server as well.
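As a rough illustration of the task state and task_status requirements, here is a minimal sketch of what the refactored api.py could look like. It assumes Flask and an in-memory task registry purely for brevity; the framework, route names, payload fields, and threading approach are assumptions, and the real worker should mirror the structure of the existing indexer_worker.py.

```python
# Hypothetical sketch only: Flask, route names, and payload fields are
# assumptions for illustration, not the final design.
import threading
import uuid

from flask import Flask, jsonify, request

app = Flask(__name__)

# In-memory task state, keyed by task_id.
tasks: dict[str, dict] = {}


def run_indexing(task_id: str, model: str, target_index: str) -> None:
    """Placeholder for the refactored `replicate` logic from indexer.py."""
    tasks[task_id]["status"] = "running"
    # ... copy records for `model` from the database into `target_index`,
    # updating tasks[task_id]["progress"] as batches complete ...
    tasks[task_id]["progress"] = 100
    tasks[task_id]["status"] = "success"


@app.route("/task", methods=["POST"])
def create_task():
    """Kick off an indexing task and return its task_id immediately."""
    payload = request.get_json()
    task_id = str(uuid.uuid4())
    tasks[task_id] = {"status": "pending", "progress": 0}
    threading.Thread(
        target=run_indexing,
        args=(task_id, payload["model"], payload["target_index"]),
        daemon=True,
    ).start()
    return jsonify({"task_id": task_id}), 202


@app.route("/task/<task_id>", methods=["GET"])
def task_status(task_id: str):
    """Return the status and progress of the given task."""
    task = tasks.get(task_id)
    if task is None:
        return jsonify({"error": "unknown task_id"}), 404
    return jsonify(task)
```

The important piece is that starting a task returns a task_id, and the task_status endpoint can later be polled with that id by the catalog.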
Finally we will update the monorepo’s root docker-compose.yml to add a new catalog-indexer-worker service. Its build context should point to the nested data_refresh/indexer_worker directory, and it should map the exposed port so that the API can be reached by the catalog; a rough sketch follows.
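A minimal sketch of the compose addition, assuming the worker ends up exposed on port 8003 (any port other than the ingestion server's 8001/8002 would do); the service name and context path follow the description above but are not final:

```yaml
# Sketch only: the port and exact context path are assumptions.
services:
  catalog-indexer-worker:
    build: catalog/dags/data_refresh/indexer_worker
    ports:
      - "8003:8003"
```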
When this work is complete, it should be possible to run
just catalog/shell
and curl the new indexer worker. The existing ingestion-server and indexer-worker services should be unaffected (it must still be possible to run legacy data refreshes locally and in production).
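For example, something like `curl http://catalog-indexer-worker:8003/task/<task_id>` from within the catalog container should return the task's status and progress; the hostname, port, and route here are the hypothetical values from the sketches above, not the final API.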
Additional context
See this section of the implementation plan (IP).