In my current team, we have encountered difficulties gracefully handling WorkerLost errors (especially OOMs) while avoiding infinite requeues. What I would like to see in Celery is the capability to requeue a task after a WorkerLost error only a limited number of times.
Currently, Celery provides us with the `task_reject_on_worker_lost` setting, which always calls `reject(requeue=True)` under the hood. This means it delegates task retrying directly to the broker. However, most (if not all; I've checked RabbitMQ, Redis, and SQS) brokers do not support a retry-counting mechanism when the `requeue=True` option is used. Therefore, without a workaround, Celery will indefinitely requeue tasks that keep causing WorkerLost errors.
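For reference, here is a minimal sketch of the configuration in question (the comments describe the behavior as I understand it from the v5.3.x docs):

```python
# celeryconfig.py -- the current WorkerLost-related options.

# Only meaningful together with late acks: the task is acknowledged
# after execution, so the broker still holds it when the worker dies.
task_acks_late = True

# On WorkerLost (e.g. the child process is OOM-killed), the parent
# process calls reject(requeue=True), so the broker redelivers the
# task immediately -- with no delivery-count limit on any broker
# I have checked.
task_reject_on_worker_lost = True
```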
One way to work around this problem is to not use `task_reject_on_worker_lost` and instead implement a custom retry mechanism based on RabbitMQ DLX (dead-letter exchanges). However, this solution requires the `task_acks_on_failure_or_timeout=False` setting, which significantly disrupts Celery's general error-handling capabilities.
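For context, the DLX workaround looks roughly like this (queue names, routing keys, and the TTL are hypothetical; only the `x-*` queue arguments are standard RabbitMQ):

```python
from kombu import Exchange, Queue

# Work queue: rejected (non-requeued) messages are dead-lettered
# into a retry exchange instead of being dropped.
task_queue = Queue(
    "tasks",
    Exchange("tasks", type="direct"),
    routing_key="tasks",
    queue_arguments={
        "x-dead-letter-exchange": "tasks.retry",
        "x-dead-letter-routing-key": "tasks",
    },
)

# Retry queue: holds the message for a delay, then dead-letters it
# back onto the work queue. The broker-maintained x-death header
# counts these cycles, which is what bounds the retries.
retry_queue = Queue(
    "tasks.retry",
    Exchange("tasks.retry", type="direct"),
    routing_key="tasks",
    queue_arguments={
        "x-message-ttl": 30_000,  # delay (ms) before redelivery
        "x-dead-letter-exchange": "tasks",
        "x-dead-letter-routing-key": "tasks",
    },
)
```

The catch, as noted above, is that getting messages into the DLX requires rejecting without acking on failure, which is where `task_acks_on_failure_or_timeout=False` comes in.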
I would be happy to work on improving the current state of things in Celery in this matter. As of now, I have come up with two possible approaches (A & B) for improving Celery's WorkerLost handling:
A. Use Celery's retry mechanism (`app.Task.retry()`) instead of `reject(requeue=True)`
Pros:
Retries could be easily limited using `Task.max_retries` or manually by accessing `app.Task.request.retries`.
Requires minimal code changes.
Cons:
Not fully backward-compatible: tasks that are currently requeued indefinitely would, by default, stop retrying once the preconfigured `Task.max_retries` limit (3 by default) is reached.
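As a rough illustration of approach A, the decision the worker's parent process would make on WorkerLost can be modeled like this (a plain-Python sketch; `handle_worker_lost` is a hypothetical helper, not Celery API):

```python
from typing import Optional

def handle_worker_lost(retries: int, max_retries: Optional[int]) -> str:
    """Decide the fate of a task whose worker process died.

    Mirrors app.Task.retry() semantics: retry while the per-task
    retry counter is below max_retries, otherwise report failure.
    max_retries=None means retry forever (i.e. today's behavior).
    """
    if max_retries is None or retries < max_retries:
        return "retry"  # re-publish the task with retries + 1
    return "fail"       # give up and report the WorkerLostError

# With Celery's default Task.max_retries of 3, an OOM-looping task
# would be re-executed at most three more times, then failed:
assert [handle_worker_lost(n, 3) for n in range(4)] == [
    "retry", "retry", "retry", "fail",
]
```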
B. Introduce additional Celery settings `task_requeue_on_worker_lost` and `task_worker_lost_is_failure`
`task_requeue_on_worker_lost` would allow us to select whether `reject()` should be called with `requeue=False` or `requeue=True`, while `task_worker_lost_is_failure` would define whether a task should be reported as failed when it is not requeued. With these two settings available, it would be possible to implement WorkerLost retries with DLX without affecting the lifecycle of non-WorkerLost exceptions.
Pros:
Allows for granular control over the WorkerLost lifecycle: we can decide whether to requeue, reject and fail, or reject without failing so the DLX can jump in.
Fully backward-compatible - we can ship new options with default values that keep WorkerLost handling unchanged.
Cons:
WorkerLost handling will now be configurable through three options, making it more complicated.
Does not provide explicitly limitable retries; retry counting still has to be delegated to the broker.
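To make approach B concrete, a configuration for DLX-based WorkerLost retries could look like this (both new settings are hypothetical; they do not exist in Celery today):

```python
# celeryconfig.py -- hypothetical approach-B configuration for
# DLX-based WorkerLost retries.

task_acks_late = True
task_reject_on_worker_lost = True

# New setting: reject WITHOUT requeueing, so the broker dead-letters
# the message into the retry queue instead of redelivering it forever.
task_requeue_on_worker_lost = False

# New setting: do not report the task as failed yet, since the DLX
# cycle may still redeliver and complete it. Failure reporting is
# deferred to the broker-side retry limit.
task_worker_lost_is_failure = False
```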
Personally, I prefer approach A. However, I am concerned about its backward-compatibility shortcomings. Considering the `task_reject_on_worker_lost` option was introduced over 9 years ago, there's a high chance many Celery deployments implicitly depend on its current behavior. Perhaps we could address this by making the new implementation opt-in while slowly deprecating the original one.
I would love to get feedback about the suggested solutions so that we could agree on the best course of action. I am willing to provide some pseudo-code implementations for both solutions if needed.
I am quite sure that meaningful changes for both the above solutions would be limited to the following part of the Celery codebase: https://github.com/celery/celery/blob/v5.3.6/celery/worker/request.py#L600-L640