In my current team, we have encountered difficulties gracefully handling WorkerLost errors (especially OOMs) while avoiding infinite requeues. What I would like to see in Celery is the capability to requeue a task after a WorkerLost error only a limited number of times.
Currently, Celery provides us with the `task_reject_on_worker_lost` setting, which always calls `reject(requeue=True)` under the hood. This means it delegates task retrying directly to the broker. However, most (if not all; I've checked RabbitMQ, Redis, and SQS) brokers do not support a retry-counting mechanism when the `requeue=True` option is used. Therefore, without a workaround, Celery will indefinitely requeue tasks that keep causing WorkerLost errors.
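For reference, here is a minimal sketch of the configuration in question (the comments describe the behavior as I understand it from the v5.3.x docs):

```python
# celeryconfig.py -- the current WorkerLost-related options.

# Only meaningful together with late acks: the task is acknowledged
# after execution, so the broker still holds it when the worker dies.
task_acks_late = True

# On WorkerLost (e.g. the child process is OOM-killed), the parent
# process calls reject(requeue=True), so the broker redelivers the
# task immediately -- with no delivery-count limit on any broker
# I have checked.
task_reject_on_worker_lost = True
```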
One way to work around this problem is to not use `task_reject_on_worker_lost` and instead implement a custom retry mechanism based on RabbitMQ DLX (dead-letter exchanges). However, this solution requires the `task_acks_on_failure_or_timeout=False` setting, which significantly disrupts Celery's general error-handling capabilities.
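For context, the DLX workaround looks roughly like this (queue names, routing keys, and the TTL are hypothetical; only the `x-*` queue arguments are standard RabbitMQ):

```python
from kombu import Exchange, Queue

# Work queue: rejected (non-requeued) messages are dead-lettered
# into a retry exchange instead of being dropped.
task_queue = Queue(
    "tasks",
    Exchange("tasks", type="direct"),
    routing_key="tasks",
    queue_arguments={
        "x-dead-letter-exchange": "tasks.retry",
        "x-dead-letter-routing-key": "tasks",
    },
)

# Retry queue: holds the message for a delay, then dead-letters it
# back onto the work queue. The broker-maintained x-death header
# counts these cycles, which is what bounds the retries.
retry_queue = Queue(
    "tasks.retry",
    Exchange("tasks.retry", type="direct"),
    routing_key="tasks",
    queue_arguments={
        "x-message-ttl": 30_000,  # delay (ms) before redelivery
        "x-dead-letter-exchange": "tasks",
        "x-dead-letter-routing-key": "tasks",
    },
)
```

The catch, as noted above, is that getting messages into the DLX requires rejecting without acking on failure, which is where `task_acks_on_failure_or_timeout=False` comes in.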
I would be happy to work on improving the current state of things in Celery in this matter. As of now, I have come up with two possible approaches (A & B) for improving Celery's WorkerLost handling:
A. Use Celery's retry mechanism (`app.Task.retry()`) instead of `reject(requeue=True)`
Pros:
Retries could be easily limited using `Task.max_retries` or manually by accessing `app.Task.request.retries`.
Requires minimal code changes.
Cons:
Not fully backward-compatible: tasks that are currently requeued indefinitely would, by default, stop retrying once the preconfigured `Task.max_retries` limit (3 by default) is reached.
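As a rough illustration of approach A, the decision the worker's parent process would make on WorkerLost can be modeled like this (a plain-Python sketch; `handle_worker_lost` is a hypothetical helper, not Celery API):

```python
from typing import Optional

def handle_worker_lost(retries: int, max_retries: Optional[int]) -> str:
    """Decide the fate of a task whose worker process died.

    Mirrors app.Task.retry() semantics: retry while the per-task
    retry counter is below max_retries, otherwise report failure.
    max_retries=None means retry forever (i.e. today's behavior).
    """
    if max_retries is None or retries < max_retries:
        return "retry"  # re-publish the task with retries + 1
    return "fail"       # give up and report the WorkerLostError

# With Celery's default Task.max_retries of 3, an OOM-looping task
# would be re-executed at most three more times, then failed:
assert [handle_worker_lost(n, 3) for n in range(4)] == [
    "retry", "retry", "retry", "fail",
]
```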
B. Introduce additional Celery settings `task_requeue_on_worker_lost` and `task_worker_lost_is_failure`
`task_requeue_on_worker_lost` would allow us to select whether `reject()` should be called with `requeue=False` or `requeue=True`, while `task_worker_lost_is_failure` would define whether a task should be reported as failed when it is not requeued. With these two settings available, it would be possible to implement WorkerLost retries with DLX without affecting the lifecycle of non-WorkerLost exceptions.
Pros:
Allows for granular control over the WorkerLost lifecycle: we can decide whether to requeue, reject and fail, or reject without failing so the DLX can jump in.
Fully backward-compatible - we can ship new options with default values that keep WorkerLost handling unchanged.
Cons:
WorkerLost handling will now be configurable through three options, making it more complicated.
Does not provide explicitly limitable retries; retry counting still has to be delegated to the broker.
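To make approach B concrete, a configuration for DLX-based WorkerLost retries could look like this (both new settings are hypothetical; they do not exist in Celery today):

```python
# celeryconfig.py -- hypothetical approach-B configuration for
# DLX-based WorkerLost retries.

task_acks_late = True
task_reject_on_worker_lost = True

# New setting: reject WITHOUT requeueing, so the broker dead-letters
# the message into the retry queue instead of redelivering it forever.
task_requeue_on_worker_lost = False

# New setting: do not report the task as failed yet, since the DLX
# cycle may still redeliver and complete it. Failure reporting is
# deferred to the broker-side retry limit.
task_worker_lost_is_failure = False
```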
Personally, I prefer approach A. However, I am concerned about its backward-compatibility shortcomings. Considering the `task_reject_on_worker_lost` option was introduced over 9 years ago, there's a high chance many Celery deployments implicitly depend on its current behavior. Perhaps we could address this by making the new implementation opt-in while slowly deprecating the original one.
I would love to get feedback about the suggested solutions so that we could agree on the best course of action. I am willing to provide some pseudo-code implementations for both solutions if needed.
I am quite sure that meaningful changes for both the above solutions would be limited to the following part of the Celery codebase: https://github.com/celery/celery/blob/v5.3.6/celery/worker/request.py#L600-L640