Support abandoning jobs that take too long to cancel #78

Open
vito opened this issue Nov 27, 2023 · 7 comments · Fixed by #79
Labels: enhancement (New feature or request)

Comments

vito commented Nov 27, 2023

Context

Hiya, thanks for making River! Really digging it so far.

I have an app that queues jobs I don't really need to wait on before exiting, even after cancelling them. I just want to guarantee fast deploys and let jobs get tossed around like a hot potato during a rolling deploy.

Before I continue: this isn't blocking me, and it's a nuanced topic on an edge case within an edge case, so no rush.

StopAndCancel is almost what I want. Cancelled jobs end up in a retryable state and they either get picked back up elsewhere or recovered when the instance comes back. The one case it doesn't handle is when jobs take too long to stop (typically by not respecting ctx.Done(), likely a bug).

I tried to pass a timeout ctx to StopAndCancel, but it doesn't currently respect it:

// Give graceful shutdown 10 seconds before giving up.
stopCtx, stop := context.WithTimeout(context.Background(), 10*time.Second)
defer stop()
if err := db.workerClient.StopAndCancel(stopCtx); err != nil {
	slog.Error("failed to stop and cancel workers", "error", err)
}
// If the context expired, shutdown never completed cleanly.
if stopCtx.Err() != nil {
	slog.Warn("timed out cleaning up jobs")
}

With a job that does time.Sleep(time.Minute), StopAndCancel just waits the full minute instead of being interrupted after 10 seconds. This seems like a simple bug; happy to submit a PR.

Even if it did respect the timeout, there's a secondary problem: if we bail out at that point, the worker's jobs are left in a running state. The job is now stuck, and if you have uniqueness rules like I do, that'll prevent future attempts too. As far as I can tell, the only recourse is to delete the job from the database or manually set its state to retryable.
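
For illustration, that manual recovery looks roughly like this (a sketch that assumes River's default river_job table and its 'running'/'retryable' state values, which I haven't verified):

import (
	"context"

	"github.com/jackc/pgx/v5/pgxpool"
)

// markJobRetryable is a hypothetical helper that manually returns a stuck job
// to retryable. The table name and state values are assumptions about River's
// default schema, not something I've confirmed.
func markJobRetryable(ctx context.Context, pool *pgxpool.Pool, jobID int64) error {
	_, err := pool.Exec(ctx,
		`UPDATE river_job SET state = 'retryable' WHERE id = $1 AND state = 'running'`,
		jobID)
	return err
}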

Proposal

This might be problematic for other use cases, but at least for mine, this would be a big help:

  1. StopAndCancel respects ctx cancellation
  2. When StopAndCancel times out it marks all running jobs on the local worker as retryable

Alternatively, having some way to do (2) independently would be fine. For example, I could call a db.workerClient.Abandon(stopCtx) in the stopCtx.Err() != nil branch above (roughly sketched below).

This would allow graceful shutdown in most cases, prevent hanging in the "truly stuck" case, and allow jobs to be retried in all cases except kill -9.
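
To make that concrete, here's a rough sketch of the shutdown flow I have in mind, where Abandon is the hypothetical new method that marks the client's still-running jobs as retryable:

// Ask workers to stop and cancel their job contexts, with a deadline.
stopCtx, stop := context.WithTimeout(context.Background(), 10*time.Second)
defer stop()
if err := db.workerClient.StopAndCancel(stopCtx); err != nil {
	slog.Error("failed to stop and cancel workers", "error", err)
}

// If the deadline passed, some jobs are still marked running. Abandon (the
// hypothetical part) marks them retryable so another client, or this one after
// a restart, can pick them up. It needs a fresh context because stopCtx has
// already expired.
if stopCtx.Err() != nil {
	slog.Warn("timed out cleaning up jobs, abandoning them")
	abandonCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := db.workerClient.Abandon(abandonCtx); err != nil {
		slog.Error("failed to abandon stuck jobs", "error", err)
	}
}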

Alternative

The root of the problem here is really the stuck running jobs. I would also be OK with leaving everything as-is and instead just enforcing a stop timeout at the infra layer, letting the app get kill -9ed, and having River automatically detect when a running job is abandoned so things aren't stuck forever.

bgentry added a commit that referenced this issue Nov 27, 2023
An earlier refactor caused these graceful shutdown routines to stop
respecting the provided context. No matter what happened with the
provided context, they would continue blocking, though they would
ultimately return the context's error upon return.

With this change, both stop methods will return early if the provided
context is cancelled or timed out _prior_ to the clean shutdown
completing. This makes it straightforward to implement a shutdown flow
which first calls `Stop` with a context that expires after 5 seconds,
and after that call `StopAndCancel`. If the shutdown _still_ isn't
complete, the user can choose to exit anyway.

Partially fixes #78.
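
In code, the flow described in that commit message looks roughly like this (a sketch assuming both stop methods return the context's error when they give up early; names and timeouts are illustrative):

// Phase 1: soft stop -- wait up to 5 seconds for in-flight jobs to finish.
softCtx, cancelSoft := context.WithTimeout(context.Background(), 5*time.Second)
defer cancelSoft()
if err := riverClient.Stop(softCtx); err == nil {
	return // clean shutdown
}

// Phase 2: hard stop -- cancel running jobs and wait a little longer. If this
// also times out, exit anyway.
hardCtx, cancelHard := context.WithTimeout(context.Background(), 5*time.Second)
defer cancelHard()
if err := riverClient.StopAndCancel(hardCtx); err != nil {
	slog.Warn("shutdown still not complete, exiting anyway", "error", err)
}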

bgentry commented Nov 27, 2023

Thanks for reporting this @vito! I've fixed your issue (1) above as part of #79. This is something we used to have but broke during a refactor in the push toward release.

The second part is a bit trickier, although I think it's conceptually possible. Internally the Client knows which jobs haven't yet completed, so it's conceivable that we could grab that list and make a final SQL update prior to exiting. It would be challenging to make sure that this is all done safely and that there are no adverse effects or panics if a job happens to finish while we've already given up on it and are marking it as retryable. It's also a matter of when we make that update: do we do so prior to StopAndCancel returning? If its context is already expired, what context do we use for the SQL update given that all existing contexts have already been cancelled?

I do think that given we can't actually halt a running goroutine in Go, this is probably the right design for us in the end, if we can find a reasonable way to make it work cleanly.

Meanwhile, your best bet is to rely on a combination of StopAndCancel with a context timeout, plus the RescueStuckJobsAfter setting tuned to a duration after which you're comfortable considering your jobs "stuck" so that they get returned to retryable. This isn't ideal, because if you have jobs that sometimes intentionally run long, they will still get "rescued" after that duration and retried again.
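
A minimal sketch of that client configuration (assuming the pgx v5 driver; the duration and worker count are illustrative, and dbPool/workers are set up elsewhere):

import (
	"time"

	"github.com/riverqueue/river"
	"github.com/riverqueue/river/riverdriver/riverpgxv5"
)

// Jobs stuck in the running state longer than RescueStuckJobsAfter are
// returned to retryable by the rescuer, as described above.
riverClient, err := river.NewClient(riverpgxv5.New(dbPool), &river.Config{
	RescueStuckJobsAfter: 15 * time.Minute,
	Queues: map[string]river.QueueConfig{
		river.QueueDefault: {MaxWorkers: 10},
	},
	Workers: workers,
})
if err != nil {
	panic(err)
}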

bgentry added a commit that referenced this issue Nov 27, 2023

vito commented Nov 27, 2023

@bgentry Oh nice, didn't know about RescueStuckJobsAfter! That combined with #79 works for me, so feel free to close. Considering the complexity you mentioned, a global config that solves the general problem is even better.


brandur commented Nov 28, 2023

Nice. #79 + RescueStuckJobsAfter should go a long way towards helping here.

I am a little afraid that uncancellable jobs (and by extension jobs that don't die well after a client stops and are rescued an hour later) could end up being a major Achilles' heel for people, just given how easy it is to write an uncancellable job in Go. Having the client do one final pass that basically runs the rescuer on all jobs that were cancelled but didn't stop, to get their state back in good order for the next time a client starts up, might not be the worst idea.

bgentry added a commit that referenced this issue Nov 28, 2023
bgentry reopened this Nov 28, 2023

bgentry commented Nov 28, 2023

The first part of this is resolved as per #79 and will be included in the upcoming release. It sounds like @brandur is eager to also see if we can find a way to cleanly handle stuck jobs on shutdown, so I will leave this issue open to track that additional work. Thanks again for the detailed issue @vito 🙏


brandur commented Nov 28, 2023

> It sounds like @brandur is eager to also see if we can find a way to cleanly handle stuck jobs on shutdown,

Yep. On my feature list!


brandur commented Dec 2, 2023

@vito We just cut a new release, 0.0.11, in which stopping respects context cancellation.


vito commented Dec 3, 2023

@brandur Awesome, thanks for following up! 🙏
