Support abandoning jobs that take too long to cancel #78

Open
vito opened this issue Nov 27, 2023 · 7 comments · Fixed by #79
Labels: enhancement (New feature or request)

Comments

vito commented Nov 27, 2023

Context

Hiya, thanks for making River! Really digging it so far.

I have an app that queues jobs I don't really need to wait on before exiting, even after cancelling them. I just want to guarantee fast deploys and let jobs get tossed around like a hot potato during a rolling deploy.

Before I continue: this isn't blocking me, and it's a nuanced topic on an edge case within an edge case, so no rush.

StopAndCancel is almost what I want. Cancelled jobs end up in a retryable state and they either get picked back up elsewhere or recovered when the instance comes back. The one case it doesn't handle is when jobs take too long to stop (typically by not respecting ctx.Done(), likely a bug).

I tried to pass a timeout ctx to StopAndCancel, but it doesn't currently respect it:

// Give graceful shutdown 10 seconds before giving up.
stopCtx, stop := context.WithTimeout(context.Background(), 10*time.Second)
defer stop()
if err := db.workerClient.StopAndCancel(stopCtx); err != nil {
	slog.Error("failed to stop and cancel workers", "error", err)
}
// If the context expired, shutdown never completed cleanly.
if stopCtx.Err() != nil {
	slog.Warn("timed out cleaning up jobs")
}

With a job that does time.Sleep(time.Minute), StopAndCancel just waits the full minute instead of being interrupted after 10 seconds. This seems like a simple bug; happy to submit a PR.

Even if it did respect the timeout, there's a secondary problem: if we bail out at that point, the worker's jobs are left in a running state. The job is now stuck, and if you have uniqueness rules like I do, that'll prevent future attempts too. As far as I can tell, the only recourse is to delete the job from the database or manually set its state to retryable.
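
For illustration, that manual recovery looks roughly like this (a sketch that assumes River's default river_job table and its 'running'/'retryable' state values, which I haven't verified):

import (
	"context"

	"github.com/jackc/pgx/v5/pgxpool"
)

// markJobRetryable is a hypothetical helper that manually returns a stuck job
// to retryable. The table name and state values are assumptions about River's
// default schema, not something I've confirmed.
func markJobRetryable(ctx context.Context, pool *pgxpool.Pool, jobID int64) error {
	_, err := pool.Exec(ctx,
		`UPDATE river_job SET state = 'retryable' WHERE id = $1 AND state = 'running'`,
		jobID)
	return err
}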

Proposal

This might be problematic for other use cases, but at least for mine, this would be a big help:

  1. StopAndCancel respects ctx cancellation
  2. When StopAndCancel times out it marks all running jobs on the local worker as retryable

Alternatively, having some way to do (2) independently would be fine. For example, I could call a db.workerClient.Abandon(stopCtx) in the stopCtx.Err() != nil branch above (roughly sketched below).

This would allow graceful shutdown in most cases, prevent hanging in the "truly stuck" case, and allow jobs to be retried in all cases except kill -9.
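
To make that concrete, here's a rough sketch of the shutdown flow I have in mind, where Abandon is the hypothetical new method that marks the client's still-running jobs as retryable:

// Ask workers to stop and cancel their job contexts, with a deadline.
stopCtx, stop := context.WithTimeout(context.Background(), 10*time.Second)
defer stop()
if err := db.workerClient.StopAndCancel(stopCtx); err != nil {
	slog.Error("failed to stop and cancel workers", "error", err)
}

// If the deadline passed, some jobs are still marked running. Abandon (the
// hypothetical part) marks them retryable so another client, or this one after
// a restart, can pick them up. It needs a fresh context because stopCtx has
// already expired.
if stopCtx.Err() != nil {
	slog.Warn("timed out cleaning up jobs, abandoning them")
	abandonCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := db.workerClient.Abandon(abandonCtx); err != nil {
		slog.Error("failed to abandon stuck jobs", "error", err)
	}
}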

Alternative

The root of the problem here is really the stuck running jobs. I would also be OK with leaving everything as-is and instead just enforcing a stop timeout at the infra layer, letting the app get kill -9ed, and having River automatically detect when a running job is abandoned so things aren't stuck forever.

bgentry added a commit that referenced this issue Nov 27, 2023
An earlier refactor caused these graceful shutdown routines to stop
respecting the provided context. No matter what happened with the
provided context, they would continue blocking, though they would
ultimately return the context's error upon return.

With this change, both stop methods will return early if the provided
context is cancelled or timed out _prior_ to the clean shutdown
completing. This makes it straightforward to implement a shutdown flow
which first calls `Stop` with a context that expires after 5 seconds,
and after that call `StopAndCancel`. If the shutdown _still_ isn't
complete, the user can choose to exit anyway.

Partially fixes #78.
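
In code, the flow described in that commit message looks roughly like this (a sketch assuming both stop methods return the context's error when they give up early; names and timeouts are illustrative):

// Phase 1: soft stop -- wait up to 5 seconds for in-flight jobs to finish.
softCtx, cancelSoft := context.WithTimeout(context.Background(), 5*time.Second)
defer cancelSoft()
if err := riverClient.Stop(softCtx); err == nil {
	return // clean shutdown
}

// Phase 2: hard stop -- cancel running jobs and wait a little longer. If this
// also times out, exit anyway.
hardCtx, cancelHard := context.WithTimeout(context.Background(), 5*time.Second)
defer cancelHard()
if err := riverClient.StopAndCancel(hardCtx); err != nil {
	slog.Warn("shutdown still not complete, exiting anyway", "error", err)
}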

bgentry commented Nov 27, 2023

Thanks for reporting this @vito! I've fixed your issue (1) above as part of #79. This is something we used to have but broke during a refactor in the push toward release.

The second part is a bit trickier, although I think it's conceptually possible. Internally the Client knows which jobs haven't yet completed, so it's conceivable that we could grab that list and make a final SQL update prior to exiting. It would be challenging to make sure that this is all done safely and that there are no adverse effects or panics if a job happens to finish while we've already given up on it and are marking it as retryable. It's also a matter of when we make that update: do we do so prior to StopAndCancel returning? If its context is already expired, what context do we use for the SQL update given that all existing contexts have already been cancelled?

I do think that given we can't actually halt a running goroutine in Go, this is probably the right design for us in the end, if we can find a reasonable way to make it work cleanly.

Meanwhile, your best bet is to rely on a combination of StopAndCancel with a context timeout, plus the RescueStuckJobsAfter setting tuned to a duration after which you're comfortable considering your jobs "stuck" so that they get returned to retryable. This isn't ideal, because if you have jobs that sometimes intentionally run long, they will still get "rescued" after that duration and retried again.
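
A minimal sketch of that client configuration (assuming the pgx v5 driver; the duration and worker count are illustrative, and dbPool/workers are set up elsewhere):

import (
	"time"

	"github.com/riverqueue/river"
	"github.com/riverqueue/river/riverdriver/riverpgxv5"
)

// Jobs stuck in the running state longer than RescueStuckJobsAfter are
// returned to retryable by the rescuer, as described above.
riverClient, err := river.NewClient(riverpgxv5.New(dbPool), &river.Config{
	RescueStuckJobsAfter: 15 * time.Minute,
	Queues: map[string]river.QueueConfig{
		river.QueueDefault: {MaxWorkers: 10},
	},
	Workers: workers,
})
if err != nil {
	panic(err)
}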

bgentry added a commit that referenced this issue Nov 27, 2023

vito commented Nov 27, 2023

@bgentry Oh nice, didn't know about RescueStuckJobsAfter! That combined with #79 works for me, so feel free to close. Considering the complexity you mentioned, a global config that solves the general problem is even better.


brandur commented Nov 28, 2023

Nice. #79 + RescueStuckJobsAfter should go a long way towards helping here.

I am a little afraid that uncancellable jobs (and by extension jobs that don't die well after a client stops and are rescued an hour later) could end up being a major Achilles' heel for people, just given how easy it is to write an uncancellable job in Go. Having the client do one final pass that basically runs the rescuer on all jobs that were cancelled but didn't stop, to get their state back in good order for the next time a client starts up, might not be the worst idea.

bgentry added a commit that referenced this issue Nov 28, 2023
bgentry reopened this Nov 28, 2023

bgentry commented Nov 28, 2023

The first part of this is resolved as per #79 and will be included in the upcoming release. It sounds like @brandur is eager to also see if we can find a way to cleanly handle stuck jobs on shutdown, so I will leave this issue open to track that additional work. Thanks again for the detailed issue @vito 🙏


brandur commented Nov 28, 2023

> It sounds like @brandur is eager to also see if we can find a way to cleanly handle stuck jobs on shutdown,

Yep. On my feature list!


brandur commented Dec 2, 2023

@vito We just cut a new release, 0.0.11, in which stopping respects context cancellation.


vito commented Dec 3, 2023

@brandur Awesome, thanks for following up! 🙏
