How would you kill a specific job? #124

Open
wheee opened this issue Jan 14, 2024 · 5 comments

Comments


wheee commented Jan 14, 2024

Scenario would be a long-running job that is taking too long and the user wishes to kill it and not have it restarted.

If I were to send the TERM signal to the supervisor pid, I've noticed it has this weird side effect of restarting everything (not just solid_queue) in my procfile (when using foreman).

I also noticed that if I were to send a TERM signal to the worker (assuming it was a 1 thread/1 process worker), then the worker would get restarted and pick up the same job again.

I suppose it's possible to modify the Job in the solid_queue_jobs table so that its finished_at is set, and then send the TERM signal to the worker, but that seems hack-ish.

Also, what if the worker has 5 threads and they're all processing jobs that I don't want to kill?

Would appreciate some direction on this, thanks!

Author

wheee commented Jan 14, 2024

Also wondering... how would you define a callback for your jobs to gracefully handle TERM signals?

When I'm using Resque, I would handle Resque::TermException in the on_failure hook and put any cleanup/shutdown logic in there.

I took a quick look at the source and I do see SolidQueue::Processes::GracefulTerminationRequested but trying to rescue that via rescue_from in my job didn't seem to do anything.

Member

rosa commented Jan 16, 2024

Hey @wheee! These are great questions, and I'm afraid Solid Queue doesn't have a way to support your scenario:

> Scenario would be a long-running job that is taking too long and the user wishes to kill it and not have it restarted.

As you saw, SolidQueue::Processes::GracefulTerminationRequested doesn't reach the jobs because it's only raised within the supervisor when it receives a TERM signal.

> I also noticed that if I were to send a TERM signal to the worker (assuming it was a 1 thread/1 process worker), then the worker would get restarted and pick up the same job again.

That's right. The supervisor would notice the worker has exited and would start it again. If the worker didn't have time to finish the job within the configured shutdown_timeout, then all jobs currently being run by its thread pool would have been killed and left in a claimed state. Then, after the worker was deregistered, these claimed jobs would have been released back to the queue so they could be picked up again. This is different from Resque, where the Resque::TermException bubbles up to the jobs. Right now there's nothing similar to this in Solid Queue, but I'm reevaluating all this, starting with no longer having the workers exit via exit!, but rather via exit (#119). I'm still unsure about that, which is why I haven't merged that PR yet.
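
To make the claim/release sequence above concrete, here is a toy in-memory model of it. This is only a sketch and not Solid Queue's code: `ToyQueue`, `claim`, `finish`, and `release_claimed_by` are names made up for illustration, while Solid Queue itself keeps this state in the database.

```ruby
# Toy model of the claim/release lifecycle described above.
# All names are hypothetical; this is not Solid Queue's implementation.
class ToyQueue
  attr_reader :ready

  def initialize(job_ids)
    @ready = job_ids.dup # jobs waiting to be picked up
    @claimed = {}        # job id => id of the worker process running it
  end

  # A worker takes the next ready job and marks it as claimed.
  def claim(worker_id)
    job_id = @ready.shift
    @claimed[job_id] = worker_id if job_id
    job_id
  end

  # A job that ran to completion is no longer claimed.
  def finish(job_id)
    @claimed.delete(job_id)
  end

  # When a worker is deregistered, whatever it still had claimed
  # goes back to the ready list.
  def release_claimed_by(worker_id)
    released = @claimed.select { |_job_id, owner| owner == worker_id }.keys
    released.each { |job_id| @claimed.delete(job_id) }
    @ready.concat(released)
    released
  end
end

queue = ToyQueue.new([1])
first_run = queue.claim("worker-a")     # worker-a starts job 1
queue.release_claimed_by("worker-a")    # worker-a is killed before finishing
second_run = queue.claim("worker-b")    # the restarted worker picks up job 1 again
```

In this model the only way to keep a killed job from coming back is to call `finish` before the release step, which mirrors the finished_at workaround mentioned in the original question.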

If you have any ideas or suggestions, feel free to contribute them! 🙏

Author

wheee commented Jan 17, 2024

Hey @rosa, thanks for responding!

> This is different from Resque, where the Resque::TermException bubbles up to the jobs. Right now there's nothing similar to this in Solid Queue, but I'm reevaluating all this, starting with no longer having the workers exit via exit!, but rather via exit (#119). I'm still unsure about that, which is why I haven't merged that PR yet.

I think there's merit in being able to bubble up the 'shutdown' signal/exception to the jobs (and not just the workers). Although, I can see why it may not be as useful unless the ability to kill a job via signals was a possibility.

That being said, I do have scenarios where I use these jobs as long-running processes that remain up and running until explicitly shut down by user command. These jobs typically follow a pub/sub paradigm and can take commands from the UI while streaming data from an external data source. In these cases, it would be nice to be able to clean up gracefully when a TERM signal is received, within the allotted duration before the QUIT signal is issued.

EDIT: the fact that we have a configuration for shutdown_timeout would suggest that jobs should have the ability to respond to the TERM signal... otherwise, why provide the extra time before the QUIT signal is sent?

As a side note, while exploring GoodJob in more detail, I did run across this useful bit:
[image: screenshot of the GoodJob snippet, not preserved]

Happy to report that this works pretty nicely with SolidQueue. So while it may not be possible to explicitly kill jobs via signals, at least the use of Timeout provides assurance that jobs that get stuck for whatever reason will eventually time out, can be handled gracefully, and return to the pool. More importantly, it lets me decide whether I wish to fail the job, retry it, etc.
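
For readers without the screenshot, here is a minimal sketch of the `Timeout` approach in plain Ruby, so it runs on its own. `JobTimeoutError` and `with_job_timeout` are names chosen for this sketch, and the Active Job wiring in the trailing comment is an assumption about how one might hook it in, not a transcription of the image.

```ruby
require "timeout"

# Hypothetical error class, so a timed-out job can be told apart from other failures.
class JobTimeoutError < StandardError; end

# Runs the block and raises JobTimeoutError if it takes longer than `seconds`.
def with_job_timeout(seconds)
  Timeout.timeout(seconds, JobTimeoutError) { yield }
end

# In an Active Job base class this could be wired up roughly like so
# (assumed wiring, left as a comment because it needs Rails to run):
#
#   class ApplicationJob < ActiveJob::Base
#     around_perform do |_job, block|
#       with_job_timeout(600) { block.call }
#     end
#
#     # Decide here whether to fail, retry, or discard.
#     rescue_from(JobTimeoutError) { |error| Rails.logger.warn(error.message) }
#   end

outcome =
  begin
    with_job_timeout(0.05) { sleep 1 } # simulates a stuck job
    :finished
  rescue JobTimeoutError
    :timed_out
  end
```

Note that `Timeout` interrupts the block at whatever point it happens to be executing, so any cleanup the job relies on belongs in an `ensure` clause.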

Member

rosa commented Feb 19, 2024

> EDIT: the fact that we have a configuration for shutdown_timeout would suggest that jobs should have the ability to respond to the TERM signal... otherwise, why provide the extra time before the QUIT signal is sent?

To give the jobs in-flight time to finish, without taking on any other jobs. If we don't provide any extra time, any job in-flight will be stopped right away. With the extra time, the worker knows it shouldn't pick up any more jobs; it just waits until the ones running finish and then shuts down.
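
The behaviour described here (stop taking new jobs, let the running one finish, give up after a deadline) can be sketched with a single-threaded toy worker. `ToyWorker` and its methods are hypothetical and say nothing about how Solid Queue's workers are actually implemented; only the idea of a `shutdown_timeout` comes from the discussion.

```ruby
# Toy worker: hypothetical names, not Solid Queue's implementation.
class ToyWorker
  def initialize(jobs)
    @jobs = Queue.new
    jobs.each { |job| @jobs << job }
    @completed = Queue.new # thread-safe record of finished jobs
    @shutting_down = false
  end

  def start
    @thread = Thread.new do
      until @shutting_down
        job = begin
          @jobs.pop(true) # non-blocking pop, raises ThreadError when empty
        rescue ThreadError
          break
        end
        job.call
        @completed << job
      end
    end
  end

  # Stop picking up new jobs, then wait up to shutdown_timeout seconds for the
  # job in flight. Returns true if it finished in time, false if it was killed.
  def stop(shutdown_timeout)
    @shutting_down = true
    return true if @thread.join(shutdown_timeout)

    @thread.kill
    false
  end

  def completed_count
    @completed.size
  end
end

started = Queue.new
worker = ToyWorker.new([-> { started << true; sleep 0.2 }, -> { sleep 0.2 }])
worker.start
started.pop                       # wait until the first job is in flight
finished_in_time = worker.stop(1) # first job finishes, second is never started
```

With a timeout shorter than the job (for example `worker.stop(0.05)` here), `stop` returns false and the in-flight job is killed without being recorded as completed, which corresponds to the "left in a claimed state" case described earlier in the thread.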

@bdewater

> If you have any ideas or suggestions, feel free to contribute them! 🙏

For gems like https://github.com/Shopify/job-iteration it is useful to have a way to know a graceful shutdown was initiated, so that it can stop after the current iteration is finished and do its own graceful shutdown (pushing the job back on the queue with the persisted progress).

It interacts in various ways with background queues: it uses the Sidekiq quiet callback, GoodJob recently added current_thread_shutting_down? to query for this, and for Resque a monkey patch is used.

IMO a callback is a bit more flexible since for other use cases one might not be able to/want to poll for this.
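
To illustrate the two styles, here is a small sketch of a notifier that supports both polling and callbacks, plus an iteration-style loop that stops at the next checkpoint. Everything in it (`ShutdownNotifier`, `on_shutdown`, `begin_shutdown`, `process_items`) is hypothetical: none of these exist in Solid Queue, and this is not how job-iteration is implemented.

```ruby
# Hypothetical notifier; an assumption for illustration, not an existing API.
class ShutdownNotifier
  def initialize
    @shutting_down = false
    @callbacks = []
  end

  # Polling style: long-running code asks between units of work.
  def shutting_down?
    @shutting_down
  end

  # Callback style: code registers a block to run when shutdown starts.
  def on_shutdown(&block)
    @callbacks << block
  end

  # Would be called by the (hypothetical) worker when it receives TERM.
  def begin_shutdown
    @shutting_down = true
    @callbacks.each(&:call)
  end
end

# Iteration-style job body: processes items one by one and, if shutdown has
# begun, returns the index to resume from. Returns nil when everything is done.
def process_items(items, notifier, start_at: 0)
  items.each_with_index do |item, index|
    next if index < start_at
    return index if notifier.shutting_down?

    yield item
  end
  nil
end

notifier = ShutdownNotifier.new
processed = []
cursor = process_items(%w[a b c d], notifier) do |item|
  processed << item
  notifier.begin_shutdown if item == "b" # simulate TERM arriving mid-run
end
# A re-enqueued job could now continue with start_at: cursor.
```

The polling check only helps code that has natural checkpoints, which is the point made above: a job blocked on a long external call never reaches `shutting_down?`, whereas a callback can still react.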
