Fix a couple of race-conditions on shutdown #119

rosa · 2024-01-10T20:45:59Z

This fixes #108 and two other issues I found while working on that one:

Only try to release next unblocked job if the job actually completed, either successfully or with a failure, but it needs to have completed. Otherwise, we might be still claimed but signal the semaphore regardless, so it'd be lying about how many jobs are in progress. A good example where this might happen is when the worker is sent a QUIT signal to exit right away and the thread pool is killed. As the worker or the supervisor would try to release claimed executions after the shutdown, the claimed execution that holds the semaphore could be potentially blocked because the semaphore is held at least by itself. Then, depending on the order of the thread pool shutting down and the worker being deregistered, we could end up with a job trying to unblock itself and a stuck semaphore.
Similarly to the above, don't go through the general dispatch flow when releasing claimed executions. That's it, don't try to gain the concurrency lock, because claimed executions with concurrency limits that are released would most likely be holding the semaphore themselves, as it's released after completing. This means these claimed executions would go to blocked state upon release, leaving the semaphore busy. Just assume that if a job has a claimed execution, it's because it already gained the lock when going to ready. Shipped this one separately in Don't go through the general dispatch flow when releasing claimed executions #121, with proper tests.
Use exit instead of exit! on immediate termination in runnable processes so that at_exit hooks are run if needed.
Besides, remove logging for failing to deregister a process as it just adds noise, and we were re-raising the exception anyway.

I inadvertently lost this one on #57.

Either successfully or failing, but it needs to have completed. Otherwise, we might be still claimed but signal the semaphore regardless, so it'd be lying about how many jobs are in progress. A good example where this might happen is when the worker is sent a QUIT signal to exit right away, and the job gets an exit! midway. As the worker would try to release claimed executions after the shutdown, the claimed execution that holds the semaphore could be potentially blocked because the semaphore is held at least by itself. Then, depending on the order of the thread pool shutting down and the worker being deregistered, we could end up with a job trying to unblock itself and a stuck semaphore.

…cutions That's it, don't try to gain the concurrency lock, because claimed executions with concurrency limits that are released would most likely be holding the semaphore themselves, as it's released after completing. This means these claimed executions would go to blocked upon release, leaving the semaphore busy. Just assume that if a job has a claimed execution, it's because it already gained the lock when going to ready.

So that `at_exit` hooks are run if needed. Besides, remove logging for failing to deregister a process as it just adds noise, and we were re-raising the exception anyway.

That's it, via an `exit` call that becomes SystemExit, only run it when we're shutting down and exiting the loop because of that. It that's not the case, rely on the supervisor to do the clean-up.

rosa added 5 commits January 10, 2024 15:37

Add missing index for release to blocked_executions table

48ba9fb

I inadvertently lost this one on #57.

Use exit instead exit! on immediate termination

ce1a054

So that `at_exit` hooks are run if needed. Besides, remove logging for failing to deregister a process as it just adds noise, and we were re-raising the exception anyway.

Don't run shutdown in forks when exit is requested before we get there

1d246da

That's it, via an `exit` call that becomes SystemExit, only run it when we're shutting down and exiting the loop because of that. It that's not the case, rely on the supervisor to do the clean-up.

This was referenced Jan 10, 2024

deregistering process issues? #108

Closed

How would you kill a specific job? #124

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix a couple of race-conditions on shutdown #119

Fix a couple of race-conditions on shutdown #119

rosa commented Jan 10, 2024 •

edited

Fix a couple of race-conditions on shutdown #119

Are you sure you want to change the base?

Fix a couple of race-conditions on shutdown #119

Conversation

rosa commented Jan 10, 2024 • edited

rosa commented Jan 10, 2024 •

edited