Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a couple of race-conditions on shutdown #119

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

rosa
Copy link
Member

@rosa rosa commented Jan 10, 2024

This fixes #108 and two other issues I found while working on that one:

  • Only try to release next unblocked job if the job actually completed, either successfully or with a failure, but it needs to have completed. Otherwise, we might be still claimed but signal the semaphore regardless, so it'd be lying about how many jobs are in progress. A good example where this might happen is when the worker is sent a QUIT signal to exit right away and the thread pool is killed. As the worker or the supervisor would try to release claimed executions after the shutdown, the claimed execution that holds the semaphore could be potentially blocked because the semaphore is held at least by itself. Then, depending on the order of the thread pool shutting down and the worker being deregistered, we could end up with a job trying to unblock itself and a stuck semaphore.
  • Similarly to the above, don't go through the general dispatch flow when releasing claimed executions. That's it, don't try to gain the concurrency lock, because claimed executions with concurrency limits that are released would most likely be holding the semaphore themselves, as it's released after completing. This means these claimed executions would go to blocked state upon release, leaving the semaphore busy. Just assume that if a job has a claimed execution, it's because it already gained the lock when going to ready. Shipped this one separately in Don't go through the general dispatch flow when releasing claimed executions #121, with proper tests.
  • Use exit instead of exit! on immediate termination in runnable processes so that at_exit hooks are run if needed.
    Besides, remove logging for failing to deregister a process as it just adds noise, and we were re-raising the exception anyway.

Either successfully or failing, but it needs to have completed. Otherwise,
we might be still claimed but signal the semaphore regardless, so it'd be
lying about how many jobs are in progress. A good example where this might
happen is when the worker is sent a QUIT signal to exit right away, and the
job gets an exit! midway. As the worker would try to release claimed executions
after the shutdown, the claimed execution that holds the semaphore could be
potentially blocked because the semaphore is held at least by itself. Then,
depending on the order of the thread pool shutting down and the worker being
deregistered, we could end up with a job trying to unblock itself and a stuck
semaphore.
…cutions

That's it, don't try to gain the concurrency lock, because claimed executions with
concurrency limits that are released would most likely be holding the semaphore
themselves, as it's released after completing. This means these claimed executions
would go to blocked upon release, leaving the semaphore busy. Just assume that if
a job has a claimed execution, it's because it already gained the lock when going
to ready.
So that `at_exit` hooks are run if needed.
Besides, remove logging for failing to deregister a process as it just
adds noise, and we were re-raising the exception anyway.
That's it, via an `exit` call that becomes SystemExit, only run it when we're
shutting down and exiting the loop because of that.
It that's not the case, rely on the supervisor to do the clean-up.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

deregistering process issues?
1 participant