Jobs that raise exceptions sometimes stay stuck as Running #1249

Open
zarqman opened this issue Feb 15, 2024 · 1 comment

zarqman commented Feb 15, 2024

I'm seeing an issue where jobs that raise an exception and are rescheduled for a later retry remain stuck as Running according to GoodJob (as checked in the UI). As a result, they are never reattempted.

GoodJob is running as a separate process. Killing/restarting the GoodJob process clears the stuck status and the jobs are then immediately run (assuming the retry time has been reached, of course).

Given that the Running state immediately clears upon restarting the process, I'm wondering if this is a stuck advisory lock?
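One way the stuck-lock theory could be checked from the Postgres side is sketched below. It is a rough sketch, not something I've verified against GoodJob's source: the SQL uses only PostgreSQL's built-in `pg_locks` and `pg_stat_activity` views, but the key derivation in `advisory_lock_key` (md5 of the table name, a dash, and the job's `active_job_id`) is an assumption about how GoodJob 3.x builds its lock key. The helper names are hypothetical.

```ruby
require "digest"

# Lists every advisory lock together with the session that holds it.
# Only PostgreSQL's built-in pg_locks / pg_stat_activity views are used.
ADVISORY_LOCKS_SQL = <<~SQL
  SELECT l.pid, l.classid, l.objid, l.objsubid, l.granted,
         a.application_name, a.state, a.state_change
  FROM pg_locks l
  LEFT JOIN pg_stat_activity a ON a.pid = l.pid
  WHERE l.locktype = 'advisory'
  ORDER BY a.state_change
SQL

# ASSUMPTION (not confirmed in this thread): the lock key is the first
# 16 hex digits of md5("<table>-<active_job_id>"), read as a signed
# 64-bit integer, i.e. ('x' || substr(md5(...), 1, 16))::bit(64)::bigint.
def advisory_lock_key(active_job_id, table_name: "good_jobs")
  hex = Digest::MD5.hexdigest("#{table_name}-#{active_job_id}")[0, 16]
  unsigned = hex.to_i(16)
  # Reinterpret the unsigned value as two's-complement signed 64-bit.
  unsigned >= 2**63 ? unsigned - 2**64 : unsigned
end

# pg_locks displays a bigint advisory key as classid (high 32 bits) and
# objid (low 32 bits), with objsubid = 1.
def pg_locks_columns(key)
  unsigned = key % 2**64
  { classid: unsigned >> 32, objid: unsigned & 0xFFFF_FFFF, objsubid: 1 }
end
```

Running `ADVISORY_LOCKS_SQL` (for example through `ActiveRecord::Base.connection.select_all`) while a job is stuck, and comparing the rows with `pg_locks_columns(advisory_lock_key(job_id))`, would show whether a backend belonging to the GoodJob process still holds the lock for a job that is no longer executing.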

I've been trying to track this for a bit as it's infrequent. Current version is 3.23, but have seen it going back to at least 3.19.4. It's happened in production and development.

Just saw it happen on a group of 13 simultaneously scheduled jobs where 2 of 13 jobs raised an exception for a later retry (the exception is expected in this case and not the issue itself). The retry is being handled using Active Job's standard rescue/retry mechanism.

The 2 jobs that raised exceptions did so while other jobs were either starting or running. Both of those 2 ended up stuck as Running. Their retries were scheduled in the middle of other jobs completing (i.e., after some, before others).

The job class of the stuck jobs does not use concurrency limits. 12 of the 13 jobs were for that same class.

Older instances of this problem have returned a "Jobs were interrupted" message--at least sometimes. This latest instance didn't, so it's possible there was something else going on in the older cases.

Any thoughts? How can I help debug this?

zarqman commented Feb 26, 2024

I've seen a couple more instances of this. It turns out jobs only sometimes report Running. Other times they report Queued. So, the reported state may be more of a symptom than a cause.

Perhaps a clue is that trying to Reschedule a Queued job results in an exception indicating the job is already advisory locked.

Restarting the GoodJob process continues to immediately unlock the affected jobs and let them run.
