Job supposed to be re-queued on worker shutdown but it never is #352
According to Hangfire.PostgreSql/src/Hangfire.PostgreSql/PostgreSqlFetchedJob.cs, lines 62 to 72 (at bfaf2c4), a re-queued job should show up again in the jobqueue table, but I don't see any records in that table at all.
|
Please share the signature of the C# method that represents your job. If I'm not mistaken, Requeue should work if your job is cancelled gracefully with a cancellation token. You can simply add a CancellationToken as the last parameter, and Hangfire will inject it. Of course, you need to make proper use of the token. However, is it really a problem? The job will not leave the queue until it is done. Even if interrupted midway, Hangfire should restart it at some point in the future. |
@dmitry-vychikov thank you for your prompt reply, I much appreciate it. Allow me to share more details. The job is a potentially long-running job that gets scheduled by the recurring job scheduler on a daily cron. It's important that the job runs to completion after being scheduled, but it doesn't matter if it's interrupted, as long as it is picked up again before the next job is scheduled. When you say:

"The job will not leave the queue until it is done."

This is where I'm not sure things are working properly. The job does leave the queue after my pods restart, even though the shutdown log says it will be re-queued. Here are some details about my application:
registry
.AddHangfire((sp, config) => config.UsePostgreSqlStorage(c =>
c.UseNpgsqlConnection(sp.GetRequiredService<IDbRepository>().ConnectionString)))
.AddHangfireServer();
var lifetime = app.ApplicationServices.GetRequiredService<IHostApplicationLifetime>();
var logger = app.ApplicationServices.GetService<ILogger>();
lifetime.ApplicationStopping.Register(() =>
{
logger?.Warning("Stopping event SIGTERM received, waiting for 10 seconds");
Thread.Sleep(10_000);
logger?.Warning("Termination delay complete, continuing stopping process");
});

I do pass the cancellation token in my job, but I don't currently use it. I chose not to use it because it's actually OK for the health of my job if it's aborted in the middle, but maybe this is part of the problem (maybe Hangfire needs the job to exit gracefully in order for the re-queueing to work? I don't know). Here is my job signature with real code redacted, but it's the same idea:

[AutomaticRetry(Attempts = 3)]
[DisableConcurrentExecution(24 * 60 * 60)] // the job runs daily, and should run to completion, but depending on how many records it processes it could take a long time. Skip the next day if we're taking longer than 24 hours.
public async Task Run(
PerformContext performContext,
CancellationToken cancellationToken)
{
// real code here redacted, but the job basically gets records from a database, does some operations and makes some http calls, then writes to a database and moves on to the next record
// each 5000 delay here simulates going from record to record
foreach (var workItem in new[] { 5000, 5000, 5000, 5000, 5000, 5000, 5000, 5000, 5000, 5000 })
{
_logger.LogInformation("doing work...");
await Task.Delay(workItem);
}
}

Are you suggesting that I update my code to do this, using the cancellation token:

[AutomaticRetry(Attempts = 3)]
[DisableConcurrentExecution(24 * 60 * 60)] // the job runs daily, and should run to completion, but depending on how many records it processes it could take a long time. Skip the next day if we're taking longer than 24 hours.
public async Task Run(
PerformContext performContext,
CancellationToken cancellationToken)
{
// real code here redacted, but the job basically gets records from a database, does some operations and makes some http calls, then writes to a database and moves on to the next record
// each 5000 delay here simulates going from record to record
foreach (var workItem in new[] { 5000, 5000, 5000, 5000, 5000, 5000, 5000, 5000, 5000, 5000 })
{
if (cancellationToken.IsCancellationRequested)
{
_logger.LogInformation("Cancelling job");
return;
}
_logger.LogInformation("doing work...");
await Task.Delay(workItem, cancellationToken);
}
}

I can certainly make that change if it's the fix here, but based on the fact that the jobqueue table is empty, I'm not sure that's the whole story. |
This is very strange, because jobs are not removed from the queue until they are fully processed.
Your implementation of the 10-second suspend seems strange. I suggest avoiding Thread.Sleep. For the shutdown timeout, there should be a native way, like this one: https://learn.microsoft.com/en-us/aspnet/core/fundamentals/host/generic-host?view=aspnetcore-6.0#shutdowntimeout . Does it not work as expected? I'm NOT completely sure that this is causing issues with Hangfire, but it still does not seem right.
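Something along these lines, just to illustrate (untested sketch; `services` here is whatever collection you register Hangfire on, `registry` in your snippet):

// requires Microsoft.Extensions.Hosting and Microsoft.Extensions.DependencyInjection
services.Configure<HostOptions>(options =>
{
    // The host waits up to 10 seconds for hosted services (including the
    // Hangfire server) to stop, instead of blocking in ApplicationStopping.
    options.ShutdownTimeout = TimeSpan.FromSeconds(10);
});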
Graceful shutdown is better, but it shouldn't be a problem provided that Hangfire storage behaves correctly (keeping the job in the queue until it is completed). If what you explained is true, then making proper use of the cancellation token may not help.

Thoughts on hangfire
Is it correct that you have a recurring job here? @azygis do you have any ideas?

Next steps
General recommendations (not related to hangfire)
I think that keeping such long-running jobs is not a very good idea. It will be very hard to make them run successfully, given that you are already having issues with server restarts. Also, a 24h processing time seems like a big performance issue. Is the operation so complex that it takes that much time? Are you doing any Entity Framework queries inside your foreach loop? If yes, then a polluted change tracker must be the culprit.
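For example (illustration only; the _dbContext and Records names are made up, assuming EF Core):

// hypothetical names (_dbContext, Records); requires Microsoft.EntityFrameworkCore
var records = await _dbContext.Records
    .AsNoTracking() // read-only query, entities are not tracked
    .ToListAsync(cancellationToken);

foreach (var record in records)
{
    // ... process one record, make the http calls, write the results ...

    // EF Core 5+: drop tracked entities so the tracker doesn't keep growing per iteration
    _dbContext.ChangeTracker.Clear();
}
|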
@dmitry-vychikov Thank you again for your detailed reply. I will address it in parts.

A thought from the end, added to the beginning
While I was writing all of this up I discovered cases in my logs where the jobs were in fact being picked up again by other workers. So, that's great! Seems like we have cases where this is working as intended. Before we go into further debugging detail, I am wondering if this is possible: the job actually finishes during the 10-second shutdown delay, so there is nothing left to re-queue.

I am wondering if this could be the root cause of my confusion here. It appears like the job never finished due to the restart, but in fact it does, simply because it sneaks over the finish line right at the end. The cases I have found that look like this have the following records:

[
{
"jobid": 1024,
"stateid": 4346,
"statename": "Processing",
"reason": "Triggered by recurring job scheduler",
"createdat": "2024-03-08 01:00:03.391915 +00:00"
},
{
"jobid": 1024,
"stateid": 4348,
"statename": "Processing",
"reason": null,
"createdat": "2024-03-08 01:00:03.467332 +00:00"
},
{
"jobid": 1024,
"stateid": 15858,
"statename": "Processing",
"reason": null,
"createdat": "2024-03-08 01:30:03.541153 +00:00"
}
]

Notice there is never any follow-up state; it is simply left in Processing. Anyway, read on for the other details addressed.
|
@dmitry-vychikov I've done a bunch more battle testing on this in real deployed environments and can confidently say that this happens often. I have many records of jobs in my job / state tables where they are cancelled, Hangfire says they will be requeued, but they are never picked back up. I think my original speculation of "the job finishes during the shutdown delay" is wrong; it's not likely that the timing would line up like that so often. |
@alexrosenfeld10 I think the problem is caused by the ordering statement in the job queue. It prioritizes jobs without a start date higher than jobs that have already started but not completed. Can you please confirm this theory? You need to check the total number of jobs in the jobqueue table, and how many of them already have a fetch date set. Ordering has been recently discussed here: #348. Cc @azygis |
I don't think the ordering has anything to do with it, since the table is allegedly simply empty. Sorry, I do not have capacity this week to dive into detail here; I appreciate @dmitry-vychikov covering here. |
Agreed; unless something else related to ordering is happening in between, it's not the issue, as the job queue table ends up empty in the end but the job doesn't ever get picked back up. |
I see you mention Kubernetes. Are you certain that the shutdown of a pod waits gracefully until the app is free? We had to handle SIGTERMs in some of our apps to allow finishing the shutdown properly (not related to Hangfire, just general stuff). On the other hand, I see that requeueing should update the existing item by unsetting fetchedat. Hmmm... |
Yeah, I am certain we handle graceful shutdown. What do you mean "wait until app is free"? |
Not sure this helps, but one other data point - the |
Does Kubernetes wait until your current job is cancelled and exits the execution, I mean. |
Oh - no, it waits for 10 seconds, during which time Hangfire should and does cancel the cancellation token. My job does not shut down though - it runs until the application dies; in other words, it doesn't do anything with the cancellation token. But regardless, Hangfire should notice that (it logs that the job will be re-queued on shutdown). |
Right - unless Hangfire thinks it's no longer in a valid state, or the job timed out. The only way the job can be deleted from the table is if it's specifically asked to be removed from the queue. That's done by Hangfire and not by the provider. Again, I can't dig too much right now - but I see the timeout mentioned in the comment. Maybe you'd see how Hangfire comes to this place, or file a QA/issue there? Feel free to link this issue too. I'm curious as well what happens with such a job. We don't ever keep these long-running jobs at work, personally. If we have some sort of monitoring job, we just schedule it to run again in 10s or so. I think for orchestration of some other processes you could also change the job in this way so the jobs are always short-lived (as they really should be for Hangfire, considering it also has an invisibility timeout which can pick up the same job again in 30 minutes or something).
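Roughly like this (a sketch with made-up names, not your actual job):

using System;
using System.Threading;
using System.Threading.Tasks;
using Hangfire;

public class MonitoringJob
{
    public async Task Run(CancellationToken cancellationToken)
    {
        // placeholder for one short slice of real work
        await Task.Delay(TimeSpan.FromSeconds(1), cancellationToken);

        // schedule the next run instead of looping for hours; Hangfire persists
        // the schedule, so a pod restart between runs loses nothing
        BackgroundJob.Schedule<MonitoringJob>(
            job => job.Run(CancellationToken.None),
            TimeSpan.FromSeconds(10));
    }
}
|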
@azygis what defines "no longer in a valid state"? What about the timeout? These are long-running jobs; how do I adjust the timeout window / job duration allowance? In general I am seeing why long-running jobs are a pain. I am doing some refactoring right now to make them shorter / split them up, but it'd be great to know more about how timeout / job cancellation is handled. Also, yes, I did open an issue on core Hangfire and have not gotten any replies. |
I do not know the inner workings of Hangfire - that's why I suggested asking these questions there as they know it way better, or try to read the Hangfire repository. |
@alexrosenfeld10 can you please dump the database table with job statuses visible? |
Sure, @dmitry-vychikov @azygis I have already shared the data in this comment #352 (comment) but can share again. Please find below:

select *
from vfs.hangfire.jobqueue jq
where jq.jobid = '1024';
-- returns 0 rows

select *
from vfs.hangfire.job j
inner join vfs.hangfire.state s on j.id = s.jobid
where j.id = '1024';

returns:
|
Here is the data in JSON format, perhaps it's more readable:

[
{
"id": 4346,
"stateid": 15858,
"statename": "Processing",
"invocationdata": {"Type": "IMyJob, My.Company.Namespace, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null", "Method": "Run", "Arguments": "[null,null]", "ParameterTypes": "[\"Hangfire.Server.PerformContext, Hangfire.Core, Version=1.8.11.0, Culture=neutral, PublicKeyToken=null\",\"System.Threading.CancellationToken, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e\"]"},
"arguments": [null, null],
"createdat": "2024-03-08 01:00:03.391915 +00:00",
"expireat": null,
"updatecount": 0,
"jobid": 1024,
"name": "Enqueued",
"reason": "Triggered by recurring job scheduler",
"data": {"Queue": "default", "EnqueuedAt": "2024-03-08T01:00:03.3897652Z"}
},
{
"id": 4348,
"stateid": 15858,
"statename": "Processing",
"invocationdata": {"Type": "IMyJob, My.Company.Namespace, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null", "Method": "Run", "Arguments": "[null,null]", "ParameterTypes": "[\"Hangfire.Server.PerformContext, Hangfire.Core, Version=1.8.11.0, Culture=neutral, PublicKeyToken=null\",\"System.Threading.CancellationToken, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e\"]"},
"arguments": [null, null],
"createdat": "2024-03-08 01:00:03.467332 +00:00",
"expireat": null,
"updatecount": 0,
"jobid": 1024,
"name": "Processing",
"reason": null,
"data": {"ServerId": "my-service-66d8fc87c7-l8xzf:1:995c7965-c693-40e0-aed8-4de11b406d64", "WorkerId": "57fa28c0-f090-4bfb-b6e1-c64a79cf824f", "StartedAt": "2024-03-08T01:00:03.4565093Z"}
},
{
"id": 15858,
"stateid": 15858,
"statename": "Processing",
"invocationdata": {"Type": "IMyJob, My.Company.Namespace, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null", "Method": "Run", "Arguments": "[null,null]", "ParameterTypes": "[\"Hangfire.Server.PerformContext, Hangfire.Core, Version=1.8.11.0, Culture=neutral, PublicKeyToken=null\",\"System.Threading.CancellationToken, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e\"]"},
"arguments": [null, null],
"createdat": "2024-03-08 01:30:03.541153 +00:00",
"expireat": null,
"updatecount": 0,
"jobid": 1024,
"name": "Processing",
"reason": null,
"data": {"ServerId": "my-service-66d8fc87c7-l8xzf:1:995c7965-c693-40e0-aed8-4de11b406d64", "WorkerId": "389ff6f5-cb1b-4643-8e9c-140cd7680ca4", "StartedAt": "2024-03-08T01:30:03.5303326Z"}
}
] |
Joined tables as JSON are quite confusing to understand; it's not exactly clear which data comes from which table. But if my reading is right, it looks like two workers are possibly picking it up ~50ms apart? |
@azygis what format would you like? I can share it again. |
@alexrosenfeld10 Can you please share the records from the jobqueue table? |
@dmitry-vychikov kindly reread my previous message - rewritten below:

select *
from vfs.hangfire.jobqueue jq
where jq.jobid = '1024';
-- returns 0 rows |
Sorry, I'm reading this chat in short breaks between work :( I think the recurring job is the problem here. I suggest the following:
|
Do not join the tables. Select them one by one as JSON. It would be best to actually get the whole dump as is, but I'm not entirely sure if a dump is able to filter the rows. |
@azygis thanks. Job table:
State table:

[
{
"id": 4346,
"jobid": 1024,
"name": "Enqueued",
"reason": "Triggered by recurring job scheduler",
"createdat": "2024-03-08 01:00:03.391915 +00:00",
"data": {"Queue": "default", "EnqueuedAt": "2024-03-08T01:00:03.3897652Z"},
"updatecount": 0
},
{
"id": 4348,
"jobid": 1024,
"name": "Processing",
"reason": null,
"createdat": "2024-03-08 01:00:03.467332 +00:00",
"data": {"ServerId": "my-service-66d8fc87c7-l8xzf:1:995c7965-c693-40e0-aed8-4de11b406d64", "WorkerId": "57fa28c0-f090-4bfb-b6e1-c64a79cf824f", "StartedAt": "2024-03-08T01:00:03.4565093Z"},
"updatecount": 0
},
{
"id": 15858,
"jobid": 1024,
"name": "Processing",
"reason": null,
"createdat": "2024-03-08 01:30:03.541153 +00:00",
"data": {"ServerId": "my-service-66d8fc87c7-l8xzf:1:995c7965-c693-40e0-aed8-4de11b406d64", "WorkerId": "389ff6f5-cb1b-4643-8e9c-140cd7680ca4", "StartedAt": "2024-03-08T01:30:03.5303326Z"},
"updatecount": 0
}
] |
I can't give you a full pgdump, there's sensitive data, so I have to query and filter, apologies. |
@dmitry-vychikov what do you mean? IIRC a recurring job still works the same way, except it's enqueued when the cron "occurs", so to speak.
@alexrosenfeld10 I'm really starting to think the invisibility timeout is at play here. If you look at when the first job starts, it's 01:00:03. The second job (same job, but a new instance) starts at 01:30:03. Misread the previous JSON. Hangfire has a mechanic for "avoiding" hung jobs, and by default, if the job doesn't finish in 30 minutes, it's considered dead, even though it might still be running. The configuration option is called InvisibilityTimeout.

This kinda leaves two options - either set the timeout to 1 day (I think you mentioned something like pods restarting every day?), or even longer.

There's also DisableConcurrentExecutionAttribute which you can put on the job class/method. Make sure it's placed correctly. Considering the following that you have given previously:
In this case, the attribute has to be on the interface level. Not sure yet if the attribute has any correlation to the invisibility timeout though. Let me know if any of these changes yield a better flow.
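Roughly like this (sketch; IMyJob is the type name from your dump):

using System.Threading;
using System.Threading.Tasks;
using Hangfire;
using Hangfire.Server;

// The job is invoked through IMyJob (see the invocationdata above),
// so the filter attributes should sit on the interface method.
public interface IMyJob
{
    [AutomaticRetry(Attempts = 3)]
    [DisableConcurrentExecution(24 * 60 * 60)]
    Task Run(PerformContext performContext, CancellationToken cancellationToken);
}
|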
@alexrosenfeld10 one more suspicious thing I noticed and forgot to ask. What's up with state.id jumping from 4348 straight to 15858 in 30min? How are there seemingly 10k more state rows in such a short period? |
@azygis apologies for the delay, I have been sick. While it seems crazy, I think that's actually correct. There are thousands of other jobs running in this application, most of which are very quick and lightweight jobs that trigger other heavier-weight HTTP requests between two separate entities. They get triggered in batches, and this was one of the first batches to run, so the load was the highest (subsequent batches will be smaller; the first one of each rollout phase is big). As I understand it, each job takes up a minimum of 3 state records (Enqueued, Processing, Succeeded). |
Thanks for the update. Didn't mean it's "wrong" per se, was more curious whether it's true. If it is, it is. Please try the suggestions as well once you're able. Two separate workers picking up the same job is usually an indication that invisibility timeout has passed. |
@azygis the problem is less "two workers picking up the same job" and more "job gets cut off and never picked back up again". I did refactor my job to split things up a bit (which means they run shorter) and also to handle the cancellation token cancellation more correctly. It helped and things seem more stable, so I will probably end up moving forward with that approach, making this a "non-issue"... that said, I still feel like there are issues here to do with jobs not getting resumed properly. Here's a new question about what happens when a job is cut off mid-run:
What happens when Hangfire tries to restart the job? Would it fail to start because of the DisableConcurrentExecution lock?

Separate question - if I wanted to, how do I change the invisibility timeout? I can't find it in any of the config callbacks. I'm on the latest version of both HF and HF PG. |
@alexrosenfeld10 I was thinking that a second worker picking up the same job has something to do with it not getting requeued again. There's no proof though, as it's hard to replicate.

Checked how it would work with the attribute - essentially what the attribute does is try to place a lock on a resource when the job decorated with it is starting. Note: I did not test it, just looked at the code of how the attribute is built and what we do with the locking.

To change the invisibility timeout, use an overload that accepts the storage options. Based on what you provided earlier, it would be like so (watch out for rogue braces, wrote this without an IDE):

registry
.AddHangfire((sp, config) => config.UsePostgreSqlStorage(c =>
c.UseNpgsqlConnection(sp.GetRequiredService<IDbRepository>().ConnectionString), new PostgreSqlStorageOptions { InvisibilityTimeout = TimeSpan.FromDays(1) }))
.AddHangfireServer(); |
Gotcha. Btw, I've noticed that since implementing the cancellation token handling, when the job gets cancelled it just ends in the Succeeded state. It seems like if I just check and exit, Hangfire thinks the job finished with success and doesn't get re-queued. Thanks - I'll try that out. |
It is up to you how to handle it gracefully. You can throw and it will be retried when restarted (as long as you don't prevent it with the AutomaticRetry attribute settings). If you just leave the job by returning from it, Hangfire has no way of knowing it actually was stopped mid-way and that you want to resume it later, since the state of the job can be invalid to start it again. Check if throwing on cancellation gives you the retry behaviour you want.
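For example, based on your earlier snippet (sketch):

foreach (var workItem in new[] { 5000, 5000, 5000 })
{
    // throwing (rather than returning) ends the job as Failed instead of
    // Succeeded, so the retry mechanics can pick it up again after a restart
    cancellationToken.ThrowIfCancellationRequested();

    _logger.LogInformation("doing work...");
    await Task.Delay(workItem, cancellationToken); // also throws if cancelled mid-delay
}
|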
@azygis Thanks, got it. I'll work on implementing all this and look to get the job system stable. Thanks for all the discussion, this has been super helpful. I'd still love to get to the bottom of why the job doesn't re-queue when it should, but overall, I definitely have the tools to refactor things and make the system stable from this conversation. |
@alexrosenfeld10 it was helpful to me as well to know a bit more about the flow. Unfortunately how and why is a question for Hangfire repo as we're pretty much doing what Hangfire tells us to, from the storage side. I will close the issue here and wish you luck, hopefully they'll reply on their issue in due time. |
Hi, sometimes my pods are rescheduled (for whatever reason: rolling restart, node upgrade, new deployment, etc.) and I get a message in my logs saying the job will be re-queued due to the worker shutting down.
However, despite this message, the job is never re-queued.
I've done a bunch of reading here, specifically HangfireIO/Hangfire#2026, but can't seem to find the right answer as to why it's never being re-queued.
I can share the database records for these jobs if it helps. Thanks in advance!
(this is the same issue as HangfireIO/Hangfire#2378, I'm not sure which repo the solution would reside in)