
Job supposed to be re-queued on worker shutdown but it never is #352

Closed
alexrosenfeld10 opened this issue Mar 8, 2024 · 38 comments


@alexrosenfeld10

Hi, sometimes my pods are rescheduled (for whatever reason, rolling restart / node upgrade / new deployment, etc.) and I get a message in my logs:

Worker stop requested while processing background job 'xxx'. It will be re-queued.

However, despite this message, the job is never re-queued.

I've done a bunch of reading here, specifically HangfireIO/Hangfire#2026, but can't seem to find the right answer as to why it's never being re-queued.

I can share the database records for these jobs if it helps. Thanks in advance!

(this is the same issue as HangfireIO/Hangfire#2378, I'm not sure which repo the solution would reside in)

@alexrosenfeld10
Author

According to the Requeue implementation:

public void Requeue()
{
    _storage.UseConnection(null, connection => connection.Execute($@"
        UPDATE ""{_storage.Options.SchemaName}"".""jobqueue""
        SET ""fetchedat"" = NULL
        WHERE ""id"" = @Id;
        ",
        new { Id }));
    _requeued = true;
}

this should set fetchedat back to NULL on the corresponding row in the jobqueue table, but I don't see any records in that table at all.

@dmitry-vychikov
Contributor

Please share the signature of the C# method that represents your job.

If I'm not mistaken, Requeue should work if your job is cancelled gracefully with a cancellation token. You can simply add a CancellationToken as the last parameter, and Hangfire will inject it. Of course, you need to make proper use of the token.

However, is it really a problem? The job will not leave the queue until it is done. Even if interrupted midway, Hangfire should restart it at some point in the future.

@alexrosenfeld10
Author

alexrosenfeld10 commented Mar 9, 2024

@dmitry-vychikov thank you for your prompt reply, I much appreciate it.

Allow me to share more details. The job is a potentially long running job that gets scheduled by the recurring job scheduler on a daily cron. It's important that the job runs to completion after being scheduled, but it doesn't matter if it's interrupted as long as it is picked up again before the next job is scheduled.

When you say:

The job will not leave queue, until it is done. Even if interrupted midway, hangfire should restart it at some point the future.

This is where I'm not sure things are working properly. The job does leave the queue after my pods restart and the log (Worker stop requested while processing background job 'xxx'. It will be re-queued.) appears - the job never gets picked back up, and the jobqueue table is empty.

Here are some details about my application:

  • I have my hangfire server settings set to all defaults:
    registry
        .AddHangfire((sp, config) => config.UsePostgreSqlStorage(c =>
            c.UseNpgsqlConnection(sp.GetRequiredService<IDbRepository>().ConnectionString)))
        .AddHangfireServer();
  • I have my application shutdown set to suspend for 10 seconds before shutting down. Notably, I think this causes problems where the CancellationToken in jobs is not cancelled, because the default ShutdownTimeout is 15 seconds. Please let me know if you think this is a problem (or the problem), and I can adjust the timing of the app shutdown to be greater than 15.
        var lifetime = app.ApplicationServices.GetRequiredService<IHostApplicationLifetime>();
        var logger = app.ApplicationServices.GetService<ILogger>();
        lifetime.ApplicationStopping.Register(() =>
        {
            logger?.Warning("Stopping event SIGTERM received, waiting for 10 seconds");
            Thread.Sleep(10_000);
            logger?.Warning("Termination delay complete, continuing stopping process");
        });

I do pass the cancellation token in my job, but I don't currently use it. I chose not to use it because it's actually OK for the health of my job if it's aborted in the middle, but maybe this is part of the problem (maybe Hangfire needs the job to exit gracefully in order for the re-queueing to work? I don't know.).

Here is my job signature with real code redacted, but it's the same idea

    [AutomaticRetry(Attempts = 3)]
    [DisableConcurrentExecution(24 * 60 * 60)] // the job runs daily, and should run to completion, but depending on how many records it processes it could take a long time. Skip the next day if we're taking longer than 24 hours.
    public async Task Run(
        PerformContext performContext,
        CancellationToken cancellationToken)
    {
        // real code here redacted, but the job basically gets records from a database, does some operations and makes some http calls, then writes to a database and moves on to the next record
        // each 5000 delay here simulates going from record to record
        foreach (var workItem in new[] { 5000, 5000, 5000, 5000, 5000, 5000, 5000, 5000, 5000, 5000 })
        {
            _logger.LogInformation("doing work...");
            await Task.Delay(workItem);
        }
    }

Are you suggesting that I update my code to do this, using the cancellation token:

    [AutomaticRetry(Attempts = 3)]
    [DisableConcurrentExecution(24 * 60 * 60)] // the job runs daily, and should run to completion, but depending on how many records it processes it could take a long time. Skip the next day if we're taking longer than 24 hours.
    public async Task Run(
        PerformContext performContext,
        CancellationToken cancellationToken)
    {
        // real code here redacted, but the job basically gets records from a database, does some operations and makes some http calls, then writes to a database and moves on to the next record
        // each 5000 delay here simulates going from record to record
        foreach (var workItem in new[] { 5000, 5000, 5000, 5000, 5000, 5000, 5000, 5000, 5000, 5000 })
        {
            if (cancellationToken.IsCancellationRequested)
            {
                _logger.LogInformation("Cancelling job");
                return;
            }

            _logger.LogInformation("doing work...");
            await Task.Delay(workItem, cancellationToken);
        }
    }

I can certainly make that change if it's the fix here, but based on the fact that the It will be re-queued log shows up, I'm not sure it will - if that log appears, wouldn't it mean the job should be re-queued regardless of how I use the cancellation token? How come the log appears, yet after the application is shut down the jobqueue table is empty?

@dmitry-vychikov
Contributor

This is where I'm not sure things are working properly. The job does leave the queue after my pods restart and the log (Worker stop requested while processing background job 'xxx'. It will be re-queued.) appears - the job never gets picked back up, and the jobqueue table is empty.

This is very strange, because jobs are not removed from the queue until they are fully processed (Done state). We had a similar problem with long running jobs being aborted due to server restart, but our jobs were left in the queue in the Processing state.

I have my application shutdown set to suspend for 10 seconds before shutting down. Notably, I think this causes problems where the CancellationToken in jobs is not cancelled, because the default ShutdownTimeout is 15 seconds. Please let me know if you think this is a problem (or the problem), and I can adjust the timing of the app shutdown to be greater than 15.

Your implementation of the 10-second suspend seems strange. I suggest avoiding Thread.Sleep. For the shutdown timeout, there should be a native way, like this one: https://learn.microsoft.com/en-us/aspnet/core/fundamentals/host/generic-host?view=aspnetcore-6.0#shutdowntimeout . Does it not work as expected?
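
Roughly what I had in mind - a minimal sketch only, assuming a standard generic host builder (the builder variable and the exact timeout values are illustrative, not taken from your setup):

    // Extend the host's shutdown timeout instead of sleeping in ApplicationStopping,
    // so hosted services (including the Hangfire server) get the full window to stop.
    builder.Services.Configure<HostOptions>(options =>
    {
        options.ShutdownTimeout = TimeSpan.FromSeconds(30); // illustrative value
    });

    // Give the Hangfire server a stop window that fits inside the host's timeout,
    // so workers have time to cancel jobs and requeue them before the process exits.
    builder.Services.AddHangfireServer(options =>
    {
        options.ShutdownTimeout = TimeSpan.FromSeconds(20); // illustrative value
    });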

I'm NOT completely sure that this is causing issues with Hangfire, but it still does not seem right.

I do pass the cancellation token in my job, but I don't currently use it. I chose not to use it because it's actually OK for the health of my job if it's aborted in the middle, but maybe this is part of the problem (maybe Hangfire needs the job to exit gracefully in order for the re-queueing to work? I don't know.).

Graceful shutdown is better, but it shouldn't be a problem provided that the Hangfire storage behaves correctly (keeping the job in the queue until it is completed). If what you explained is true, then making proper use of the cancellation token may not help.

Thoughts on hangfire

Is it correct that you have a recurring job? There is a chance that the bug is somewhere within recurring jobs, but again this is very strange to hear. If I'm not mistaken, recurring jobs are just triggers that enqueue "normal" jobs. Therefore, both types should be processed almost identically with the same durability guarantees.

@azygis do you have any ideas?

Next steps

  1. @alexrosenfeld10 can you upload a minimal sample app that reproduces the problem?
    I had a very brief look at the source code. This should be working correctly; I'm surprised it doesn't.

    • If full sample is not possible, please provide screenshots of Hangfire dashboard. Please show job info prior to execution; while it is in progress; and after ungraceful shutdown.
  2. Remove Thread.Sleep

General recommendations (not related to hangfire)

// the job runs daily, and should run to completion, but depending on how many records it processes it could take a long time. Skip the next day if we're taking longer than 24 hours.

I think that keeping such long-running jobs is not a very good idea. It will be very hard to make such a job run reliably, given that you have already started having issues with server restarts.
I highly recommend splitting your job into smaller chunks that execute faster. One potential approach is to take your foreach loop and enqueue 1 job per collection item (or maybe 1 job per 100/1000/10_000 items, depending on how large that collection is). You will gain more stability and granularity, which will allow you to restart only a small cancelled job instead of the whole process.
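
To illustrate, a rough sketch of that idea (the names IMyJob.ProcessBatch, _repository, GetPendingRecordIdsAsync, ProcessSingleRecordAsync and the batch size of 1000 are made up for the example, not your real code):

    // Parent job: only slices the work and enqueues child jobs, so it finishes fast.
    public async Task Run(PerformContext performContext, CancellationToken cancellationToken)
    {
        var recordIds = await _repository.GetPendingRecordIdsAsync(cancellationToken); // hypothetical data access
        foreach (var batch in recordIds.Chunk(1000))
        {
            // Each batch becomes its own short-lived background job.
            BackgroundJob.Enqueue<IMyJob>(job => job.ProcessBatch(batch, CancellationToken.None));
        }
    }

    // Child job: small enough to complete well within any shutdown window,
    // and cheap to retry if it is interrupted.
    public async Task ProcessBatch(long[] recordIds, CancellationToken cancellationToken)
    {
        foreach (var id in recordIds)
        {
            cancellationToken.ThrowIfCancellationRequested();
            await ProcessSingleRecordAsync(id, cancellationToken); // hypothetical per-record work
        }
    }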

Also, a 24h processing time seems like a big performance issue. Is the operation so complex that it takes that much time? Are you doing any Entity Framework queries inside your foreach loop? If so, a polluted change tracker may be the culprit.

@alexrosenfeld10
Author

alexrosenfeld10 commented Mar 10, 2024

@dmitry-vychikov Thank you again for your detailed reply. I will address in parts.

A thought from the end added to the beginning

While I was writing all of this up I discovered cases in my logs where the jobs were in fact being picked up again by other workers. So, that's great! Seems like we have cases where this is working as intended.

Before we go into further debugging detail, I am wondering if this is possible:

  1. The job is almost over, and about to succeed
  2. Worker stop is requested and therefore the Worker stop requested while processing background job 'xxx'. It will be re-queued. appears.
  3. Before the application shuts down, the job actually finishes up, and gets removed from the queue as it finishes up
  4. The application shuts down before the job can be marked as Succeeded, but after it's removed from the jobqueue (don't know if it's possible, based on this I think it is, because it doesn't happen in a transaction with the updates to the job / state records)

I am wondering if this could be the root cause of my confusion here. It appears as though the job never finished due to the restart, but in fact it did, simply because it snuck over the finish line at the very end.

The cases I have found that look like this have the following records in their job and state tables (the jobqueue table is empty):

[
  {
    "jobid": 1024,
    "stateid": 4346,
    "statename": "Processing",
    "reason": "Triggered by recurring job scheduler",
    "createdat": "2024-03-08 01:00:03.391915 +00:00"
  },
  {
    "jobid": 1024,
    "stateid": 4348,
    "statename": "Processing",
    "reason": null,
    "createdat": "2024-03-08 01:00:03.467332 +00:00"
  },
  {
    "jobid": 1024,
    "stateid": 15858,
    "statename": "Processing",
    "reason": null,
    "createdat": "2024-03-08 01:30:03.541153 +00:00"
  }
]

Notice there is never any follow-up state; it is simply left in Processing and that's it. I wonder if that would be a result of the "sneak over the finish line" situation I'm describing here? Speculation - let me know what you think. It is also odd to see a 30-minute delay between the last two Processing states.

Anyway, read on for the other details addressed.


HostOptions v.s. IHostApplicationLifetime

HostOptions does not exist in netstandard2.1. The code that handles this shutdown lives in a common library that bootstraps many functionalities for all applications at my company; the library targets netstandard2.1 for compatibility reasons.

That said, I'm not sure this is actually a problem. I agree that in general Thread.Sleep is a code smell, but in this case I think it's fine because requests and hosted services don't run on the main thread. You can easily test this yourself: put a long delay there and have a hosted service write console logs forever; the logs will continue until the very end of the shutdown delay.

Graceful shutdown of the job

Graceful shutdown is better, but it shouldn't be a problem provided that Hangfire storage behaves correctly

I agree.

Thoughts on Hangfire

Is it correct that you have a recurring job?

Yes, it's a recurring job, configured as such:

        var options = new RecurringJobOptions
        {
            TimeZone = TimeZoneInfo.GetSystemTimeZones()
                .First(x => x.BaseUtcOffset == TimeSpan.FromHours(jobConfig.TimezoneOffset))
        };

        recurringJobManager
            .AddOrUpdate<IMyJob>(
                nameof(IMyJob),
                x => x.Run(CancellationToken.None),
                IMyJob.CronSchedule,
                options);

Next steps

  1. I will try to make a minimal repro for this. Interestingly, I tried to reproduce it locally and was unable to - locally, the job does return to the jobqueue with a null value for fetchedat and gets re-run on subsequent runs of my app. However, in my deployed environment (Kubernetes with 4 pods, restarts via rolling restart) this issue occurs. I have yet to dive deeper into why (they should behave the same, but clearly something differs); I am fully sure that the job is not being re-run and does not return to the jobqueue.
    • Speculating about the above, is it possible that the following happens:
      • Workers are A, B, C, D
      • Worker A running the job is asked to shutdown
      • Worker A cancels the job, cancellation tokens trigger, and puts the job back in the jobqueue with value null for fetchedat
      • Workers B, C, D haven't been asked to shut down yet (the rolling restart goes one at a time); one of them takes the job out of the jobqueue but doesn't start it yet (I don't know tons about the inner workings of Hangfire, I may be speculating wrongly here, just brainstorming ideas).
      • Worker that took the job out of the jobqueue but hasn't started executing it yet is now asked to shutdown
      • We end up in this state

Regrettably I cannot provide dashboard screenshots, we don't use the dashboard in our deployed environment. I can provide you with database records though. I presume the job, state, and jobqueue tables are what you're going to want to look at? I've looked myself, basically the job shows succeeded, the state history is like Enqueued (by recurring job scheduler) -> Processing -> Processing (there are many processing states, all the same) -> Succeeded.

  2. I can try this, not sure it will make a difference though.

General recommendations

The job will very likely complete in < 2 hours, but depending on which customers the job finds itself needing to address (some weeks it will be a bigger group, some weeks there will not be much to do), it may run long. Even on big weeks, I would be surprised if it took more than 5 hours. The 24-hour limit is just a round number to say "if we run longer than this, don't double up".

The problem here is that I have some server (and cloud service) capacity limits to consider; splitting things up into separate sub-jobs makes it difficult to ensure they don't all run concurrently and overwhelm things. The job very intentionally executes work items steadily, one by one; each work item is already a large chunk of concurrent requests.

The operation involves migrating and transcoding video files that are both large in size and large in quantity. The file download and transcoding happen on separate servers; this job just orchestrates and babysits the process. As the job runs repeatedly, each execution gets shorter (the last run did a lot, so the next one has less left to do), until the next week when more customers are added again.

@alexrosenfeld10
Author

@dmitry-vychikov I've done a bunch more battle testing on this in real deployed environments and can confidently say that this happens often. I have many records of jobs in my job / state tables where they were cancelled, Hangfire said they would be re-queued, but they were never picked back up. I think my original speculation of "the job finishes during the shutdown delay" is wrong; it's not likely that the timing would line up like that so often.

@dmitry-vychikov
Contributor

@dmitry-vychikov I've done a bunch more battle testing on this in real deployed environments and can confidently say that this happens often. I have many records of jobs in my job / state tables where they were cancelled, Hangfire said they would be re-queued, but they were never picked back up. I think my original speculation of "the job finishes during the shutdown delay" is wrong; it's not likely that the timing would line up like that so often.

@alexrosenfeld10 I think the problem is caused by the ordering statement in the job queue query. It prioritizes jobs without a start date over jobs that have already started but not completed.

Can you please confirm this theory? You need to check: the total number of jobs in the jobqueue table, the number of jobs where fetchedat is null, and the number where it is NOT null.

Ordering has been recently discussed here #348

Cc @azygis
I think we have another good reason to remove ordering.

@azygis
Collaborator

azygis commented Mar 13, 2024

I don't think the ordering has anything to do with it here, since the table is allegedly simply empty.

Sorry, I do not have capacity this week to dive into the details here; I appreciate @dmitry-vychikov covering for me.

@alexrosenfeld10
Author

Agreed - unless something else related to ordering is happening in between, it's not the issue, as the jobqueue table ends up empty in the end but the job never gets picked back up.

@azygis
Collaborator

azygis commented Mar 13, 2024

I see you mention Kubernetes. Are you certain that the shutdown of a pod waits gracefully until the app is free? We had to handle SIGTERMs in some of our apps to allow the shutdown to finish properly (not related to Hangfire, just general stuff).

On the other hand, I see that requeueing should update an existing item by unsetting the fetchedat.

Hmmm...

@alexrosenfeld10
Author

Yeah, I am certain we handle graceful shutdown. What do you mean "wait until app is free"?

@alexrosenfeld10
Author

Not sure this helps, but one other data point - the expireat values on the job records all show null after this case happens.

@azygis
Collaborator

azygis commented Mar 13, 2024

What do you mean "wait until app is free"?

I mean: does Kubernetes wait until your current job is cancelled and exits execution?

@alexrosenfeld10
Author

alexrosenfeld10 commented Mar 13, 2024

Oh - no, it waits for 10 seconds, during which time Hangfire should and does cancel the cancellation token. My job does not shut down, though - it runs until the application dies; in other words, it doesn't do anything with the cancellation token. But regardless, Hangfire should notice that (it logs Worker stop requested while processing background job 'xxx'. It will be re-queued. after all) and re-queue it, right?

@azygis
Collaborator

azygis commented Mar 13, 2024

Right - unless Hangfire thinks it's no longer in a valid state, or the job timed out.

The only way the job can be deleted from the table is if it's specifically asked to be removed from the queue. That is done by Hangfire, not by the provider. Again, I can't dig too much right now - but I see the timeout mentioned in the comment. Maybe you can trace how Hangfire comes to this place, or file a Q&A/issue there? Feel free to link this issue too. I'm curious as well what happens with such a job.

We never keep such long-running jobs at work ourselves. If we have some sort of monitoring job, we just schedule it to run again in 10 seconds or so. For orchestrating other processes, you could also restructure the job in this way so the jobs are always short-lived (as they really should be for Hangfire, considering it also has an invisibility timeout which can cause the same job to be picked up again after 30 minutes or so).
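
As a sketch of that pattern (MonitorOnce, IMonitoringJob and CheckExternalProcessAsync are made-up names, just to show the shape):

    public async Task MonitorOnce(CancellationToken cancellationToken)
    {
        // One quick pass over whatever needs checking (hypothetical call).
        await CheckExternalProcessAsync(cancellationToken);

        // Re-schedule itself so every run stays short-lived instead of looping for hours.
        BackgroundJob.Schedule<IMonitoringJob>(
            job => job.MonitorOnce(CancellationToken.None),
            TimeSpan.FromSeconds(10));
    }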

@alexrosenfeld10
Author

@azygis what defines "no longer in a valid state"? What about the timeout? These are long-running jobs; how do I adjust the timeout window / allowed job duration?

In general I'm seeing why long-running jobs are a pain. I'm doing some refactoring right now to make them shorter / split them up, but it'd be great to know more about how timeouts / job cancellation are handled.

Also, yes, I did open an issue in the core Hangfire repo and have not gotten any replies.

@azygis
Collaborator

azygis commented Mar 13, 2024

I do not know the inner workings of Hangfire - that's why I suggested asking these questions there, as they know it way better, or trying to read the Hangfire repository source.

@dmitry-vychikov
Contributor

Agreed - unless something else related to ordering is happening in between, it's not the issue, as the jobqueue table ends up empty in the end but the job never gets picked back up.

@alexrosenfeld10 can you please dump the database table with job statuses visible?

@alexrosenfeld10
Author

Sure, @dmitry-vychikov @azygis I have already shared the data in this comment #352 (comment) but can share again. Please find below:

select *
from vfs.hangfire.jobqueue jq
where jq.jobid = '1024';
-- returns 0 rows
select *
from vfs.hangfire.job j
         inner join vfs.hangfire.state s on j.id = s.jobid
where j.id = '1024';

returns:

1024,15858,Processing,"{""Type"": ""IMyJob, My.Company.Namespace, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null"", ""Method"": ""Run"", ""Arguments"": ""[null,null]"", ""ParameterTypes"": ""[\""Hangfire.Server.PerformContext, Hangfire.Core, Version=1.8.11.0, Culture=neutral, PublicKeyToken=null\"",\""System.Threading.CancellationToken, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e\""]""}","[null, null]",2024-03-08 01:00:03.375680 +00:00,null,0,4346,1024,Enqueued,Triggered by recurring job scheduler,2024-03-08 01:00:03.391915 +00:00,"{""Queue"": ""default"", ""EnqueuedAt"": ""2024-03-08T01:00:03.3897652Z""}",0
1024,15858,Processing,"{""Type"": ""IMyJob, My.Company.Namespace, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null"", ""Method"": ""Run"", ""Arguments"": ""[null,null]"", ""ParameterTypes"": ""[\""Hangfire.Server.PerformContext, Hangfire.Core, Version=1.8.11.0, Culture=neutral, PublicKeyToken=null\"",\""System.Threading.CancellationToken, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e\""]""}","[null, null]",2024-03-08 01:00:03.375680 +00:00,null,0,4348,1024,Processing,null,2024-03-08 01:00:03.467332 +00:00,"{""ServerId"": ""my-service-66d8fc87c7-l8xzf:1:995c7965-c693-40e0-aed8-4de11b406d64"", ""WorkerId"": ""57fa28c0-f090-4bfb-b6e1-c64a79cf824f"", ""StartedAt"": ""2024-03-08T01:00:03.4565093Z""}",0
1024,15858,Processing,"{""Type"": ""IMyJob, My.Company.Namespace, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null"", ""Method"": ""Run"", ""Arguments"": ""[null,null]"", ""ParameterTypes"": ""[\""Hangfire.Server.PerformContext, Hangfire.Core, Version=1.8.11.0, Culture=neutral, PublicKeyToken=null\"",\""System.Threading.CancellationToken, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e\""]""}","[null, null]",2024-03-08 01:00:03.375680 +00:00,null,0,15858,1024,Processing,null,2024-03-08 01:30:03.541153 +00:00,"{""ServerId"": ""my-service-66d8fc87c7-l8xzf:1:995c7965-c693-40e0-aed8-4de11b406d64"", ""WorkerId"": ""389ff6f5-cb1b-4643-8e9c-140cd7680ca4"", ""StartedAt"": ""2024-03-08T01:30:03.5303326Z""}",0

@alexrosenfeld10
Author

Here is the data in JSON format, perhaps it's more readable:

[
  {
    "id": 4346,
    "stateid": 15858,
    "statename": "Processing",
    "invocationdata": {"Type": "IMyJob, My.Company.Namespace, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null", "Method": "Run", "Arguments": "[null,null]", "ParameterTypes": "[\"Hangfire.Server.PerformContext, Hangfire.Core, Version=1.8.11.0, Culture=neutral, PublicKeyToken=null\",\"System.Threading.CancellationToken, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e\"]"},
    "arguments": [null, null],
    "createdat": "2024-03-08 01:00:03.391915 +00:00",
    "expireat": null,
    "updatecount": 0,
    "jobid": 1024,
    "name": "Enqueued",
    "reason": "Triggered by recurring job scheduler",
    "data": {"Queue": "default", "EnqueuedAt": "2024-03-08T01:00:03.3897652Z"}
  },
  {
    "id": 4348,
    "stateid": 15858,
    "statename": "Processing",
    "invocationdata": {"Type": "IMyJob, My.Company.Namespace, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null", "Method": "Run", "Arguments": "[null,null]", "ParameterTypes": "[\"Hangfire.Server.PerformContext, Hangfire.Core, Version=1.8.11.0, Culture=neutral, PublicKeyToken=null\",\"System.Threading.CancellationToken, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e\"]"},
    "arguments": [null, null],
    "createdat": "2024-03-08 01:00:03.467332 +00:00",
    "expireat": null,
    "updatecount": 0,
    "jobid": 1024,
    "name": "Processing",
    "reason": null,
    "data": {"ServerId": "my-service-66d8fc87c7-l8xzf:1:995c7965-c693-40e0-aed8-4de11b406d64", "WorkerId": "57fa28c0-f090-4bfb-b6e1-c64a79cf824f", "StartedAt": "2024-03-08T01:00:03.4565093Z"}
  },
  {
    "id": 15858,
    "stateid": 15858,
    "statename": "Processing",
    "invocationdata": {"Type": "IMyJob, My.Company.Namespace, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null", "Method": "Run", "Arguments": "[null,null]", "ParameterTypes": "[\"Hangfire.Server.PerformContext, Hangfire.Core, Version=1.8.11.0, Culture=neutral, PublicKeyToken=null\",\"System.Threading.CancellationToken, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e\"]"},
    "arguments": [null, null],
    "createdat": "2024-03-08 01:30:03.541153 +00:00",
    "expireat": null,
    "updatecount": 0,
    "jobid": 1024,
    "name": "Processing",
    "reason": null,
    "data": {"ServerId": "my-service-66d8fc87c7-l8xzf:1:995c7965-c693-40e0-aed8-4de11b406d64", "WorkerId": "389ff6f5-cb1b-4643-8e9c-140cd7680ca4", "StartedAt": "2024-03-08T01:30:03.5303326Z"}
  }
]

@azygis
Collaborator

azygis commented Mar 14, 2024

Joined tables as JSON are quite confusing to understand; it's not exactly clear which data comes from which table. But if my reading is right, it looks like two workers are possibly picking it up ~50ms apart?

@alexrosenfeld10
Author

@azygis what format would you like? I can share again.

@dmitry-vychikov
Contributor

@alexrosenfeld10 Can you please share the records from the jobqueue table too?

@alexrosenfeld10
Author

@dmitry-vychikov kindly re-read my previous message - repeated below:

select *
from vfs.hangfire.jobqueue jq
where jq.jobid = '1024';
-- returns 0 rows

@dmitry-vychikov
Contributor

Sorry, I'm reading this chat in short breaks between work :(

I think the recurring job is the problem here.
When a recurring job triggers, it does not put anything into the jobqueue, which is why the job is never retried.

I suggest the following:

  • Look in the Hangfire docs related to recurring jobs and try to find something that mentions retry policies or something similar.
  • And since it is a recurring job, graceful cancellation might actually help here. Try to use the CancellationToken properly and see how it behaves on server shutdown.

@azygis
Collaborator

azygis commented Mar 14, 2024

@azygis what format would you like, I can share again

Do not join the tables. Select them one by one as JSON. It would be best to actually get the whole dump as-is, but I'm not entirely sure whether a dump can filter the rows.

@alexrosenfeld10
Author

@azygis thanks. Job table:

[
  {
    "id": 1024,
    "stateid": 15858,
    "statename": "Processing",
    "invocationdata": {"Type": "IMyJob, My.Company, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null", "Method": "Run", "Arguments": "[null,null]", "ParameterTypes": "[\"Hangfire.Server.PerformContext, Hangfire.Core, Version=1.8.11.0, Culture=neutral, PublicKeyToken=null\",\"System.Threading.CancellationToken, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e\"]"},
    "arguments": [null, null],
    "createdat": "2024-03-08 01:00:03.375680 +00:00",
    "expireat": null,
    "updatecount": 0
  }
]

State table:

[
  {
    "id": 4346,
    "jobid": 1024,
    "name": "Enqueued",
    "reason": "Triggered by recurring job scheduler",
    "createdat": "2024-03-08 01:00:03.391915 +00:00",
    "data": {"Queue": "default", "EnqueuedAt": "2024-03-08T01:00:03.3897652Z"},
    "updatecount": 0
  },
  {
    "id": 4348,
    "jobid": 1024,
    "name": "Processing",
    "reason": null,
    "createdat": "2024-03-08 01:00:03.467332 +00:00",
    "data": {"ServerId": "my-service-66d8fc87c7-l8xzf:1:995c7965-c693-40e0-aed8-4de11b406d64", "WorkerId": "57fa28c0-f090-4bfb-b6e1-c64a79cf824f", "StartedAt": "2024-03-08T01:00:03.4565093Z"},
    "updatecount": 0
  },
  {
    "id": 15858,
    "jobid": 1024,
    "name": "Processing",
    "reason": null,
    "createdat": "2024-03-08 01:30:03.541153 +00:00",
    "data": {"ServerId": "my-service-66d8fc87c7-l8xzf:1:995c7965-c693-40e0-aed8-4de11b406d64", "WorkerId": "389ff6f5-cb1b-4643-8e9c-140cd7680ca4", "StartedAt": "2024-03-08T01:30:03.5303326Z"},
    "updatecount": 0
  }
]

@alexrosenfeld10
Author

I can't give you a full pg_dump; there's sensitive data, so I have to query and filter, apologies.

@azygis
Collaborator

azygis commented Mar 14, 2024

I think the recurring job is the problem here. When a recurring job triggers, it does not put anything into the jobqueue, which is why the job is never retried.

@dmitry-vychikov what do you mean? IIRC a recurring job still works the same way, except it's enqueued when the cron "occurs", so to speak.

<...> JSON </...>

@alexrosenfeld10 I'm really starting to think the invisibility timeout is at play here. If you look at when the first job starts, it's 01:00:03. The second job (same job, but a new instance) starts at 01:30:03. (I misread the previous JSON earlier.)

Hangfire has a mechanism for "avoiding" hung jobs: by default, if a job doesn't finish in 30 minutes, it's considered dead, even though it might still be running. The configuration option is called InvisibilityTimeout. I think the fact that you have multiple instances of the same job running at the same time has something to do with what's happening.

This kind of leaves two options - either set the timeout to 1 day (I think you mentioned something like pods restarting every day?), or even TimeSpan.MaxValue as a test. Keep in mind that this should be a temporary solution if it works, until you refactor the job so a single run doesn't take that much time.

There's also DisableConcurrentExecutionAttribute, which you can put on the job class/method. Make sure it's placed correctly. Considering the registration you showed previously:

        recurringJobManager
            .AddOrUpdate<IMyJob>(
                nameof(IMyJob),
                x => x.Run(CancellationToken.None),
                IMyJob.CronSchedule,
                options);

In this case, the attribute has to be at the interface level. I'm not sure yet whether the attribute has any correlation to the invisibility timeout, though.
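
For example, something like this (a sketch only; the Run signature mirrors what you posted earlier and the 24-hour timeout value is copied from your attribute):

    public interface IMyJob
    {
        // Because the job is registered through the interface, Hangfire resolves
        // filter attributes from here rather than from the implementing class.
        [DisableConcurrentExecution(timeoutInSeconds: 24 * 60 * 60)]
        Task Run(PerformContext performContext, CancellationToken cancellationToken);
    }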

Let me know if any of these changes yield a better flow.

@azygis
Collaborator

azygis commented Mar 15, 2024

@alexrosenfeld10 one more suspicious thing I noticed and forgot to ask. What's up with state.id jumping from 4348 straight to 15858 in 30min? How are there seemingly 10k more state rows in such a short period?

@alexrosenfeld10
Author

alexrosenfeld10 commented Mar 19, 2024

@azygis apologies for the delay, I have been sick.

While it seems crazy, I think that's actually correct. There are thousands of other jobs running in this application, most of which are very quick, lightweight jobs that trigger other, heavier HTTP requests between two separate entities. They get triggered in batches, and this was one of the first batches to run, so the load was the highest (subsequent batches will be smaller; the first one of each rollout phase is big).

As I understand it, each job takes a minimum of 3 state records (Enqueued, Processing, Succeeded), so ~3-5k short-running jobs could very easily hit those numbers. Right now the state counter is at 507348.

@azygis
Collaborator

azygis commented Mar 19, 2024

Thanks for the update. Didn't mean it's "wrong" per se, was more curious whether it's true. If it is, it is.

Please try the suggestions as well once you're able. Two separate workers picking up the same job is usually an indication that the invisibility timeout has passed.

@alexrosenfeld10
Author

@azygis the problem is less "two workers picking up the same job" and more "job gets cut off and never picked back up again".

I did refactor my job to split things up a bit (which means the jobs run shorter) and also to handle cancellation via the cancellation token more correctly. It helped and things seem more stable, so I will probably end up moving forward with that approach, making this a "non-issue"... That said, I still feel like there are issues here with jobs not getting resumed properly. Here's a new question - what happens when:

  1. Job runs long (past the invis timeout)
  2. Hangfire says "job ran long, it's not visible anymore, let's start another one"
  3. Job has [DisableConcurrentExecution] on it

What happens when Hangfire tries to restart the job? Would it fail to start because of the DisableConcurrentExecutionAttribute?


Separate question - if I wanted to, how do I change the invisibility timeout? I can't find it in any of the config callbacks. I'm on the latest version of both Hangfire and Hangfire.PostgreSql.

@azygis
Collaborator

azygis commented Mar 19, 2024

@alexrosenfeld10 I was thinking that a second worker picking up the same job has something to do with it not getting requeued again. There's no proof though as it's hard to replicate.

I checked how it would work with the attribute - essentially, the attribute tries to place a lock on a resource when the job decorated with it is starting (the OnPerforming hook). If it fails to take the lock (i.e. that particular job is already locked), it throws, so it wouldn't enter the implementation of the job at all. I believe it would not enter the Processing state but would go from Enqueued straight to Failed.

Note: I did not test this, I just looked at how the attribute is built and what we do with the locking.


To change the invisibility timeout, use an overload that accepts the storage options. Based on what you provided earlier, it would be something like this (watch out for rogue braces; I wrote this without an IDE):

    registry
        .AddHangfire((sp, config) => config.UsePostgreSqlStorage(c =>
            c.UseNpgsqlConnection(sp.GetRequiredService<IDbRepository>().ConnectionString), new PostgreSqlStorageOptions { InvisibilityTimeout = TimeSpan.FromDays(1) }))
        .AddHangfireServer();

@alexrosenfeld10
Author

alexrosenfeld10 commented Mar 19, 2024

Gotcha. By the way, I've noticed that since implementing the cancellation token handling, when the job gets cancelled it just ends in the Succeeded state and doesn't get re-queued. Is this expected behavior? Am I supposed to check cancellationToken.IsCancellationRequested and return to exit the job gracefully (which is what I do right now), or am I supposed to call ThrowIfCancellationRequested() in order for Hangfire to re-queue it?

It seems like if I just check and return, Hangfire thinks the job finished successfully and doesn't re-queue it.


Thanks - I'll try that out.

@azygis
Collaborator

azygis commented Mar 19, 2024

It is up to you how to handle it gracefully. You can throw and it will be retried when restarted (as long as you don't prevent it with AutomaticRetry), you can exit silently, or you can schedule/enqueue it again before exiting.

If you just leave the job by returning from it, Hangfire has no way of knowing it was actually stopped midway and that you want to resume it later, since the state of the job may not be valid to start it again. Check whether PerformContext has anything like a Cancel flag or similar (PerformingContext has it, IIRC); maybe that would get special handling. But that's not storage-specific, so I can't say whether it would be handled differently.
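
For completeness, the throwing route would look roughly like this, reusing the loop from your earlier snippet (a sketch; as far as I understand, throwing OperationCanceledException on the shutdown token makes Hangfire treat the job as interrupted rather than succeeded):

    foreach (var workItem in new[] { 5000, 5000, 5000, 5000, 5000 })
    {
        // Throwing here on shutdown means Hangfire does not see a normal return,
        // so the job is not marked Succeeded and can be retried/re-queued later.
        cancellationToken.ThrowIfCancellationRequested();

        _logger.LogInformation("doing work...");
        await Task.Delay(workItem, cancellationToken);
    }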

@alexrosenfeld10
Author

alexrosenfeld10 commented Mar 19, 2024

@azygis Thanks, got it. I'll work on implementing all this and look to get the job system stable. Thanks for all the discussion, this has been super helpful.

I'd still love to get to the bottom of why the job doesn't re-queue when it should, but overall, I definitely have the tools to refactor things and make the system stable from this conversation.

@azygis
Collaborator

azygis commented Mar 19, 2024

@alexrosenfeld10 it was helpful to me as well to learn a bit more about the flow. Unfortunately, the how and why is a question for the Hangfire repo, as we're pretty much doing what Hangfire tells us to from the storage side.

I will close the issue here and wish you luck; hopefully they'll reply to your issue there in due time.

@azygis azygis closed this as completed Mar 19, 2024