Job worker randomly restarts? #213

Closed
krschacht opened this issue Apr 29, 2024 · 3 comments

@krschacht

krschacht commented Apr 29, 2024

I've had SolidQueue set up and working for a while on a low-traffic app, but I sporadically get a job that takes a long time to complete (I just had one that took 7 minutes). When I check the logs, it looks like the job starts, then the worker gets killed and takes a while to restart. After it restarts, it picks up the job again and completes it quickly (the usual ~10 seconds). I'm struggling to figure out the root cause; my UI depends on this job to update things, so the user is just sitting there waiting.

Notably, the pattern is the same: my app is warmed up and running, and I'm taking a series of user actions on the front end, each of which triggers a job. Many succeed in a row, which I can tell because the UI updates quickly, and then at random the UI stops updating. I check the logs and see this weird restart behavior.

My database is Postgres. I'm deployed on Render. I'm running Solid Queue's supervisor together with Puma (using plugin :solid_queue).
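For reference, the supervisor is wired into Puma via the plugin line in config/puma.rb. A trimmed-down sketch of that file (only the plugin line is the relevant part; the rest is standard Puma boilerplate, not copied verbatim from my app):

# config/puma.rb
max_threads_count = ENV.fetch("RAILS_MAX_THREADS") { 5 }
min_threads_count = ENV.fetch("RAILS_MIN_THREADS") { max_threads_count }
threads min_threads_count, max_threads_count

port ENV.fetch("PORT") { 3000 }

# Starts the Solid Queue supervisor alongside the web server,
# so workers are forked and monitored within the same Render service.
plugin :solid_queue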

Here is the log output from a recent failure. I've deleted lines that were definitely unrelated. Any lines beginning with "### " are simple puts messages from within my job:

[SolidQueue] Claimed 1 jobs
I, [2024-04-29T18:16:29.093278 #67]  INFO -- : [ActiveJob] [GetNextAIMessageJob] [33930e21-87b8-4173-9c6e-edc5e39ecf4b] Performing GetNextAIMessageJob (Job ID: 33930e21-87b8-4173-9c6e-edc5e39ecf4b) from SolidQueue(default) enqueued at 2024-04-29T18:16:29.024459609Z with arguments: 3357, 3
### GetNextAIMessageJob.perform(3357, 3, 1)
### Wait for reply
[SolidQueue] Restarting fork[67] (status: )
[SolidQueue] Starting Worker(pid=91, hostname=srv-cn7qcf7sc6pc73cbk1ug-64d5657b7b-4kdb6, metadata={:polling_interval=>0.1, :queues=>"*", :thread_pool_size=>5})
[1] - Worker 0 (PID: 103) booted in 0.0s, phase: 0
I, [2024-04-29T18:18:21.404711 #103]  INFO -- : [6ee21674-2654-445a-a387-2c64c24f786c] Started GET "/cable" for 23.112.39.144 at 2024-04-29 18:18:21 +0000
I, [2024-04-29T18:18:21.413428 #103]  INFO -- : [6ee21674-2654-445a-a387-2c64c24f786c] Started GET "/cable" [WebSocket] for 23.112.39.144 at 2024-04-29 18:18:21 +0000
I, [2024-04-29T18:18:21.413564 #103]  INFO -- : [6ee21674-2654-445a-a387-2c64c24f786c] Successfully upgraded to WebSocket (REQUEST_METHOD: GET, HTTP_CONNECTION: Upgrade, HTTP_UPGRADE: websocket)
I, [2024-04-29T18:18:21.514497 #103]  INFO -- : Turbo::StreamsChannel is streaming from Z2lkOi8vaG9zdGVkLWdwdC9Db252ZXJzYXRpb24vMTI1
[SolidQueue] Pruning dead process 360 - {"polling_interval"=>0.1, "queues"=>"*", "thread_pool_size"=>5}
[SolidQueue] Claimed 1 jobs
I, [2024-04-29T18:23:10.309082 #91]  INFO -- : [ActiveJob] [GetNextAIMessageJob] [33930e21-87b8-4173-9c6e-edc5e39ecf4b] Performing GetNextAIMessageJob (Job ID: 33930e21-87b8-4173-9c6e-edc5e39ecf4b) from SolidQueue(default) enqueued at 2024-04-29T18:16:29.024459609Z with arguments: 3357, 3
### GetNextAIMessageJob.perform(3357, 3, 1)
### Wait for reply
### Finished GetNextAIMessageJob.perform(3357, 3)
I, [2024-04-29T18:23:24.835017 #91]  INFO -- : [ActiveJob] [GetNextAIMessageJob] [33930e21-87b8-4173-9c6e-edc5e39ecf4b] Performed GetNextAIMessageJob (Job ID: 33930e21-87b8-4173-9c6e-edc5e39ecf4b) from SolidQueue(default) in 14525.39ms
I, [2024-04-29T18:23:25.344451 #91]  INFO -- : [ActiveJob] Enqueued Turbo::Streams::BroadcastStreamJob (Job ID: b0083a97-908d-4568-85ec-1ebceb374adf) to SolidQueue(default) with arguments: "Z2lkOi8vaG9zdGVkLWdwdC9Db252ZXJzYXRpb24vMTI1", {:content=>"<turbo-stream action=\"refresh\"></turbo-stream>"}
[SolidQueue] Claimed 1 jobs
I, [2024-04-29T18:23:25.365729 #91]  INFO -- : [ActiveJob] [Turbo::Streams::BroadcastStreamJob] [b0083a97-908d-4568-85ec-1ebceb374adf] Performing Turbo::Streams::BroadcastStreamJob (Job ID: b0083a97-908d-4568-85ec-1ebceb374adf) from SolidQueue(default) enqueued at 2024-04-29T18:23:25.335394484Z with arguments: "Z2lkOi8vaG9zdGVkLWdwdC9Db252ZXJzYXRpb24vMTI1", {:content=>"<turbo-stream action=\"refresh\"></turbo-stream>"}
I, [2024-04-29T18:23:25.366418 #91]  INFO -- : [ActiveJob] [Turbo::Streams::BroadcastStreamJob] [b0083a97-908d-4568-85ec-1ebceb374adf] Performed Turbo::Streams::BroadcastStreamJob (Job ID: b0083a97-908d-4568-85ec-1ebceb374adf) from SolidQueue(default) in 0.87ms
I, [2024-04-29T18:23:26.311872 #103]  INFO -- : [5082870c-319c-444c-b71e-c36f29e7804c] Started GET "/conversations/125/messages?version=1" for 108.162.245.166 at 2024-04-29 18:23:26 +0000
I, [2024-04-29T18:23:26.317496 #103]  INFO -- : [5082870c-319c-444c-b71e-c36f29e7804c] Processing by MessagesController#index as HTML
I, [2024-04-29T18:23:26.317594 #103]  INFO -- : [5082870c-319c-444c-b71e-c36f29e7804c]   Parameters: {"version"=>"1", "conversation_id"=>"125"}
[1] - Worker 1 (PID: 124) booted in 0.0s, phase: 0
I, [2024-04-29T18:23:28.218480 #103]  INFO -- : [5082870c-319c-444c-b71e-c36f29e7804c]   Rendered layout layouts/application.html.erb (Duration: 455.2ms | Allocations: 129565)
I, [2024-04-29T18:23:28.219448 #103]  INFO -- : [5082870c-319c-444c-b71e-c36f29e7804c] Completed 200 OK in 1902ms (Views: 448.0ms | ActiveRecord: 43.2ms | Allocations: 179138)
@rosa
Member

rosa commented Apr 30, 2024

Hey @krschacht, how odd! The restart you see in the logs is the Supervisor realising one of the workers has terminated, and replacing it with another one.

if supervised_fork = forks.delete(pid)
SolidQueue.logger.info "[SolidQueue] Restarting fork[#{status.pid}] (status: #{status.exitstatus})"

"Restarting" is the wrong word there, TBH, because it's actually replacing a terminated fork. I'll fix that in #208. In any case, what this indicates is that something outside Solid Queue is killing that worker 🤔 The lack of status code in your logs:

[SolidQueue] Restarting fork[67] (status: )

makes me think this might be a SIGKILL being sent to the worker.
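To make that concrete: when a child process is reaped after a SIGKILL, Ruby's Process::Status has a nil exitstatus (and a termsig of 9 instead), which is why the interpolation in that log line prints nothing. A quick plain-Ruby illustration, not Solid Queue code:

pid = fork { sleep }           # child process that never exits on its own
Process.kill("KILL", pid)      # simulate the platform OOM-killing the worker
_, status = Process.wait2(pid)

status.exitstatus              # => nil, interpolates as "" in "(status: )"
status.termsig                 # => 9, i.e. SIGKILL
status.signaled?               # => true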

Is it possible you're running something that kills processes that go over certain time or memory usage?

@krschacht
Author

Hi @rosa, thank you so much for the quick response! I recently moved from Heroku to Render and I don't know this platform nearly as well, but I think your hunch is right: the worker got killed because it exceeded its memory limit. I have a ticket open with Render support to confirm, but that's my best read of my metrics. If that's the case, then this issue is on me and has nothing to do with SolidQueue. I think it's safe to close this issue unless I find new clues that suggest otherwise.

While you're fixing the "Restarting ..." message, it might be worth adding a suggested cause, and maybe a conditional for a blank exitstatus too. I'll add a note to the PR with a concrete suggestion.
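Roughly something like this (just a sketch of the idea, not the exact code I'll put in the PR; status is whatever Process::Status object the supervisor already has in scope at that point):

if status.exitstatus.nil? && status.signaled?
  SolidQueue.logger.info "[SolidQueue] Replacing fork[#{status.pid}] " \
    "(terminated by SIG#{Signal.signame(status.termsig)}; possibly killed by the OS or the hosting platform)"
else
  SolidQueue.logger.info "[SolidQueue] Replacing fork[#{status.pid}] (status: #{status.exitstatus})"
end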

@rosa
Member

rosa commented May 18, 2024

Going to close this one, as I already merged #208 with improved logging, so this is no longer as obscure. Thanks again for raising it, @krschacht!

@rosa rosa closed this as completed May 18, 2024