Hello RQ folks,
We recently upgraded from RQ 1.12.0 to RQ 1.16.1. Upon deploying the upgrade to our production environment, we noticed periodic very-high-CPU events: at first they occurred about (but not exactly) once every 6 hours and lasted ~30 minutes each, then gradually decreased in frequency to once every 48-72 hours over ~2 weeks.
When we redeployed the RQ container, the periodicity seemed to reset: CPU spikes immediately resumed at roughly 6-hour intervals and slowly spread out again, leading us to suspect this behavior is tied to the initial start time of our RQ workers.
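For anyone who wants to check this correlation on their own deployment: worker registration times are easy to pull out of Redis. A minimal sketch, assuming a default local Redis connection (adjust to your real settings):

```python
from redis import Redis
from rq import Worker

conn = Redis()  # swap in your real connection settings

# birth_date is when each worker process registered itself;
# this is what we correlated against the spike timestamps
for worker in Worker.all(connection=conn):
    print(worker.name, worker.birth_date)
```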
After rolling back to RQ 1.12.0 (with no other related code changes or rollbacks), the problem disappears entirely. This leads us to suspect the issue is related to changes in RQ's internals rather than to our code (or at least to how those changes interact with our scenario).
Unfortunately, I don't have a good way to reproduce the behavior outside our production environment, as it seems related to our fairly high RQ job throughput: our staging environment has an identical setup but much lower job ingress and does not exhibit this behavior. I looked through the release notes, open issues, and closed PRs, but nothing stood out as a likely culprit.
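In case it helps anyone attempt a reproduction, the load shape is roughly the following. This is only a sketch: `noop_job` is a stand-in for our real jobs, and it would need to live in an importable module for the workers to pick it up.

```python
import time

from redis import Redis
from rq import Queue

def noop_job():
    pass  # stand-in; our real jobs do actual work

q = Queue("default", connection=Redis())  # queue name is illustrative

# Enqueue ~25 jobs/second, matching our average production ingress
while True:
    for _ in range(25):
        q.enqueue(noop_job)
    time.sleep(1)
```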
Looking at profiling snapshots, it seems that during the ~30 minutes of high CPU usage our RQ workers' throughput slows (they idle more, fighting for CPU time with whatever is consuming it), causing a backlog of queued jobs, which they then successfully burn through once the mysterious CPU spike ends. As far as we can tell, no jobs fail, and job influx does not change during these spikes.
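(For reference, snapshots like these can be grabbed with any sampling profiler that attaches to a running worker PID; with py-spy, for example, something like the following, where the PID is illustrative:)

```bash
# Flame graph of one worker sampled over 60 seconds
py-spy record --pid 12345 --duration 60 --output worker-spike.svg

# Or a one-off dump of the worker's current Python stacks
py-spy dump --pid 12345
```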
I realize it's a long shot, but does this behavior ring any bells for something that changed between RQ 1.13.0 and RQ 1.16.1? (Or do you have any other suggestions for things to investigate?)
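(For anyone who'd rather skim the raw changes than the release notes, something like the following works against the rq repo; the v-prefixed tag names are an assumption, so adjust if the repo tags differ:)

```bash
git clone https://github.com/rq/rq.git && cd rq
# Worker/queue internals seem like the most likely suspects for this symptom
git log --oneline v1.13.0..v1.16.1 -- rq/worker.py rq/queue.py
```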
Some brief details on our environment:
- RQ 1.12.0 / 1.16.1 running in Docker, managed via supervisord as the Docker entrypoint, per https://python-rq.org/patterns/supervisor/ (a trimmed sketch of the layout follows this list)
- Redis version: 7.0.15
- Python 3.10.5
- We are also using flask-RQ2@18.3 and flask-scheduler@0.13.1
- We are running 10 worker processes (via `flask rq worker <queues>`) and 1 scheduler process (via `flask rq scheduler`) in the same container
- A fairly constant load of ~25 jobs/second on average in production; CPU usage is normally steady at ~30%, rising to ~90% during these CPU events
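For completeness, here is a trimmed sketch of the supervisord layout, following the linked pattern. Program names and queue names are illustrative, not our exact config:

```ini
[program:rq-worker]
; queue names here are placeholders
command=flask rq worker high default low
process_name=%(program_name)s-%(process_num)s
numprocs=10
stopsignal=TERM
autostart=true
autorestart=true

[program:rq-scheduler]
command=flask rq scheduler
numprocs=1
autostart=true
autorestart=true
```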
Please let me know if there's anything else I can share that would be helpful. Thank you so much!