run-queue: Don't ack failed commands #6104
base: main
Conversation
These are infrastructure errors which need to be fixed. Instead of draining all test requests (which are easy to recover, though) and statistics update requests (which are a lot harder to recover), keep them in the queue. Either another runner will have more luck, or they keep getting kicked around until we fix things.

I messed up the recent container cleanup on OpenShift, and it ate our whole statistics queue without actually processing it. This also often leaves test statuses stuck as "pending" or "in progress" without ever finishing, when some random network/S3 error occurs.
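A sketch of what this amounts to on the consumer side, assuming a pika-based handler (the `process()` helper, queue name, and channel wiring here are illustrative, not the actual run-queue code):

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()


def process(body: bytes) -> None:
    ...  # hypothetical: run the test or apply the statistics update


def on_message(channel, method, properties, body):
    try:
        process(body)
    except Exception:
        # Infrastructure error: don't ack. Requeue the message so another
        # runner can pick it up, or it keeps cycling until things are fixed.
        channel.basic_nack(delivery_tag=method.delivery_tag, requeue=True)
        raise
    # Only acknowledge (and thereby drop) the message once it succeeded.
    channel.basic_ack(delivery_tag=method.delivery_tag)


channel.basic_consume(queue="statistics", on_message_callback=on_message)
channel.start_consuming()
```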
To be honest, this is what I thought it was already doing, but changing this now scares the bejeebers out of me... Can we please first add some queue option for maximum redelivery attempts before we create this infinite loop waiting to happen?
Hmm, I specifically don't want a max delivery number. We get an email notification if there is a queue crash, and need to react to those anyway. Right now we just have to manually …
https://www.rabbitmq.com/docs/queues says that there isn't much control over this. There's a "time-to-live" option and a queue length limit, but neither sounds useful to us.
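For reference, those two knobs are set as queue arguments at declaration time; a sketch assuming pika, with made-up values:

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# The two options the docs offer; neither limits *redelivery* attempts.
channel.queue_declare(
    queue="statistics",
    durable=True,
    arguments={
        "x-message-ttl": 24 * 60 * 60 * 1000,  # drop messages after 24h
        "x-max-length": 1000,                  # cap the queue length
    },
)
```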
That said, we don't get these for the CentOS CI deployment, as we don't set the …
Now I remember: we don't send email from CentOS CI, as there is no working public MX for redhat.com. We only get the internal one from our e2e/PSI machines. And in python3:
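A minimal sketch of that call, assuming a bare smtplib connect (the hostname is an assumption):

```python
import smtplib

# With no reachable public MX, the TCP connection to port 25 never
# completes; smtplib sets no timeout by default, so this blocks forever.
smtplib.SMTP("redhat.com")
```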
that just hangs.
Configure/set up a max retry, and send items to a dead-letter exchange after that.
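A sketch of that setup, again assuming pika; the exchange/queue names and the limit of 5 are made up:

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# Rejected messages get rerouted here instead of being requeued forever.
channel.exchange_declare(exchange="dead-letters", exchange_type="fanout")
channel.queue_declare(queue="dead-letter-queue", durable=True)
channel.queue_bind(queue="dead-letter-queue", exchange="dead-letters")

# Quorum queues support a native delivery limit; after that many
# redeliveries the message is dead-lettered instead of requeued again.
channel.queue_declare(
    queue="statistics",
    durable=True,
    arguments={
        "x-queue-type": "quorum",
        "x-delivery-limit": 5,
        "x-dead-letter-exchange": "dead-letters",
    },
)
```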
@allisonkarlitskaya FTR, I won't work on that right now. It's too involved, and I really need to work on something non-CI for a while. As a compromise, would you be willing to only keep "statistics" queue entries, and silently discard the public/rhel queues? At least the latter are easy to recover.
@martinpitt think about what would have happened if we had this on, with the recent statistics failures. We would have ended up with 100% CPU usage 24/7, right? I'd prefer to drop the messages on the floor, honestly... |
Not quite as bad: cockpit-tasks would crash, …
That's precisely the bug I'm trying to fix. It broke our weather report/Prometheus for a week, and wasn't visible in inspect-queue.