
run-queue: Don't ack failed commands #6104

Draft · martinpitt wants to merge 1 commit into main

Conversation

martinpitt (Member)

I messed up the recent container cleanup on OpenShift, and it ate our whole statistics queue without actually processing it. This also often leaves test statuses abandoned (stuck at "pending" or "in progress" without ever finishing) when some random network/S3 error occurs.

These are infrastructure errors which need to be fixed. Instead of draining all test requests (which are at least easy to recover) and statistics update requests (which are a lot harder to recover), keep them in the queue. Either another runner will have more luck, or they keep getting kicked around until we fix things.
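
For illustration, here is a minimal sketch of the intended consumer behaviour, assuming a pika-based consumer; process_command() and the wiring are hypothetical placeholders, not the actual run-queue code:

def process_command(body: bytes) -> None:
    """Hypothetical stand-in for running a test or statistics update."""
    ...


def on_message(channel, method, properties, body):
    try:
        process_command(body)
    except Exception:
        # Infrastructure error: do not ack. Requeue the message so another
        # runner can retry it, or so it survives until the problem is fixed.
        channel.basic_nack(delivery_tag=method.delivery_tag, requeue=True)
        raise
    else:
        # Only successfully processed commands are acknowledged and removed.
        channel.basic_ack(delivery_tag=method.delivery_tag)


# Wiring (assuming an existing pika channel):
# channel.basic_consume('statistics', on_message)
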
@allisonkarlitskaya (Member) left a comment

To be honest, this is what I thought it was already doing, but changing this now scares the bejeebers out of me... Can we please first add some queue option for maximum redelivery attempts before we create this infinite loop waiting to happen?

@martinpitt (Member Author)

Hmm, I specifically don't want a maximum delivery count. We get an email notification if there is a queue crash, and we need to react to those anyway. It's just that right now we also have to manually tests-trigger --requeue tests, or lose statistics.

@martinpitt (Member Author)

https://www.rabbitmq.com/docs/queues says that there isn't much control over this. There's a "time-to-live" option and a queue length limit, but neither sounds useful to us.
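
For reference, this is roughly what those two options look like when declaring a queue with pika; the values are made up for illustration, and both options silently discard messages rather than keeping them:

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.queue_declare(
    queue='statistics',
    durable=True,
    arguments={
        'x-message-ttl': 7 * 24 * 60 * 60 * 1000,  # drop messages after a week (in ms)
        'x-max-length': 10000,                     # or cap the queue length
    },
)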

@martinpitt (Member Author)

> We get an email notification

That said, we don't get these for the CentOS CI deployment, as we don't set the TEST_NOTIFICATION* env vars. That's an omission, I'll look into that.

@martinpitt (Member Author)

I remember again: We don't send email from CentOS CI as there is no working public MX for redhat.com. We only get the internal one from our e2e/PSI machines:

❱❱❱ dig MX redhat.com
;; ANSWER SECTION:
redhat.com.             3600    IN      MX      10 us-smtp-inbound-2.mimecast.com.
redhat.com.             3600    IN      MX      10 us-smtp-inbound-1.mimecast.com.

and in python3:

>>> import smtplib
>>> s = smtplib.SMTP('us-smtp-inbound-1.mimecast.com')

that just hangs.

martinpitt marked this pull request as draft on March 20, 2024, 14:13
@martinpitt (Member Author)

Configure a maximum retry count, and send items to a dead-letter exchange after that is exceeded.
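
A rough sketch of what that could look like with pika; the exchange and queue names are illustrative, and the x-delivery-limit argument only applies to quorum queues:

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

# Dead-letter exchange and queue where rejected messages end up for inspection.
channel.exchange_declare(exchange='dead-letter', exchange_type='fanout')
channel.queue_declare(queue='dead-letter', durable=True)
channel.queue_bind(queue='dead-letter', exchange='dead-letter')

# Work queue: once the delivery limit is exceeded, RabbitMQ routes the message
# to the dead-letter exchange instead of redelivering it forever.
channel.queue_declare(
    queue='statistics',
    durable=True,
    arguments={
        'x-queue-type': 'quorum',
        'x-delivery-limit': 5,
        'x-dead-letter-exchange': 'dead-letter',
    },
)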

@martinpitt (Member Author)

@allisonkarlitskaya FTR, I won't work on that right now. It's too involved, and I really need to work on something non-CI for a while.

As a compromise, would you be willing to only keep "statistics" queue entries, and silently discard the public/rhel queues? At least the latter are easy to recover.
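
A rough sketch of that compromise; process_command() is again a hypothetical stand-in, and the queue names and dispatch follow the ones mentioned in this thread rather than the actual code:

def process_command(body: bytes) -> None:
    """Hypothetical stand-in for the actual command processing."""
    ...


def on_message(queue_name, channel, method, properties, body):
    try:
        process_command(body)
    except Exception:
        if queue_name == 'statistics':
            # Hard to recover: keep the entry in the queue for another attempt.
            channel.basic_nack(delivery_tag=method.delivery_tag, requeue=True)
        else:
            # public/rhel test requests are easy to re-trigger: drop them.
            channel.basic_ack(delivery_tag=method.delivery_tag)
        raise
    else:
        channel.basic_ack(delivery_tag=method.delivery_tag)


# Wiring (assuming an existing pika channel), e.g.:
# channel.basic_consume('statistics', functools.partial(on_message, 'statistics'))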

@allisonkarlitskaya (Member)

@martinpitt think about what would have happened if we had this on, with the recent statistics failures. We would have ended up with 100% CPU usage 24/7, right?

I'd prefer to drop the messages on the floor, honestly...

@martinpitt (Member Author)

> ended up with 100% CPU usage 24/7, right

Not quite that bad: cockpit-tasks would crash, cockpit-tasks@1.service would auto-restart after 5 minutes, so there's at most one attempt every 5 minutes.

> I'd prefer to drop the messages on the floor, honestly...

That's precisely the bug I'm trying to fix. It broke our weather report/Prometheus for a week, and it wasn't visible in inspect-queue.
