
run-queue: Don't ack failed commands #6104

Draft · martinpitt wants to merge 1 commit into main

Conversation

martinpitt (Member)

I messed up the recent container cleanup on OpenShift, and it ate our whole statistics queue without actually processing it. This also often leaves test statuses abandoned (stuck at "pending" or "in progress" without ever finishing) when some random network/S3 error occurs.

These are infrastructure errors which need to be fixed. Instead of draining all test requests (which are at least easy to recover) and statistics update requests (which are a lot harder to recover), keep them in the queue. Either another runner will have more luck, or they keep getting kicked around until we fix things.
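
For illustration, here is a minimal sketch of the intended consumer behaviour, assuming a pika-based consumer; process_command() and the wiring are hypothetical placeholders, not the actual run-queue code:

def process_command(body: bytes) -> None:
    """Hypothetical stand-in for running a test or statistics update."""
    ...


def on_message(channel, method, properties, body):
    try:
        process_command(body)
    except Exception:
        # Infrastructure error: do not ack. Requeue the message so another
        # runner can retry it, or so it survives until the problem is fixed.
        channel.basic_nack(delivery_tag=method.delivery_tag, requeue=True)
        raise
    else:
        # Only successfully processed commands are acknowledged and removed.
        channel.basic_ack(delivery_tag=method.delivery_tag)


# Wiring (assuming an existing pika channel):
# channel.basic_consume('statistics', on_message)
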
@allisonkarlitskaya (Member) left a comment

To be honest, this is what I thought it was already doing, but changing this now scares the bejeebers out of me... Can we please first add some queue option for maximum redelivery attempts before we create this infinite loop waiting to happen?

@martinpitt (Member Author)

Hmm, I specifically don't want a maximum delivery count. We get an email notification if there is a queue crash, and we need to react to those anyway. It's just that right now we also have to manually tests-trigger --requeue tests, or lose statistics.

@martinpitt (Member Author)

https://www.rabbitmq.com/docs/queues says that there isn't much control over this. There's a "time-to-live" option and a queue length limit, but neither sounds useful to us.
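
For reference, this is roughly what those two options look like when declaring a queue with pika; the values are made up for illustration, and both options silently discard messages rather than keeping them:

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.queue_declare(
    queue='statistics',
    durable=True,
    arguments={
        'x-message-ttl': 7 * 24 * 60 * 60 * 1000,  # drop messages after a week (in ms)
        'x-max-length': 10000,                     # or cap the queue length
    },
)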

@martinpitt (Member Author)

> We get an email notification

That said, we don't get these for the CentOS CI deployment, as we don't set the TEST_NOTIFICATION* env vars. That's an omission, I'll look into that.

@martinpitt (Member Author)

I remember again: We don't send email from CentOS CI as there is no working public MX for redhat.com. We only get the internal one from our e2e/PSI machines:

❱❱❱ dig MX redhat.com
;; ANSWER SECTION:
redhat.com.             3600    IN      MX      10 us-smtp-inbound-2.mimecast.com.
redhat.com.             3600    IN      MX      10 us-smtp-inbound-1.mimecast.com.

and in python3:

>>> import smtplib
>>> s = smtplib.SMTP('us-smtp-inbound-1.mimecast.com')

that just hangs.

martinpitt marked this pull request as draft on March 20, 2024, 14:13
@martinpitt (Member Author)

Configure a maximum retry count, and send items to a dead-letter exchange after that is exceeded.
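
A rough sketch of what that could look like with pika; the exchange and queue names are illustrative, and the x-delivery-limit argument only applies to quorum queues:

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

# Dead-letter exchange and queue where rejected messages end up for inspection.
channel.exchange_declare(exchange='dead-letter', exchange_type='fanout')
channel.queue_declare(queue='dead-letter', durable=True)
channel.queue_bind(queue='dead-letter', exchange='dead-letter')

# Work queue: once the delivery limit is exceeded, RabbitMQ routes the message
# to the dead-letter exchange instead of redelivering it forever.
channel.queue_declare(
    queue='statistics',
    durable=True,
    arguments={
        'x-queue-type': 'quorum',
        'x-delivery-limit': 5,
        'x-dead-letter-exchange': 'dead-letter',
    },
)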

@martinpitt (Member Author)

@allisonkarlitskaya FTR, I won't work on that right now. It's too involved, and I really need to work on something non-CI for a while.

As a compromise, would you be willing to only keep "statistics" queue entries, and silently discard the public/rhel queues? At least the latter are easy to recover.
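
A rough sketch of that compromise; process_command() is again a hypothetical stand-in, and the queue names and dispatch follow the ones mentioned in this thread rather than the actual code:

def process_command(body: bytes) -> None:
    """Hypothetical stand-in for the actual command processing."""
    ...


def on_message(queue_name, channel, method, properties, body):
    try:
        process_command(body)
    except Exception:
        if queue_name == 'statistics':
            # Hard to recover: keep the entry in the queue for another attempt.
            channel.basic_nack(delivery_tag=method.delivery_tag, requeue=True)
        else:
            # public/rhel test requests are easy to re-trigger: drop them.
            channel.basic_ack(delivery_tag=method.delivery_tag)
        raise
    else:
        channel.basic_ack(delivery_tag=method.delivery_tag)


# Wiring (assuming an existing pika channel), e.g.:
# channel.basic_consume('statistics', functools.partial(on_message, 'statistics'))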

@allisonkarlitskaya (Member)

@martinpitt think about what would have happened if we had this on, with the recent statistics failures. We would have ended up with 100% CPU usage 24/7, right?

I'd prefer to drop the messages on the floor, honestly...

@martinpitt (Member Author)

> ended up with 100% CPU usage 24/7, right

Not quite that bad: cockpit-tasks would crash, cockpit-tasks@1.service would auto-restart after 5 minutes, so there's at most one attempt every 5 minutes.

> I'd prefer to drop the messages on the floor, honestly...

That's precisely the bug I'm trying to fix. It broke our weather report/Prometheus for a week, and it wasn't visible in inspect-queue.
