api: Create logic for automatically cleaning up dead streams #2013

victorges · 2024-01-17T19:26:01Z

What does this pull request do? Explain your changes. (required)

This creates an API and a corresponding cron job to clean up streams that are marked active
but haven't been updated in a while. This improves on the lazy approach, which
only tried to clean up streams once they were accessed via the API and also had the
undesirable effect of triggering expensive mutations from read-only requests.
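
For illustration only, the selection step described above could look roughly like the sketch below. The `StreamInfo` shape, the `findStaleActiveStreams` name and the 5-minute threshold are assumptions made for the example, not the actual code in this PR. The fallback to `createdAt` mirrors the "Fallback to createdAt if lastSeen never set" commit, assuming an unset `lastSeen` is `undefined`.

```typescript
// Hypothetical, simplified stream shape; the real schema has many more fields.
interface StreamInfo {
  id: string;
  isActive: boolean;
  lastSeen?: number; // ms since epoch; assumed undefined if never set
  createdAt: number; // ms since epoch
}

// Assumed threshold; the real value is not stated in this PR thread.
const STALE_THRESHOLD_MS = 5 * 60 * 1000;

// Returns the streams still flagged active whose last update is older than the threshold.
function findStaleActiveStreams(
  streams: StreamInfo[],
  now: number,
  thresholdMs: number = STALE_THRESHOLD_MS
): StreamInfo[] {
  return streams.filter((s) => {
    if (!s.isActive) return false;
    // Fall back to the creation time when the stream was never seen.
    const lastUpdate = s.lastSeen ?? s.createdAt;
    return now - lastUpdate > thresholdMs;
  });
}
```

A cron-triggered endpoint would then run the clean-up only on the returned streams, so plain read requests never need to mutate anything.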

Specific updates (required)

Create the new jobs API
Create the cronjob action
Create a separate DB connection pool for background operations like this (and webhooks, and tasks)
Remove old lazy logic ✨

How did you test each of these updates (required)

Ran the queries on the DB
yarn test
Try it on staging

Does this pull request close any open issues?
Implements ENG-869

Checklist

I have read the CONTRIBUTING document.
My change requires a change to the documentation.
I have updated the documentation accordingly.
I have added tests to cover my changes.

victorges · 2024-01-17T20:15:02Z

Ready for review! Could you take a look @leszko ?

leszko

Good work @victorges !

Added some comments, but in general I really like the approach and love that we'll start having this functionality!

leszko · 2024-01-18T09:25:26Z

.github/workflows/cron-streams-active-cleanup.yaml

@@ -0,0 +1,15 @@
+name: "cron: Active streams cleanup"


I wonder whether we should run it as a Kubernetes CronJob instead. I know we currently don't have a clear agreement on whether cron jobs should be GH Actions or Kubernetes CronJobs, but these are the advantages I see in integrating it into our infra:

Alerts integration (Grafana, PagerDuty)

(super minor) GH Actions load on the GH workers

(minor) security and passing admin LP token

Anyway, I leave it up to you.

victorges · 2024-04-25

Even though I agree it would be much better to run these as Kubernetes CronJobs, I'd leave that as a separate change for the future, given that we already have other (more critical) workflows running on GitHub Actions, mainly usage billing. We should probably raise this as tech debt to be prioritized eventually, though.

leszko · 2024-01-18T09:29:16Z

packages/api/src/controllers/stream.ts

@@ -229,6 +234,7 @@ function activeCleanupOne(
  }

  setImmediate(async () => {
+    await cleanupSemaphore.wait();


The "right" solution would be a separate connection pool for the cleanup work; this is a workaround. I wonder how difficult it would be to implement a separate connection pool for the cleanup.

victorges · 2024-04-25

I've implemented the separate connection pool here: ef6df93

I'd recommend reviewing that commit separately, since I touched several different files (also moved webhooks, tasks and usage logic to the jobs conn pool).
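
For context on the workaround in the diff above: a semaphore caps how many cleanups run at once, so they cannot take every connection in the shared pool. Below is a minimal sketch of such a semaphore. Only the `wait()` method name comes from the diff; the `Semaphore` class, `release()` and `withPermit` are hypothetical and not the implementation used in the repo.

```typescript
// Minimal counting semaphore (illustrative; not the repo's implementation).
class Semaphore {
  private available: number;
  private waiters: Array<() => void> = [];

  constructor(permits: number) {
    this.available = permits;
  }

  // Resolves once a permit is available.
  async wait(): Promise<void> {
    if (this.available > 0) {
      this.available--;
      return;
    }
    await new Promise<void>((resolve) => this.waiters.push(resolve));
  }

  // Hands the permit directly to the next waiter, or returns it to the pool.
  release(): void {
    const next = this.waiters.shift();
    if (next) {
      next();
    } else {
      this.available++;
    }
  }
}

// Runs `task` while holding a permit, releasing it even if the task throws.
async function withPermit<T>(sem: Semaphore, task: () => Promise<T>): Promise<T> {
  await sem.wait();
  try {
    return await task();
  } finally {
    sem.release();
  }
}
```

A dedicated pool removes the need for this kind of throttling against the request-serving pool, which is presumably why it is described as the "right" solution.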

leszko · 2024-01-18T09:36:33Z

.github/workflows/cron-streams-active-cleanup.yaml

+
+on:
+  schedule:
+    - cron: "*/10 * * * *" # run every 10 minutes


How about running it every 5 min and getting rid of all the opportunistic cleanup? I think this job should be the only place where we do the stream cleanup.

Otherwise, we risk not exposing a lot of bugs + it's super wrong that making a GET call to studio makes a change in the Studio DB.

victorges · 2024-04-26

Super scary, but I also would love to remove the opportunistic logic. Implemented here: f70dfe3

Notice that with a 5-minute frequency, the worst-case delay for a stream to be cleaned up will be 10 minutes. I don't think making the GitHub Action more frequent would be a good idea though, given limited runners etc., so we can consider that once we move to an actual cron in our infra. WDYT?
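
One way the 10-minute figure can arise, assuming (this is not stated in the thread) that a stream only counts as stale after 5 minutes without updates: if a run fires just before the stream becomes eligible, the stream has to wait one more full interval. The function name and parameters below are hypothetical.

```typescript
// Upper bound, in minutes, on the time between a stream's last update and its
// clean-up, for a job that runs every `intervalMin` minutes and only touches
// streams idle for at least `staleAfterMin` minutes (assumed threshold).
function worstCaseCleanupDelayMin(intervalMin: number, staleAfterMin: number): number {
  // Worst case: a run fires just before the stream becomes eligible,
  // so the stream waits one more full interval.
  return staleAfterMin + intervalMin;
}
```

Under the same assumption, the original 10-minute schedule would give a bound of 15 minutes.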

victorges · 2024-04-26T12:58:34Z

packages/api/src/controllers/webhook.test.ts

+      let hooksCalled = 0;
+      for (let i = 0; i < 5 && hooksCalled < 2; i++) {
+        await sleep(500);
+
+        hooksCalled = 0;
+        for (const id of [webhookResJson.id, webhookResJson2.id]) {
+          const res = await client.get(`/webhook/${id}`);
+          expect(res.status).toBe(200);
+          const { status } = (await res.json()) as Webhook;
+          if (status.lastTriggeredAt >= now) {
+            hooksCalled++;
+          }
+        }
+      }
+      expect(hooksCalled).toBe(2);


FTR: this was the very cryptic failure I was getting from the tests. What was happening was that, after creating the separate jobs DB pool, the test below (records webhook logs) started failing because the webhook created there was being called twice, with the extra call coming from the event triggered in this test.

The problem was that this test never really waited for (or even checked) the webhooks created here actually being triggered. So the test finished while leaving the background runner to call the webhooks for it. This probably caused some flakiness every now and then as well.

After switching to the separate DB pool, the very first event handled by the WebhookCannon took a little longer, since it had to set up a new connection with Postgres (one that hadn't already been initialized by the usual test setup). This small delay was enough for the next test to start and create another webhook, which then got called for the same event. I could confirm this by making a dummy query on the jobsDb in this test, which made the failure go away because the pool was then already initialized.

I fixed this by correcting the test logic though, which was not even checking that the webhooks were being called. I'm explaining it here only because the failure seems so unrelated to the changes in this PR, but it actually was related.
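
The polling loop in the diff above could also be factored into a small helper. This is only a sketch: `waitFor` and its defaults are hypothetical names chosen to mirror the 5 attempts and 500 ms sleep in the test, and `sleep` is redefined here just to keep the example self-contained.

```typescript
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Polls `check` until it returns true or the attempts run out.
// Returns whether the condition was eventually met.
async function waitFor(
  check: () => Promise<boolean>,
  attempts = 5,
  intervalMs = 500
): Promise<boolean> {
  for (let i = 0; i < attempts; i++) {
    await sleep(intervalMs);
    if (await check()) return true;
  }
  return false;
}
```

A test would then assert on the result, e.g. `expect(await waitFor(bothHooksCalled)).toBe(true)`, where `bothHooksCalled` is whatever check the test needs. That way it cannot finish while background work is still pending.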

leszko

Added 2 comments. Other than that, LGTM 👍

victorges requested a review from a team as a code owner January 17, 2024 19:26

victorges marked this pull request as draft January 17, 2024 19:26

vercel bot deployed to Preview January 17, 2024 19:26

github-advanced-security bot found potential problems Jan 17, 2024

packages/api/src/controllers/stream.ts (fixed)

vercel bot deployed to Preview January 17, 2024 19:44

github-advanced-security bot found potential problems Jan 17, 2024

packages/api/src/controllers/stream.ts (dismissed)

vercel bot deployed to Preview January 17, 2024 20:13

victorges marked this pull request as ready for review January 17, 2024 20:14

victorges requested a review from leszko January 17, 2024 20:14

vercel bot deployed to Preview January 17, 2024 21:00

victorges force-pushed the vg/feat/active-cleanup-cron branch from 076f89b to 0dd1ea3 January 18, 2024 01:33

vercel bot deployed to Preview January 18, 2024 01:35

leszko reviewed Jan 18, 2024

victorges mentioned this pull request Mar 28, 2024

api: Attempt processing recording up to 5 times #2119

Merged

victorges force-pushed the vg/feat/active-cleanup-cron branch from 0dd1ea3 to 22e0160 April 24, 2024 18:15

vercel bot deployed to Preview April 24, 2024 18:21

vercel bot had a problem deploying to Preview April 25, 2024 08:24

vercel bot deployed to Preview April 25, 2024 08:33

vercel bot deployed to Preview April 25, 2024 14:14

victorges force-pushed the vg/feat/active-cleanup-cron branch from 007bbc1 to f2a4ad4 April 25, 2024 14:36

vercel bot had a problem deploying to Preview April 25, 2024 14:39

victorges force-pushed the vg/feat/active-cleanup-cron branch from f2a4ad4 to 10eea44 April 25, 2024 14:58

vercel bot deployed to Preview April 25, 2024 15:02

victorges force-pushed the vg/feat/active-cleanup-cron branch from 10eea44 to fb9aa19 April 26, 2024 12:44

vercel bot deployed to Preview April 26, 2024 12:48

victorges force-pushed the vg/feat/active-cleanup-cron branch from fb9aa19 to eec5d11 April 26, 2024 12:50

vercel bot deployed to Preview April 26, 2024 12:54

victorges commented Apr 26, 2024

api/stream: Make size of clean-up more predictable

85967f8

Don't filter on the outer level, but rather make sure the pipeline supports an array of child streams.

vercel bot deployed to Preview May 4, 2024 01:25

api/stream: Remove test for deduping by sessionId

ffe8729

Now we clean all of them, but we merge before the cleanup processing only the session once. No easy way to catch that on tests.

vercel bot deployed to Preview May 4, 2024 01:39

api/stream: Stop using lodash

b953e65

vercel bot deployed to Preview May 4, 2024 10:47

api/stream: Run clean-up synchronously on API

0c2fb67

Also parallelize parent streams

vercel bot deployed to Preview May 4, 2024 10:55

api/stream: Default streamId to empty str

d2d7fbe

vercel bot deployed to Preview May 4, 2024 11:21

api/stream: Parallelize child stream processing

ceea027

vercel bot deployed to Preview May 4, 2024 15:47

api/stream: Only use isActive field from streams

8621089

There are some bugged sessions that got isActive set to them

vercel bot deployed to Preview May 5, 2024 16:32

leszko approved these changes May 6, 2024

packages/api/src/controllers/stream.ts (outdated, resolved)

.github/workflows/cron-streams-active-cleanup.yaml (resolved)

Revert "api/stream: Only use isActive field from streams"

5724c14

This reverts commit 8621089. It was a staging only thing, likely from a development version.

vercel bot deployed to Preview May 6, 2024 13:18

api/stream: Fallback to createdAt if lastSeen never set

1178457

vercel bot deployed to Preview May 6, 2024 13:25

api/stream: Remove monkey-patching of stream objects 😍

269e235

vercel bot deployed to Preview May 6, 2024 13:27

[DEV] Increase log level to debug in staging

4ba4937

vercel bot deployed to Preview May 6, 2024 13:30

api/stream: Only trigger recording processing when stream goes offline

ce4d436

vercel bot deployed to Preview May 6, 2024 16:11

api/stream: Check child streams shouldCleanupIsActive as well

63d39f7

vercel bot deployed to Preview May 6, 2024 16:14

victorges merged commit 722e01f into master May 6, 2024
13 checks passed

victorges deleted the vg/feat/active-cleanup-cron branch May 6, 2024 19:37

victorges mentioned this pull request May 9, 2024

api: Create entrypoints for long-running jobs #2174

Open
