
v3: Refactor attempt creation to be worker requested #1077

Merged
merged 68 commits into main from v3/worker-attempt-creation on May 30, 2024

Conversation

@ericallam (Member) commented Apr 30, 2024

Changes

Lazy attempt creation:

  • The "worker" side is now responsible for requesting a new attempt
  • Currently this is only implemented on the "dev" side
  • Moves the visibility queue out of redis and into a graphile job, which we now use to fail in-progress runs that hit the visibility timeout (unless the run is pending)
  • Better handling of run cancellation when dev reconnects to a different web server instance
  • This should allow prod workers to retry without checkpointing and without going back into the queue
  • Each attempt gets a fresh process. The process is killed after the attempt completes, and there can never be more than two processes up per run.
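The graphile-job visibility check described above can be sketched roughly as follows. This is an illustrative sketch, not the actual service: the names (`RunStatus`, `decideOnVisibilityTimeout`) and the requeue-on-pending behavior are assumptions based on the bullet points, with only "fail in-progress runs unless pending" stated in the PR.

```typescript
// Hypothetical sketch of the visibility-timeout check that now runs as a
// scheduled graphile job instead of a redis-backed queue. Names and the
// "requeue" branch are illustrative assumptions, not the real implementation.
type RunStatus = "PENDING" | "EXECUTING" | "COMPLETED" | "FAILED";

type Decision = "fail" | "requeue" | "ignore";

function decideOnVisibilityTimeout(status: RunStatus): Decision {
  switch (status) {
    case "PENDING":
      // A pending run hasn't started an attempt yet, so hitting the
      // timeout presumably just means it should go back on the queue.
      return "requeue";
    case "EXECUTING":
      // An in-progress run that hit the visibility timeout without
      // heartbeating is treated as dead and failed.
      return "fail";
    default:
      // Completed or already-failed runs need no action.
      return "ignore";
  }
}
```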

Other changes:

  • Clear paused states before retry
  • Detect and handle unrecoverable worker errors
  • Remove checkpoints after successful push
  • Permanently switch to DO hosted busybox image
  • Fix IPC timeout issue, or at least handle it more gracefully
  • Handle checkpoint failures
  • Basic chaos monkey for checkpoint testing
  • Stack traces are back in the dashboard
  • Display final errors on root span
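One way to "handle it more gracefully" when an IPC call may never answer is to race it against a timer and return a typed result instead of hanging or crashing the worker. The helper below is a minimal sketch of that pattern; `withIpcTimeout` and `IpcResult` are hypothetical names, not the actual fix in this PR.

```typescript
// Hypothetical sketch: race an IPC call against a timeout and surface a
// typed result, so the caller can decide to retry or fail the attempt
// instead of hanging. Names are illustrative, not from the codebase.
type IpcResult<T> = { ok: true; value: T } | { ok: false; reason: "timeout" };

async function withIpcTimeout<T>(
  call: Promise<T>,
  ms: number
): Promise<IpcResult<T>> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<IpcResult<T>>((resolve) => {
    timer = setTimeout(() => resolve({ ok: false, reason: "timeout" }), ms);
  });
  const wrapped = call.then((value): IpcResult<T> => ({ ok: true, value }));
  try {
    return await Promise.race([wrapped, timeout]);
  } finally {
    // Always clear the timer so it doesn't keep the process alive.
    clearTimeout(timer);
  }
}
```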

Testing checklist

There have been many changes to what happens after attempt completion and before retries, so it's not enough to test that tasks complete successfully. In all scenarios, failure needs to be tested as well; 3-4 retries should be enough. This will also ensure we test for memory leaks, particularly when combined with checkpoints.

General guidelines:

  • keep an eye on memory usage
  • verify checkpoint size of all attempts
  • when using waits, always force failure afterwards, not before
  • also test with retry delays >30s

All relevant catalog entries start with lazy-, and the following payload format can be used with all of them:

{
  "delayInSeconds": 35,
  "forceError": true
}
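A lazy- test task consuming this payload would roughly do the following: optionally wait, then optionally throw so the retry and checkpoint paths get exercised. This plain-function sketch is illustrative (`runLazyTestTask` and `LazyTestPayload` are made-up names, not the actual catalog entries):

```typescript
// Illustrative sketch of what a lazy- catalog entry does with the payload
// above. Not the actual catalog code; names are hypothetical.
interface LazyTestPayload {
  delayInSeconds?: number;
  forceError?: boolean;
}

async function runLazyTestTask(payload: LazyTestPayload): Promise<string> {
  const delayMs = (payload.delayInSeconds ?? 0) * 1000;
  if (delayMs > 0) {
    // Delays over ~30s matter because workers may checkpoint during long
    // waits; see the retry-delay guideline above.
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  if (payload.forceError) {
    // Throwing here exercises the retry machinery under test.
    throw new Error("forced failure for retry testing");
  }
  return "completed";
}
```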

New SDK

Dev

  • immediate return
  • immediate return - with forced failure
  • 1s wait
  • 1s wait - with forced failure
  • 35s wait
  • 35s wait - with forced failure
  • single dependency
  • single dependency - with forced failure
  • batch dependency
  • batch dependency - with forced failure
  • consecutive 1s wait
  • consecutive 1s wait - with forced failure
  • consecutive 35s wait
  • consecutive 35s wait - with forced failure
  • consecutive single dependency
  • consecutive single dependency - with forced failure
  • consecutive batch dependency
  • consecutive batch dependency - with forced failure
  • 35s wait, single dep
  • 35s wait, single dep - with forced failure
  • 35s wait, batch dep
  • 35s wait, batch dep - with forced failure
  • single dep, batch dep
  • single dep, batch dep - with forced failure
  • single dep, 35s wait
  • single dep, 35s wait - with forced failure
  • batch dep, 35s wait
  • batch dep, 35s wait - with forced failure
  • batch dep, single dep
  • batch dep, single dep - with forced failure
  • a random selection of more complex tasks

Prod

  • immediate return
  • immediate return - with forced failure
  • 1s wait
  • 1s wait - with forced failure
  • 35s wait
  • 35s wait - with forced failure
  • single dependency
  • single dependency - with forced failure
  • batch dependency
  • batch dependency - with forced failure
  • consecutive 1s wait
  • consecutive 1s wait - with forced failure
  • consecutive 35s wait
  • consecutive 35s wait - with forced failure
  • consecutive single dependency
  • consecutive single dependency - with forced failure
  • consecutive batch dependency
  • consecutive batch dependency - with forced failure
  • 35s wait, single dep
  • 35s wait, single dep - with forced failure
  • 35s wait, batch dep
  • 35s wait, batch dep - with forced failure
  • single dep, batch dep
  • single dep, batch dep - with forced failure
  • single dep, 35s wait
  • single dep, 35s wait - with forced failure
  • batch dep, 35s wait
  • batch dep, 35s wait - with forced failure
  • batch dep, single dep
  • batch dep, single dep - with forced failure
  • a random selection of more complex tasks

Old SDK

Dev

  • immediate return
  • immediate return - with forced failure
  • 1s wait
  • 1s wait - with forced failure
  • 35s wait
  • 35s wait - with forced failure
  • single dependency
  • single dependency - with forced failure
  • batch dependency
  • batch dependency - with forced failure
  • consecutive 1s wait
  • consecutive 1s wait - with forced failure
  • consecutive 35s wait
  • consecutive 35s wait - with forced failure
  • consecutive single dependency
  • consecutive single dependency - with forced failure
  • consecutive batch dependency
  • consecutive batch dependency - with forced failure
  • 35s wait, single dep
  • 35s wait, single dep - with forced failure
  • 35s wait, batch dep
  • 35s wait, batch dep - with forced failure
  • single dep, batch dep
  • single dep, batch dep - with forced failure
  • single dep, 35s wait
  • single dep, 35s wait - with forced failure
  • batch dep, 35s wait
  • batch dep, 35s wait - with forced failure
  • batch dep, single dep
  • batch dep, single dep - with forced failure
  • a random selection of more complex tasks

Prod

  • immediate return
  • immediate return - with forced failure
  • 1s wait
  • 1s wait - with forced failure
  • 35s wait
  • 35s wait - with forced failure
  • single dependency
  • single dependency - with forced failure
  • batch dependency
  • batch dependency - with forced failure
  • consecutive 1s wait
  • consecutive 1s wait - with forced failure
  • consecutive 35s wait
  • consecutive 35s wait - with forced failure
  • consecutive single dependency
  • consecutive single dependency - with forced failure
  • consecutive batch dependency
  • consecutive batch dependency - with forced failure
  • 35s wait, single dep
  • 35s wait, single dep - with forced failure
  • 35s wait, batch dep
  • 35s wait, batch dep - with forced failure
  • single dep, batch dep
  • single dep, batch dep - with forced failure
  • single dep, 35s wait
  • single dep, 35s wait - with forced failure
  • batch dep, 35s wait
  • batch dep, 35s wait - with forced failure
  • batch dep, single dep
  • batch dep, single dep - with forced failure
  • a random selection of more complex tasks


changeset-bot bot commented Apr 30, 2024

🦋 Changeset detected

Latest commit: 2099d91

The changes in this PR will be included in the next version bump.


@ericallam ericallam force-pushed the v3/worker-attempt-creation branch from c822c89 to 1bba5d5 Compare May 1, 2024 09:42
@nicktrn nicktrn mentioned this pull request May 24, 2024
@nicktrn (Collaborator) left a comment
Great work everyone 🤝

@nicktrn nicktrn merged commit e69ffd3 into main May 30, 2024
4 checks passed
@nicktrn nicktrn deleted the v3/worker-attempt-creation branch May 30, 2024 10:05
jacobparis pushed a commit to jacobparis/trigger.dev that referenced this pull request Jun 1, 2024

* WIP worker TaskRunAttempt creation

* Handling failing task runs that cannot create an attempt for whatever reason

* Move the visibility queue stuff into a graphile job

* Fixed task runs with unsanitized queue names

* “Borrow” the code from alerts PR to get self hosted deployments working

* Add an admin API endpoint to get info about the shared marqs queue

* Allow admins to view any project metrics

* start adding lazy attempts to prod

* lazy attempt creation for prod workers

* resurrect prod stack traces

* add exception event to failed run spans

* simplify dependency resumes

* fix typecheck

* fix merge

* fresh process for all attempts

* always try sigterm first

* stop heartbeat timeout on non-inplace replace message

* add missing ack on checkpoint creation service failure

* bypass dequeue for retries with running worker

* respect retry delays

* crash runs with invalid run status for execution

* remove debug logs

* fix nack message

* fix version locking

* fresh attempt processes in dev and prod

* improve handling of ipc timeouts

* consider checkpoint failures on cancellation

* add basic chaos monkey to checkpointer

* changeset

* control forced checkpoint simulation via env var

* fix merge

* kill old attempt processes before checkpointing

* detailed perf logging for checkpointing

* add coordinator otlp endpoint example

* improve prod run cancellation

* rename supports lazy attempts migration

* fix graceful exit

* fix retry mechanics

* clear paused state before retry

* remove checkpoint image after push

* crash worker on unrecoverable errors

* refactor unrecoverable error emit

* switch to do hosted busybox image

* increase wait for duration ipc timeout

* add changeset for misc fixes

* fix merge

* fix retry delay span runId

* fix dev retries

* improve prod worker logging

* log checkpoint sizes

* add lazy attempts catalog entries

* Fixed merge issue: use zodFetch, not wrapZodFetch

* Revert "Fixed merge issue: use zodFetch, not wrapZodFetch"

This reverts commit d137e4e.

* importEnvVars uses wrapZodFetch now

* add backwards compat for retries without checkpoints

* handle more cases of unrecoverable runs

* don't kill the child process if it shouldn't be killed

---------

Co-authored-by: nicktrn <55853254+nicktrn@users.noreply.github.com>
Co-authored-by: Matt Aitken <matt@mattaitken.com>
3 participants