
v3: Refactor attempt creation to be worker requested #1077

Merged
merged 68 commits into main from v3/worker-attempt-creation on May 30, 2024

Conversation

@ericallam (Member) commented Apr 30, 2024

Changes

Lazy attempt creation:

  • The "worker" side is now responsible for requesting a new attempt
  • Currently this is only implemented on the "dev" side
  • Moves the visibility queue out of redis and into a graphile job, which we now use to fail in-progress runs that hit the visibility timeout (unless the run is pending)
  • Better handling of run cancellation when dev reconnects to a different web server instance
  • This should allow prod workers to retry without checkpointing and without going back into the queue
  • Each attempt gets a fresh process. The process is killed after the attempt completes, and there can never be more than two processes up per run.
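The graphile-job visibility check described above can be sketched roughly as follows. This is an illustrative sketch, not the actual service: the names (`RunStatus`, `decideOnVisibilityTimeout`) and the requeue-on-pending behavior are assumptions based on the bullet points, with only "fail in-progress runs unless pending" stated in the PR.

```typescript
// Hypothetical sketch of the visibility-timeout check that now runs as a
// scheduled graphile job instead of a redis-backed queue. Names and the
// "requeue" branch are illustrative assumptions, not the real implementation.
type RunStatus = "PENDING" | "EXECUTING" | "COMPLETED" | "FAILED";

type Decision = "fail" | "requeue" | "ignore";

function decideOnVisibilityTimeout(status: RunStatus): Decision {
  switch (status) {
    case "PENDING":
      // A pending run hasn't started an attempt yet, so hitting the
      // timeout presumably just means it should go back on the queue.
      return "requeue";
    case "EXECUTING":
      // An in-progress run that hit the visibility timeout without
      // heartbeating is treated as dead and failed.
      return "fail";
    default:
      // Completed or already-failed runs need no action.
      return "ignore";
  }
}
```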

Other changes:

  • Clear paused states before retry
  • Detect and handle unrecoverable worker errors
  • Remove checkpoints after successful push
  • Permanently switch to DO hosted busybox image
  • Fix IPC timeout issue, or at least handle it more gracefully
  • Handle checkpoint failures
  • Basic chaos monkey for checkpoint testing
  • Stack traces are back in the dashboard
  • Display final errors on root span
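One way to "handle it more gracefully" when an IPC call may never answer is to race it against a timer and return a typed result instead of hanging or crashing the worker. The helper below is a minimal sketch of that pattern; `withIpcTimeout` and `IpcResult` are hypothetical names, not the actual fix in this PR.

```typescript
// Hypothetical sketch: race an IPC call against a timeout and surface a
// typed result, so the caller can decide to retry or fail the attempt
// instead of hanging. Names are illustrative, not from the codebase.
type IpcResult<T> = { ok: true; value: T } | { ok: false; reason: "timeout" };

async function withIpcTimeout<T>(
  call: Promise<T>,
  ms: number
): Promise<IpcResult<T>> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<IpcResult<T>>((resolve) => {
    timer = setTimeout(() => resolve({ ok: false, reason: "timeout" }), ms);
  });
  const wrapped = call.then((value): IpcResult<T> => ({ ok: true, value }));
  try {
    return await Promise.race([wrapped, timeout]);
  } finally {
    // Always clear the timer so it doesn't keep the process alive.
    clearTimeout(timer);
  }
}
```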

Testing checklist

There have been many changes to what happens after attempt completion and before retries, so it's not enough to test that tasks complete successfully. In all scenarios, failure needs to be tested as well; 3-4 retries should be enough. This will also ensure we test for memory leaks, particularly when combined with checkpoints.

General guidelines:

  • keep an eye on memory usage
  • verify checkpoint size of all attempts
  • when using waits, always force failure afterwards, not before
  • also test with retry delays >30s

All relevant catalog entries start with lazy-, and the following payload format can be used with all of them:

{
  "delayInSeconds": 35,
  "forceError": true
}
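A lazy- test task consuming this payload would roughly do the following: optionally wait, then optionally throw so the retry and checkpoint paths get exercised. This plain-function sketch is illustrative (`runLazyTestTask` and `LazyTestPayload` are made-up names, not the actual catalog entries):

```typescript
// Illustrative sketch of what a lazy- catalog entry does with the payload
// above. Not the actual catalog code; names are hypothetical.
interface LazyTestPayload {
  delayInSeconds?: number;
  forceError?: boolean;
}

async function runLazyTestTask(payload: LazyTestPayload): Promise<string> {
  const delayMs = (payload.delayInSeconds ?? 0) * 1000;
  if (delayMs > 0) {
    // Delays over ~30s matter because workers may checkpoint during long
    // waits; see the retry-delay guideline above.
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  if (payload.forceError) {
    // Throwing here exercises the retry machinery under test.
    throw new Error("forced failure for retry testing");
  }
  return "completed";
}
```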

New SDK

Dev

  • immediate return
  • immediate return - with forced failure
  • 1s wait
  • 1s wait - with forced failure
  • 35s wait
  • 35s wait - with forced failure
  • single dependency
  • single dependency - with forced failure
  • batch dependency
  • batch dependency - with forced failure
  • consecutive 1s wait
  • consecutive 1s wait - with forced failure
  • consecutive 35s wait
  • consecutive 35s wait - with forced failure
  • consecutive single dependency
  • consecutive single dependency - with forced failure
  • consecutive batch dependency
  • consecutive batch dependency - with forced failure
  • 35s wait, single dep
  • 35s wait, single dep - with forced failure
  • 35s wait, batch dep
  • 35s wait, batch dep - with forced failure
  • single dep, batch dep
  • single dep, batch dep - with forced failure
  • single dep, 35s wait
  • single dep, 35s wait - with forced failure
  • batch dep, 35s wait
  • batch dep, 35s wait - with forced failure
  • batch dep, single dep
  • batch dep, single dep - with forced failure
  • a random selection of more complex tasks

Prod

  • immediate return
  • immediate return - with forced failure
  • 1s wait
  • 1s wait - with forced failure
  • 35s wait
  • 35s wait - with forced failure
  • single dependency
  • single dependency - with forced failure
  • batch dependency
  • batch dependency - with forced failure
  • consecutive 1s wait
  • consecutive 1s wait - with forced failure
  • consecutive 35s wait
  • consecutive 35s wait - with forced failure
  • consecutive single dependency
  • consecutive single dependency - with forced failure
  • consecutive batch dependency
  • consecutive batch dependency - with forced failure
  • 35s wait, single dep
  • 35s wait, single dep - with forced failure
  • 35s wait, batch dep
  • 35s wait, batch dep - with forced failure
  • single dep, batch dep
  • single dep, batch dep - with forced failure
  • single dep, 35s wait
  • single dep, 35s wait - with forced failure
  • batch dep, 35s wait
  • batch dep, 35s wait - with forced failure
  • batch dep, single dep
  • batch dep, single dep - with forced failure
  • a random selection of more complex tasks

Old SDK

Dev

  • immediate return
  • immediate return - with forced failure
  • 1s wait
  • 1s wait - with forced failure
  • 35s wait
  • 35s wait - with forced failure
  • single dependency
  • single dependency - with forced failure
  • batch dependency
  • batch dependency - with forced failure
  • consecutive 1s wait
  • consecutive 1s wait - with forced failure
  • consecutive 35s wait
  • consecutive 35s wait - with forced failure
  • consecutive single dependency
  • consecutive single dependency - with forced failure
  • consecutive batch dependency
  • consecutive batch dependency - with forced failure
  • 35s wait, single dep
  • 35s wait, single dep - with forced failure
  • 35s wait, batch dep
  • 35s wait, batch dep - with forced failure
  • single dep, batch dep
  • single dep, batch dep - with forced failure
  • single dep, 35s wait
  • single dep, 35s wait - with forced failure
  • batch dep, 35s wait
  • batch dep, 35s wait - with forced failure
  • batch dep, single dep
  • batch dep, single dep - with forced failure
  • a random selection of more complex tasks

Prod

  • immediate return
  • immediate return - with forced failure
  • 1s wait
  • 1s wait - with forced failure
  • 35s wait
  • 35s wait - with forced failure
  • single dependency
  • single dependency - with forced failure
  • batch dependency
  • batch dependency - with forced failure
  • consecutive 1s wait
  • consecutive 1s wait - with forced failure
  • consecutive 35s wait
  • consecutive 35s wait - with forced failure
  • consecutive single dependency
  • consecutive single dependency - with forced failure
  • consecutive batch dependency
  • consecutive batch dependency - with forced failure
  • 35s wait, single dep
  • 35s wait, single dep - with forced failure
  • 35s wait, batch dep
  • 35s wait, batch dep - with forced failure
  • single dep, batch dep
  • single dep, batch dep - with forced failure
  • single dep, 35s wait
  • single dep, 35s wait - with forced failure
  • batch dep, 35s wait
  • batch dep, 35s wait - with forced failure
  • batch dep, single dep
  • batch dep, single dep - with forced failure
  • a random selection of more complex tasks


changeset-bot bot commented Apr 30, 2024

🦋 Changeset detected

Latest commit: 2099d91

The changes in this PR will be included in the next version bump.


@ericallam ericallam force-pushed the v3/worker-attempt-creation branch from c822c89 to 1bba5d5 Compare May 1, 2024 09:42
@nicktrn nicktrn mentioned this pull request May 24, 2024
@nicktrn (Collaborator) left a comment
Great work everyone 🤝

@nicktrn nicktrn merged commit e69ffd3 into main May 30, 2024
4 checks passed
@nicktrn nicktrn deleted the v3/worker-attempt-creation branch May 30, 2024 10:05
jacobparis pushed a commit to jacobparis/trigger.dev that referenced this pull request Jun 1, 2024

* WIP worker TaskRunAttempt creation

* Handling failing task runs that cannot create an attempt for whatever reason

* Move the visibility queue stuff into a graphile job

* Fixed task runs with unsanitized queue names

* “Borrow” the code from alerts PR to get self hosted deployments working

* Add an admin API endpoint to get info about the shared marqs queue

* Allow admins to view any project metrics

* start adding lazy attempts to prod

* lazy attempt creation for prod workers

* resurrect prod stack traces

* add exception event to failed run spans

* simplify dependency resumes

* fix typecheck

* fix merge

* fresh process for all attempts

* always try sigterm first

* stop heartbeat timeout on non-inplace replace message

* add missing ack on checkpoint creation service failure

* bypass dequeue for retries with running worker

* respect retry delays

* crash runs with invalid run status for execution

* remove debug logs

* fix nack message

* fix version locking

* fresh attempt processes in dev and prod

* improve handling of ipc timeouts

* consider checkpoint failures on cancellation

* add basic chaos monkey to checkpointer

* changeset

* control forced checkpoint simulation via env var

* fix merge

* kill old attempt processes before checkpointing

* detailed perf logging for checkpointing

* add coordinator otlp endpoint example

* improve prod run cancellation

* rename supports lazy attempts migration

* fix graceful exit

* fix retry mechanics

* clear paused state before retry

* remove checkpoint image after push

* crash worker on unrecoverable errors

* refactor unrecoverable error emit

* switch to do hosted busybox image

* increase wait for duration ipc timeout

* add changeset for misc fixes

* fix merge

* fix retry delay span runId

* fix dev retries

* improve prod worker logging

* log checkpoint sizes

* add lazy attempts catalog entries

* Fixed merge issue: use zodFetch, not wrapZodFetch

* Revert "Fixed merge issue: use zodFetch, not wrapZodFetch"

This reverts commit d137e4e.

* importEnvVars uses wrapZodFetch now

* add backwards compat for retries without checkpoints

* handle more cases of unrecoverable runs

* don't kill the child process if it shouldn't be killed

---------

Co-authored-by: nicktrn <55853254+nicktrn@users.noreply.github.com>
Co-authored-by: Matt Aitken <matt@mattaitken.com>
3 participants