[core][experimental] Handle NCCL errors in accelerated DAGs #45307

Open
3 tasks
Tracked by #43830
stephanie-wang opened this issue May 13, 2024 · 0 comments
Assignees
Labels
accelerated-dag enhancement Request for new feature and/or capability P1 Issue that should be fixed within a few weeks usability

Comments

Contributor

stephanie-wang commented May 13, 2024

Description

Accelerated DAGs that use NCCL should handle the following kinds of errors:

  • Application errors (Python exceptions)
  • Peer actor failure
  • Network errors

Ideally, actors participating in the DAG should still be usable after the error is thrown.
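
To make the requirement concrete, here is a minimal sketch of the desired behavior in plain Python. It does not use Ray or NCCL. Every name in it (`DagErrorKind`, `PeerActorDied`, `ChannelNetworkError`, `Worker`, `run_stage`) is hypothetical and only illustrates the three error categories above and the idea that a participant stays usable after an error is surfaced.

```python
import enum


class DagErrorKind(enum.Enum):
    # The three categories listed in this issue.
    APPLICATION = "application error (Python exception)"
    PEER_FAILURE = "peer actor failure"
    NETWORK = "network error"


class PeerActorDied(Exception):
    """Hypothetical: a peer in the DAG exited mid-transfer."""


class ChannelNetworkError(Exception):
    """Hypothetical: the transport reported a failure."""


def classify(exc):
    """Map an exception to one of the three categories."""
    if isinstance(exc, PeerActorDied):
        return DagErrorKind.PEER_FAILURE
    if isinstance(exc, ChannelNetworkError):
        return DagErrorKind.NETWORK
    return DagErrorKind.APPLICATION


class Worker:
    """Stand-in for an actor participating in the DAG (not a Ray actor)."""

    def __init__(self):
        self.calls = 0

    def step(self, x):
        self.calls += 1
        if x < 0:
            raise ValueError("negative input")
        return x * 2


def run_stage(worker, x):
    """Run one stage and return (result, error).

    The error is returned as a value so the caller's loop keeps running
    and the worker can accept the next input.
    """
    try:
        return worker.step(x), None
    except Exception as exc:
        return None, (classify(exc), exc)


worker = Worker()
first = run_stage(worker, -1)   # fails with an application error
second = run_stage(worker, 3)   # same worker still serves the next call
print(first, second)
```

The point of the sketch is the second call: the error from the first call is reported to the caller with its category, and the same worker object handles the next input without being recreated.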

Use case

No response

stephanie-wang added the enhancement, triage, P1, and accelerated-dag labels and removed the triage label on May 13, 2024
stephanie-wang self-assigned this on May 13, 2024
anyscalesam added this to the ADAG Developer Preview milestone on May 23, 2024