Expose node death information in dashboard #45320
Can you include a screenshot of the change in this PR?
Could you put the screenshots in the PR description, both with and without the pop-up?
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
thanks!
Co-authored-by: Alan Guo <aguo@aguo.software>
Signed-off-by: Rui Qiao <161574667+ruisearch42@users.noreply.github.com>
Reviewed with @ruisearch42: after this set of 2 PRs there should be two more, and then we should be wrapped up with the Expose Ray failures stability project.
LGTM
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Rui Qiao <161574667+ruisearch42@users.noreply.github.com>
Why are these changes needed?
This is one PR of a series to better propagate and expose node death information.
Background: in Ray, a node is either ALIVE or DEAD, and a node's death can be unexpected (e.g., a crash) or expected (e.g., idle termination or spot preemption). Currently, GCS knows the actual death reason, but this information is not shown to users. As a result, users who see a DEAD node might assume it crashed, when in reality the death was an expected event such as spot preemption. This may give users the wrong impression that Ray is unstable. In this series of changes, we are going to better propagate and expose the node death info.
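To make the expected/unexpected distinction in the background above concrete, here is a minimal sketch in Python. All names in it (`NodeState`, `DeathReason`, `NodeStatus`, `describe`) are hypothetical and chosen for illustration only; they are not Ray's actual API or the fields this PR adds. The reasons are just the three examples mentioned above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class NodeState(Enum):
    ALIVE = "ALIVE"
    DEAD = "DEAD"


class DeathReason(Enum):
    # Hypothetical reason names, taken from the examples in the background text.
    CRASH = "crash"
    IDLE_TERMINATION = "idle termination"
    SPOT_PREEMPTION = "spot preemption"


# Assumption: idle termination and spot preemption count as expected deaths.
EXPECTED_REASONS = {DeathReason.IDLE_TERMINATION, DeathReason.SPOT_PREEMPTION}


@dataclass(frozen=True)
class NodeStatus:
    node_id: str
    state: NodeState
    death_reason: Optional[DeathReason] = None


def describe(status: NodeStatus) -> str:
    """Return a label that tells the user whether a death was expected."""
    if status.state is NodeState.ALIVE:
        return "ALIVE"
    if status.death_reason is None:
        return "DEAD (reason unknown)"
    kind = "expected" if status.death_reason in EXPECTED_REASONS else "unexpected"
    return f"DEAD ({kind}: {status.death_reason.value})"


if __name__ == "__main__":
    print(describe(NodeStatus("node-1", NodeState.DEAD, DeathReason.SPOT_PREEMPTION)))
    print(describe(NodeStatus("node-2", NodeState.DEAD, DeathReason.CRASH)))
```

With a label like this, a preempted spot node reads differently from a crashed one, which is the kind of confusion the background describes.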
This PR does the following:
See more details in the design doc:
https://docs.google.com/document/d/1tn6Uj-SoEAaBu5_HWl4dNo3JBzTd_w09RPsIWQpUU_s/edit
Screenshots
Before change:
After change:
Related PRs
#45128
#45357
Checks
I've signed off every commit (git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
[...] method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.