Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose node death information in dashboard #45320

Merged
merged 16 commits into from
May 22, 2024

Conversation

ruisearch42
Copy link
Contributor

@ruisearch42 ruisearch42 commented May 14, 2024

Why are these changes needed?

This is one PR of a series to better propagate and expose node death information.

Background: in Ray, a node can be either ALIVE or DEAD, and the death reason of a node could be unexpected (e.g., crash) or expected (e.g., idle termination or spot preemption). Currently, GCS knows the actual death reason, but this information is not shown to the users. As a result, users might think a node crashes when they see it is DEAD, while in reality these could be expected scenarios such as spot preemption. This may give users the wrong impression that Ray is unstable. In the series of changes, we are going to better propagate and expose the node death info.

This PR does the following:

  • Expose node death info in dashboard's cluster page where the list of nodes is shown

See more details in design doc:
https://docs.google.com/document/d/1tn6Uj-SoEAaBu5_HWl4dNo3JBzTd_w09RPsIWQpUU_s/edit

Screenshots

Before change:
Screenshot 2024-05-20 at 9 13 43 AM

After change:
Screenshot 2024-05-20 at 11 30 37 AM
Screenshot 2024-05-20 at 11 30 49 AM

Related PRs

#45128
#45357

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
@ruisearch42 ruisearch42 self-assigned this May 16, 2024
Copy link
Contributor

@alanwguo alanwguo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you include a screenshot of the change in this PR?

dashboard/client/src/pages/node/NodeRow.tsx Outdated Show resolved Hide resolved
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
@ruisearch42
Copy link
Contributor Author

Can you include a screenshot of the change in this PR?

Sure, PFA.
Screenshot 2024-05-17 at 3 59 50 PM

@ruisearch42 ruisearch42 added the go Trigger full test run on premerge label May 19, 2024
Copy link
Contributor

@jjyao jjyao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you have the screenshot in the PR description: with and without the pop-up.

dashboard/client/src/pages/node/NodeRow.tsx Outdated Show resolved Hide resolved
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
@ruisearch42 ruisearch42 marked this pull request as ready for review May 20, 2024 16:39
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
@scottsun94
Copy link
Contributor

scottsun94 commented May 20, 2024

Can we hide the "expand.." button when there is no message? We can just show - there?
Screenshot 2024-05-20 at 10 38 47 AM

dashboard/utils.py Outdated Show resolved Hide resolved
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Copy link
Contributor

@alanwguo alanwguo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!

dashboard/utils.py Outdated Show resolved Hide resolved
Co-authored-by: Alan Guo <aguo@aguo.software>
Signed-off-by: Rui Qiao <161574667+ruisearch42@users.noreply.github.com>
@anyscalesam
Copy link
Collaborator

Reviewed with @ruisearch42 > after these set of 2 PRs there should be two additional and we should be wrapped up with the Expose Ray failures stability project.

Copy link
Contributor

@jjyao jjyao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

dashboard/utils.py Show resolved Hide resolved
dashboard/utils.py Outdated Show resolved Hide resolved
dashboard/utils.py Show resolved Hide resolved
ruisearch42 and others added 3 commits May 21, 2024 13:36
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Rui Qiao <161574667+ruisearch42@users.noreply.github.com>
dashboard/utils.py Outdated Show resolved Hide resolved
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
@jjyao jjyao merged commit c8412a8 into ray-project:master May 22, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go Trigger full test run on premerge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants