Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update the component activity api to take pending tasks into account #45315

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

sofianhnaide
Copy link
Contributor

@sofianhnaide sofianhnaide commented May 13, 2024

Why are these changes needed?

This change updates the component activity endpoint to add an option to consider a driver active only when the driver has pending tasks. This is important for idle detection, as previously, long-lived Python processes (like notebooks) is harder to detect since the process was always considered active.

This change also updates the last activity time for the driver to the time of the last active tasks in it. This is added as an option to not break the existing behavior and leave it to the client to decide.

ray-activity-2.mp4

Manual test

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@sofianhnaide sofianhnaide force-pushed the sofian/update-component-activity-api branch from a3582ad to 1625f2d Compare May 14, 2024 22:50
detail=True,
timeout=timeout,
)
tasks = await self._state_api.list_tasks(option=option)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using state-api here might not be the right choice. state-api is designed for observability use-cases and has a 10,000 task limit. We may miss some pending tasks.

With observability apis, we may even turn it off for some clusters when the cluster is too large and its impacting the main workload.

I recommend checking with @rickyyx or @rkooo567 if there's a more reliable API to use here. It might actually be better for core to track the pending task liveness in GCS.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if there is a limit this is a no-go. We need this to be reliable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the definition of "last active time of the driver"?

If there's task currently running/pending, the last active time is now right?
If there's no tasks running, then the last active time is the latest time of finished/errored task.

I feel a seperate endpoint to GCS directly that asks for this should be better. Or we also have the idle_time information of each node in GCS now, although it doesn't have an API exposed yet I believe. The idle time information is derived from worker and resource usage of each node, which I believe is used by the autoscaler idle detection, will this be a better source of truth? Or am I understanding the requirements wrong.

@sofianhnaide sofianhnaide requested a review from rickyyx May 15, 2024 18:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants