What happened + What you expected to happen
Collect a global counter of num_finished or num_failed tasks on the head node and export it as a metric.
The current distributed counter approach breaks when a node dies: that node's count of total finished or failed tasks gets wiped out.
We worked around this in the Grafana dashboard by applying max_over_time to each of these counts, but that can be very slow because the query scans the past 14 days of time series data.
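To make the failure mode concrete, here is a minimal sketch in plain Python. It does not use Ray's actual metrics code; the class names (PerNodeCounters, HeadNodeCounter), the node IDs, and the simulate helper are all hypothetical and exist only for illustration. It assumes the cluster total is computed by summing per-node counts, and compares that with a single counter kept on the head node.

```python
from collections import defaultdict


class PerNodeCounters:
    """Hypothetical model: each node keeps its own running total of finished tasks."""

    def __init__(self):
        self.finished = defaultdict(int)

    def record_finished(self, node_id):
        self.finished[node_id] += 1

    def node_died(self, node_id):
        # The dead node's counter disappears, taking its count with it.
        self.finished.pop(node_id, None)

    def total(self):
        # Cluster-wide total is the sum over nodes that are still alive.
        return sum(self.finished.values())


class HeadNodeCounter:
    """Hypothetical model: one global counter that outlives worker nodes."""

    def __init__(self):
        self.finished = 0

    def record_finished(self, node_id):
        self.finished += 1

    def node_died(self, node_id):
        # Nothing to drop: the count is not tied to the worker node.
        pass

    def total(self):
        return self.finished


def simulate(counter):
    """Finish 60 tasks on node-a and 40 on node-b, then kill node-b."""
    for _ in range(60):
        counter.record_finished("node-a")
    for _ in range(40):
        counter.record_finished("node-b")
    before = counter.total()
    counter.node_died("node-b")
    return before, counter.total()


per_node_totals = simulate(PerNodeCounters())
head_node_totals = simulate(HeadNodeCounter())
print(per_node_totals)   # (100, 60): the total drops when node-b dies
print(head_node_totals)  # (100, 100): the total is preserved
```

In this model the per-node total falls from 100 to 60 after the node dies, which is the drop that max_over_time papers over in the dashboard, while the head-node counter stays at 100 without any query-side correction.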
Versions / Dependencies
ray 2.21.0
Reproduction script
simple repro:
import ray

ray.init("auto")

@ray.remote
def foo():
    return "hi"

ray.get([foo.remote() for _ in range(100)])
Open the Grafana dashboard and go to the metrics page. Look at the tasks graph. If the number of tasks is very large and the cluster has been alive for a long time, this graph can be too slow to even load.
Issue Severity
Medium: It is a significant difficulty but I can work around it.
alanwguo added labels on May 14, 2024:
bug: Something that is supposed to be working; but isn't
triage: Needs triage (eg: priority, bug/not-bug, and owning component)
alanwguo changed the title on May 14, 2024:
from: [<Ray component: Core]
to: [Core] Change the source of the ray_tasks metric for finished or failed tasks to have a more accurate count.
rynewang changed labels on May 20, 2024:
added P1: Issue that should be fixed within a few weeks
removed triage: Needs triage (eg: priority, bug/not-bug, and owning component)