[core] Raylet should wait for RuntimeEnvAgent to start before receiving tasks. #45353

Open
rynewang opened this issue May 15, 2024 · 8 comments · May be fixed by #45513
Labels
bug Something that is supposed to be working; but isn't core Issues that should be addressed in Ray Core core-runtime-env Issues related to Ray environment dependencies P0 Issue that must be fixed in short order

Comments

@rynewang
Contributor

What happened + What you expected to happen

During node startup:

  1. raylet starts
  2. raylet spawns RuntimeEnvAgent
  3. raylet spawns workers
  4. raylet starts receiving tasks and assigning them to workers

Here we have a race condition: (2) should happen-before (4), but we don't have any code that does the waiting. If tasks are received and started before the runtime env agent is ready to receive requests, we can get runtime env setup failures.
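To illustrate the ordering requirement, here is a minimal toy sketch in Python. It is not Ray code (the real raylet is written in C++); `ToyNode`, `start_agent`, and `submit_task` are hypothetical names, and a `threading.Event` stands in for whatever readiness signal the real fix uses. It only shows the shape of "step (4) blocks until step (2) has completed".

```python
import threading
import time


class ToyNode:
    """Toy stand-in for a node: tasks are held until the agent reports ready."""

    def __init__(self):
        self.agent_ready = threading.Event()
        self.log = []

    def start_agent(self, startup_delay_s):
        """Simulate step (2): the agent needs some time before it can serve."""
        def run():
            time.sleep(startup_delay_s)  # simulated slow agent startup
            self.log.append("agent ready")
            self.agent_ready.set()

        threading.Thread(target=run, daemon=True).start()

    def submit_task(self, name, timeout_s=5.0):
        """Simulate step (4): block until the agent is ready, then run."""
        if not self.agent_ready.wait(timeout=timeout_s):
            raise TimeoutError("agent did not become ready in time")
        self.log.append(f"run {name}")


node = ToyNode()
node.start_agent(startup_delay_s=0.2)
node.submit_task("task-1")  # submitted before the agent is ready
print(node.log)  # prints ['agent ready', 'run task-1']
```

Without the `wait` call, the task would be logged before the agent is ready, which is the failure mode described above.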

Note that waiting for "process started" is not enough; readiness needs to be checked through an HTTP probe. This means we need:

  1. an extra endpoint in the runtime env agent, GET /ping, which can return any body.
  2. raylet polls every 1s until it gets a 200 OK.
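The two pieces above could look roughly like the following sketch, which uses only the Python standard library. This is an illustration under assumptions, not the actual patch: the real raylet is C++, and `PingHandler`, `wait_until_ready`, the "pong" body, and the timeout parameter are all hypothetical. The issue only asks for a `GET /ping` endpoint and a 1s poll until a 200 OK.

```python
import threading
import time
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class PingHandler(BaseHTTPRequestHandler):
    """Hypothetical agent endpoint: GET /ping answers 200 with any body."""

    def do_GET(self):
        if self.path == "/ping":
            body = b"pong"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def wait_until_ready(url, interval_s=1.0, timeout_s=30.0):
    """Poll `url` every `interval_s` seconds until it returns 200 OK.

    Returns True once a 200 is seen, False if `timeout_s` elapses first.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval_s) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not listening yet, or an error response: retry
        time.sleep(interval_s)
    return False


# Demo: bind to a free local port, serve in the background, then probe.
server = HTTPServer(("127.0.0.1", 0), PingHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

ready = wait_until_ready(
    f"http://127.0.0.1:{port}/ping", interval_s=0.1, timeout_s=5.0
)
print(ready)
server.shutdown()
```

The caller would only start accepting tasks after `wait_until_ready` returns True. What to do on timeout (fail node startup, keep retrying, and so on) is a design choice the issue does not specify.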

Versions / Dependencies

master

Reproduction script

N/A

Issue Severity

Medium: It is a significant difficulty but I can work around it.

@rynewang added the bug and triage labels May 15, 2024
@rynewang
Contributor Author

cc @hongchaodeng @jjyao

@anyscalesam added the core label May 20, 2024
@rynewang self-assigned this May 20, 2024
@rynewang added the P0 label and removed the triage label May 20, 2024
@yx367563

@rynewang Hi, could you say in which Ray version this problem started? We recently wanted to upgrade the Ray version used on our system (currently 2.9.0), but found some new problems with Ray 2.20.0 reported in GitHub issues (autoscaler instability, RuntimeEnvSetupError, etc.). So I'm looking for a slightly more stable version to upgrade to. Thank you!

@jjyao added the core-runtime-env label May 21, 2024
@jjyao
Contributor

jjyao commented May 21, 2024

@yx367563 could you elaborate more on the issues you are facing? Have you created GitHub issues for them?

@DmitriGekhtman
Contributor

P0 label reflects it already -- but the issue is quite severe; it causes intermittent failures of any production job in which workers access a runtime environment.

@rynewang
Contributor Author

can reproduce - with a slow rt env agent + task pressure

@DmitriGekhtman
Contributor

DmitriGekhtman commented May 21, 2024

Nice!
"task pressure" certainly describes our use-case
(we schedule many tasks per workload :) )

@yx367563

> @yx367563 could you elaborate more on the issues you are facing. Have you created github issues for them?

@jjyao @rynewang The problem we are facing is the one in this issue (#45311). What I mean is: do you know in which Ray version this problem was introduced? I will avoid upgrading to that version for now.

@rynewang
Contributor Author

@yx367563 we don't have an exact "culprit" commit. You can stay on your current version until this issue is resolved and then move to the next release, or do some stress testing with your workload to confirm.


5 participants