[core] Raylet should wait for RuntimeEnvAgent to start before receiving tasks. #45353
Comments
@rynewang Hi, could you say which Ray version this problem started in? We recently wanted to upgrade the Ray version used on our system (currently 2.9.0), but found some new problems with Ray 2.20.0 that have been reported in GitHub issues (autoscaler instability, RuntimeEnvSetupError, etc.). So I'm looking for a slightly more stable version to upgrade to. Thank you!
@yx367563 could you elaborate more on the issues you are facing? Have you created GitHub issues for them?
The P0 label reflects it already, but the issue is quite severe; it causes intermittent failures of any production job in which workers access a runtime environment.
Can reproduce, with a slow runtime env agent + task pressure.
Nice!
@yx367563 we don't have an accurate "culprit" commit. You can stay on your current version until this issue is resolved (and then move to the next version), or do some stress testing with your workload to confirm.
What happened + What you expected to happen
During node startup:
Here we have a race condition: (2) should happen before (4), but we don't have any code that does the waiting. If tasks are received and started before the runtime env agent is ready to receive requests, we can get runtime env setup failures.
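To illustrate the kind of wait that is missing, here is a minimal Python sketch of a readiness poll. It is not the raylet's code: the names `http_ping` and `wait_until_ready`, the timeouts, and the example address are all hypothetical, and it assumes the agent answers a plain HTTP GET on a ping path once it can serve requests.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ping(url: str, timeout_s: float = 1.0) -> bool:
    """Return True if GET <url> answers with a 2xx status; the body is ignored."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or HTTP error: treat as "not ready yet".
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    deadline_s: float = 30.0,
    interval_s: float = 0.1,
) -> bool:
    """Call probe() repeatedly until it returns True or the deadline passes.

    Returns True as soon as the probe succeeds, False on timeout.
    The probe is always tried at least once.
    """
    deadline = time.monotonic() + deadline_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Hypothetical usage (address and port are placeholders, not Ray defaults):
#
#   ready = wait_until_ready(lambda: http_ping("http://127.0.0.1:12345/ping"))
#   if not ready:
#       raise RuntimeError("runtime env agent did not become ready in time")
#   # ...only now start accepting tasks...
```

The point of the sketch is the ordering: task intake starts only after the probe has succeeded once, so "agent process spawned" is never mistaken for "agent ready to serve requests".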
Note that the waiting is not for "process started"; it needs to go through an HTTP probe. This means we need:
GET /ping
which returns whatever.
Versions / Dependencies
master
Reproduction script
N/A
Issue Severity
Medium: It is a significant difficulty but I can work around it.