[Core] Allow nonexistent cloud in candidate resources and speed up optimization #3567

Merged: 11 commits, May 20, 2024

Collaborator

@Michaelvll commented May 20, 2024

Closes #3565

Allow nonexistent cloud

With at least one cloud enabled

With Kubernetes and IBM not enabled:

resources:
  accelerators: A100:8
  ordered:
    - cloud: kubernetes
    - cloud: aws
      use_spot: true
    - cloud: ibm

Master:

Task from YAML spec: task.yaml
I 05-20 07:20:28 optimizer.py:983] Using user-specified accelerators list (will be tried in the listed order): Kubernetes({'A100': 8}), AWS([Spot], {'A100': 8}), IBM({'A100': 8})
INFO:googleapiclient.discovery_cache:file_cache is only supported with oauth2client<4.0.0
sky.exceptions.ResourcesUnavailableError: Task requires Kubernetes which is not enabled: Task(run=<empty>)
  resources: Kubernetes({'A100': 8}).
To enable access, run sky check , or change the cloud requirement

This PR:

Task from YAML spec: task.yaml
W 05-20 07:41:11 optimizer.py:1189] Task requires IBM, Kubernetes which are not enabled. To enable access, change the task cloud requirement or run: sky check IBM Kubernetes
I 05-20 07:41:12 optimizer.py:696] == Optimizer ==
I 05-20 07:41:12 optimizer.py:707] Target: minimizing cost
I 05-20 07:41:12 optimizer.py:719] Estimated cost: $5.9 / hour
I 05-20 07:41:12 optimizer.py:719] 
I 05-20 07:41:12 optimizer.py:844] Considered resources (1 node):
I 05-20 07:41:12 optimizer.py:914] -------------------------------------------------------------------------------------------------
I 05-20 07:41:12 optimizer.py:914]  CLOUD   INSTANCE             vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
I 05-20 07:41:12 optimizer.py:914] -------------------------------------------------------------------------------------------------
I 05-20 07:41:12 optimizer.py:914]  AWS     p4d.24xlarge[Spot]   96      1152      A100:8         us-east-2b    5.89          ✔     
I 05-20 07:41:12 optimizer.py:914] -------------------------------------------------------------------------------------------------
I 05-20 07:41:12 optimizer.py:914] 
Launching a new cluster 'sky-1276-azureuser'. Proceed? [Y/n]:

For a DAG:

resources:
  accelerators: A100:8
  ordered:
    - cloud: kubernetes
    - cloud: aws
      use_spot: true
    - cloud: ibm


run: hi

---
resources:
  accelerators: h100:8
  ordered:
    - cloud: kubernetes
    - cloud: aws
      use_spot: true
    - cloud: gcp


run: hi

Output:
Task from YAML spec: task.yaml
Managed job 'sky-1be1-azureuser' will be launched on (estimated):
W 05-20 07:47:36 optimizer.py:1189] Task 'sky-1be1-azureuser-0' requires IBM, Kubernetes which are not enabled. To enable access, change the task cloud requirement or run: sky check IBM Kubernetes
W 05-20 07:47:36 optimizer.py:1189] Task 'sky-1be1-azureuser-1' requires Kubernetes which is not enabled. To enable access, change the task cloud requirement or run: sky check Kubernetes
I 05-20 07:47:36 optimizer.py:978] 1st task is using user-specified accelerators list (will be tried in the listed order): Kubernetes({'A100': 8}), AWS([Spot], {'A100': 8}), IBM({'A100': 8})
I 05-20 07:47:36 optimizer.py:978] 2nd task is using user-specified accelerators list (will be tried in the listed order): Kubernetes({'H100': 8}), AWS([Spot], {'H100': 8}), GCP({'H100': 8})
I 05-20 07:47:38 optimizer.py:696] == Optimizer ==
I 05-20 07:47:38 optimizer.py:707] Target: minimizing cost
I 05-20 07:47:38 optimizer.py:722] Estimated total runtime: 2.0 hours
I 05-20 07:47:38 optimizer.py:722] Estimated total cost: $37.3
I 05-20 07:47:38 optimizer.py:722] 
I 05-20 07:47:38 optimizer.py:820] Best plan: 
I 05-20 07:47:38 optimizer.py:825] -------------------------------------------------------------------------------------------------------------
I 05-20 07:47:38 optimizer.py:825]  TASK                   #NODES   CLOUD   INSTANCE             vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   
I 05-20 07:47:38 optimizer.py:825] -------------------------------------------------------------------------------------------------------------
I 05-20 07:47:38 optimizer.py:825]  sky-1be1-azureuser-0   1        AWS     p4d.24xlarge[Spot]   96      1152      A100:8         us-east-2b    
I 05-20 07:47:38 optimizer.py:825]  sky-1be1-azureuser-1   1        AWS     p5.48xlarge[Spot]    192     2048      H100:8         us-east-2b    
I 05-20 07:47:38 optimizer.py:825] -------------------------------------------------------------------------------------------------------------
I 05-20 07:47:38 optimizer.py:825] 
I 05-20 07:47:38 optimizer.py:844] Considered resources for task 'sky-1be1-azureuser-0' (1 node):
I 05-20 07:47:38 optimizer.py:914] -------------------------------------------------------------------------------------------------
I 05-20 07:47:38 optimizer.py:914]  CLOUD   INSTANCE             vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
I 05-20 07:47:38 optimizer.py:914] -------------------------------------------------------------------------------------------------
I 05-20 07:47:38 optimizer.py:914]  AWS     p4d.24xlarge[Spot]   96      1152      A100:8         us-east-2b    5.89          ✔     
I 05-20 07:47:38 optimizer.py:914] -------------------------------------------------------------------------------------------------
I 05-20 07:47:38 optimizer.py:914] 
I 05-20 07:47:38 optimizer.py:844] Considered resources for task 'sky-1be1-azureuser-1' (1 node):
I 05-20 07:47:38 optimizer.py:914] --------------------------------------------------------------------------------------------------
I 05-20 07:47:38 optimizer.py:914]  CLOUD   INSTANCE            vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE     COST ($)   CHOSEN   
I 05-20 07:47:38 optimizer.py:914] --------------------------------------------------------------------------------------------------
I 05-20 07:47:38 optimizer.py:914]  GCP     a3-highgpu-8g       208     1872      H100:8         us-central1-a   87.83               
I 05-20 07:47:38 optimizer.py:914]  AWS     p5.48xlarge[Spot]   192     2048      H100:8         us-east-2b      31.44         ✔     
I 05-20 07:47:38 optimizer.py:914] --------------------------------------------------------------------------------------------------
I 05-20 07:47:38 optimizer.py:914] 
Launching a managed job 'sky-1be1-azureuser'. Proceed? [Y/n]: 

With no cloud enabled

resources:
  accelerators: A100:8
  ordered:
    - cloud: kubernetes
    - cloud: IBM

Output:
Task from YAML spec: task.yaml
sky.exceptions.ResourcesUnavailableError: Task requires IBM, Kubernetes which are not enabled. To enable access, change the task cloud requirement or run: sky check IBM Kubernetes

For a DAG:

resources:
  accelerators: A100:8
  ordered:
    - cloud: kubernetes
    - cloud: ibm


run: hi

---
resources:
  accelerators: h100:8
  ordered:
    - cloud: kubernetes
    - cloud: aws
      use_spot: true
    - cloud: gcp


run: hi

$ sky jobs launch task.yaml
Task from YAML spec: task.yaml
Managed job 'sky-7f4b-azureuser' will be launched on (estimated):
sky.exceptions.ResourcesUnavailableError: Task 'sky-7f4b-azureuser-0' requires IBM, Kubernetes which are not enabled. To enable access, change the task cloud requirement or run: sky check IBM Kubernetes
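The behavior shown in the outputs above (warn and continue when at least one requested cloud is enabled, raise when none is) can be sketched roughly as follows. This is a minimal illustration, not the code in sky/optimizer.py: the function name `filter_enabled_candidates`, the dict-based candidate format, and the stand-in exception class are all assumptions made for the example.

```python
import warnings
from typing import Dict, List, Set


class ResourcesUnavailableError(Exception):
    """Stand-in for sky.exceptions.ResourcesUnavailableError (assumption)."""


def filter_enabled_candidates(candidates: List[Dict[str, object]],
                              enabled_clouds: Set[str]) -> List[Dict[str, object]]:
    """Keep candidates whose cloud is enabled, preserving the given order.

    Hypothetical helper: warns if some requested clouds are disabled,
    raises only if no candidate is left.
    """
    enabled = {cloud.lower() for cloud in enabled_clouds}
    kept = [c for c in candidates if str(c['cloud']).lower() in enabled]
    disabled = sorted({str(c['cloud']) for c in candidates
                       if str(c['cloud']).lower() not in enabled})
    if not disabled:
        return kept
    msg = 'Requested clouds are not enabled: ' + ', '.join(disabled)
    if not kept:
        # Nothing usable is left, so the launch cannot proceed.
        raise ResourcesUnavailableError(msg)
    # Some candidates remain: report the disabled clouds and keep going.
    warnings.warn(msg)
    return kept
```

With `kubernetes`, `aws` (spot), and `ibm` as candidates and only AWS enabled, the sketch keeps the AWS entry and emits a warning; with no enabled cloud it raises.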

Speed up optimization

resources:
  accelerators: A100:8
  ordered:
    - cloud: kubernetes
    - cloud: aws
      use_spot: true

time sky launch task.yaml
master: 12.4s
this PR: 1.99s

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • the scripts above, with both ordered and any_of
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@Michaelvll marked this pull request as draft May 20, 2024 07:28
@Michaelvll marked this pull request as ready for review May 20, 2024 07:53
Collaborator

@cblmemo left a comment

Thanks @Michaelvll for the fix! It looks mostly good to me. I left several nits and a discussion about running sky.check multiple times in a DAG pipeline ;)

sky/optimizer.py Outdated
is_or_are = 'is' if len(rechecked_but_disabled_clouds) == 1 else 'are'
task_name = f' {task.name!r}' if task.name is not None else ''
msg = (
    f'Task{task_name} requires '
Collaborator

Suggested change
- f'Task{task_name} requires '
+ f'Task{task_name} requires one of '

Collaborator Author

It is not actually "one of": the task requested all of them :)
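For reference, the message under discussion can be reconstructed from the excerpt above and the log lines in the PR description. The sketch below is an approximation under assumptions: `disabled_clouds_message` is a hypothetical name, and sorting the cloud names is inferred from the "IBM, Kubernetes" ordering in the logs rather than confirmed by the diff.

```python
from typing import Iterable, Optional


def disabled_clouds_message(disabled_clouds: Iterable[str],
                            task_name: Optional[str] = None) -> str:
    """Build the 'not enabled' message (hypothetical helper, for illustration)."""
    # Assumption: names are listed alphabetically, as they appear in the logs.
    clouds = sorted(disabled_clouds)
    is_or_are = 'is' if len(clouds) == 1 else 'are'
    name = f' {task_name!r}' if task_name is not None else ''
    return (f'Task{name} requires {", ".join(clouds)} which {is_or_are} '
            'not enabled. To enable access, change the task cloud '
            f'requirement or run: sky check {" ".join(clouds)}')
```

Because every listed cloud was requested by the task, the wording stays "requires" rather than "requires one of".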

Michaelvll and others added 4 commits May 20, 2024 09:47
Co-authored-by: Tian Xia <cblmemo@gmail.com>
…om:skypilot-org/skypilot into allow-unexist-cloud-in-candidate-resources
@Michaelvll
Collaborator Author

Thanks for the review @cblmemo! PTAL @romilbhardwaj @cblmemo.

Collaborator

@romilbhardwaj left a comment

Thanks @Michaelvll! Works nicely!

@@ -120,6 +120,8 @@ def optimize(dag: 'dag_lib.Dag',
for a task.
exceptions.NoCloudAccessError: if no public clouds are enabled.
"""
_check_specified_clouds(dag)
Collaborator

Nit: can we add a quick comment explaining why we call this method, and noting that it will run sky check on the clouds specified in resources?

Collaborator

Alternatively, a docstring for _check_specified_clouds would be useful.

Collaborator Author

Added a docstring for the function. Thanks!

Comment on lines +1173 to +1174
enabled_clouds = sky_check.get_cached_enabled_clouds_or_refresh(
raise_if_no_cloud_access=True)
Collaborator

This can be deferred to another PR: maybe we can update sky_check.check in the future to return lists of enabled and disabled clouds.

Collaborator Author

Good point! Filed an issue for this: #3570. I would prefer to leave it for the future, as the semantics of the return value when the clouds argument is passed should be discussed first.
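To make the idea in this thread concrete, here is one possible shape for a check that reports both enabled and disabled clouds. It is purely illustrative: the actual return semantics are left open in #3570, and `split_enabled_disabled`, its parameters, and the `probe` callback are hypothetical names that are not part of sky_check.

```python
from typing import Callable, Iterable, List, Set, Tuple


def split_enabled_disabled(
        requested: Iterable[str],
        cached_enabled: Set[str],
        probe: Callable[[str], bool]) -> Tuple[List[str], List[str]]:
    """Return (enabled, disabled) for the requested clouds.

    Hypothetical sketch: clouds already in the cache are trusted, and
    only the remaining ones are probed (the expensive credential check).
    """
    enabled: List[str] = []
    disabled: List[str] = []
    for cloud in sorted(set(requested)):
        # `or` short-circuits, so cached clouds never trigger a probe.
        if cloud in cached_enabled or probe(cloud):
            enabled.append(cloud)
        else:
            disabled.append(cloud)
    return enabled, disabled
```

A caller could then warn about the second list and continue with the first, without a separate pass to work out which requested clouds are still disabled.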

@Michaelvll merged commit d259ddc into master May 20, 2024
20 checks passed
@Michaelvll deleted the allow-unexist-cloud-in-candidate-resources branch May 20, 2024 21:02
Development

Successfully merging this pull request may close these issues.

Certain inputs to sky launch would trigger sky check, making it slow
3 participants