set_partition function in lava/util/slurm.py will not set partition if first listed board by sinfo has status down #754

Open
furlong-cmu opened this issue Jul 26, 2023 · 0 comments
Labels
0-needs-review For all new issues 1-bug Something isn't working


Describe the bug
When a partition is passed to use_slurm_host and that partition contains more than one board, the set_partition function (line 72 of lava/utils/slurm.py) raises a ValueError saying the partition is not found or is down whenever the first board(s) returned by sinfo for that partition have status down, even if other boards in the partition are idle.

To reproduce current behavior

After applying my own fix for the bug in #753, run:

from lava.utils import loihi

loihi.use_slurm_host(partition='partition-name', loihi_gen=loihi.ChipGeneration.N3B3)

I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[3], line 3
      1 from lava.utils import loihi
----> 3 loihi.use_slurm_host(partition='oheogluch', loihi_gen=loihi.ChipGeneration.N3B3)
      4 use_loihi2 = loihi.is_installed()
      6 # if use_loihi2:

File ~/lava_env/lib/python3.8/site-packages/lava/utils/loihi.py:57, in use_slurm_host(partition, board, loihi_gen)
     54 os.environ["LOIHI_GEN"] = loihi_gen.value
     56 slurm.set_board(board, partition)
---> 57 slurm.set_partition(partition)
     59 global host
     60 host = "SLURM"

File ~/lava_env/lib/python3.8/site-packages/lava/utils/slurm.py:89, in set_partition(partition)
     87 print(partition_info)
     88 if partition_info is None or "down" in partition_info.state:
---> 89     raise ValueError(
     90         f"Attempting to use SLURM for Loihi but partition {partition} "
     91         f"is not found or is down. Run sinfo to check available "
     92         f"partitions.")
     94 os.environ["PARTITION"] = partition

ValueError: Attempting to use SLURM for Loihi but partition oheogluch is not found or is down. Run sinfo to check available partitions.

Expected behavior
The expected behaviour is to update the os.environ['PARTITION'] variable to reflect the selected partition.

Environment (please complete the following information):

  • Device: Intel cloud
  • OS: Linux
  • Lava version 0.8.0

Additional Context
I temporarily fixed this by changing line 88 of lava/utils/slurm.py to ignore the "down" partition state.
Judging by the sinfo output, the error occurs when the first board listed for the partition has status "down", even though other boards in the same partition have status "idle".

set_board may have the same problem.
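
To illustrate the behaviour I would expect, here is a minimal sketch of a check that treats a partition as usable when at least one of its sinfo rows is not down. This is not the actual code in lava/utils/slurm.py. It assumes the default sinfo column layout (PARTITION AVAIL TIMELIMIT NODES STATE NODELIST), and the names SinfoRow, parse_sinfo, partition_is_usable, SAMPLE_SINFO, and the partition and board names are all hypothetical.

```python
from typing import List, NamedTuple


class SinfoRow(NamedTuple):
    """One row of sinfo output (hypothetical helper type)."""
    partition: str
    state: str
    nodelist: str


# Made-up sample in the default sinfo layout: one partition has a down
# board listed first, followed by idle boards.
SAMPLE_SINFO = """\
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
example_part up infinite 1 down board-01
example_part up infinite 2 idle board-[02-03]
other_part up infinite 1 down board-04
"""


def parse_sinfo(text: str) -> List[SinfoRow]:
    """Parse whitespace-separated sinfo output, skipping the header line."""
    rows = []
    for line in text.strip().splitlines()[1:]:
        fields = line.split()
        if len(fields) < 6:
            continue  # ignore malformed or truncated lines
        # sinfo marks the default partition with a trailing '*'
        rows.append(SinfoRow(fields[0].rstrip("*"), fields[4], fields[5]))
    return rows


def partition_is_usable(rows: List[SinfoRow], partition: str) -> bool:
    """True if any row for the partition has a state that is not down.

    Returns False when the partition does not appear at all.
    """
    matching = [r for r in rows if r.partition == partition]
    return any("down" not in r.state for r in matching)


rows = parse_sinfo(SAMPLE_SINFO)
print(partition_is_usable(rows, "example_part"))
print(partition_is_usable(rows, "other_part"))
```

The point is that the check looks at every row for the partition rather than only the first match, so a single down board does not hide the idle ones. The same idea could apply to set_board if it has the symmetric problem.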
