Is your feature request related to a problem? Please describe. #2679
As previously mentioned in this issue, the existing launching mechanism requires password-less SSH. We would rather avoid this at Kensho Technologies, because our current multi-node training framework uses a launching mechanism similar to torchrun.
Instead of having a launcher node SSH the command to the workers, torchrun is run on every node and is given a master address/port and that node's rank. By bypassing SSH and invoking deepspeed directly on each node, as torchrun does, we can integrate DeepSpeed into our existing setup instead of maintaining two different launching topologies.
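To make the torchrun-style topology concrete, here is a minimal sketch of how each node could derive the environment for its local worker processes from only the master address/port and its own node rank. The function name `worker_env` and its parameters are hypothetical and are not part of DeepSpeed or torchrun; the environment variable names are the ones `torch.distributed` conventionally reads.

```python
from typing import Dict, List


def worker_env(node_rank: int, nnodes: int, nproc_per_node: int,
               master_addr: str, master_port: int) -> List[Dict[str, str]]:
    """Build one environment dict per local process on this node.

    Hypothetical helper for illustration only: every node calls it with the
    same master address/port and its own node_rank, so no node needs to
    reach the others over SSH.
    """
    if not 0 <= node_rank < nnodes:
        raise ValueError(f"node_rank must be in [0, {nnodes}), got {node_rank}")

    world_size = nnodes * nproc_per_node
    envs = []
    for local_rank in range(nproc_per_node):
        envs.append({
            "MASTER_ADDR": master_addr,
            "MASTER_PORT": str(master_port),
            "WORLD_SIZE": str(world_size),
            # Global rank: processes on earlier nodes are numbered first.
            "RANK": str(node_rank * nproc_per_node + local_rank),
            "LOCAL_RANK": str(local_rank),
        })
    return envs


if __name__ == "__main__":
    # Example: the second of two nodes, with 4 processes per node.
    for env in worker_env(node_rank=1, nnodes=2, nproc_per_node=4,
                          master_addr="10.0.0.1", master_port=29500):
        print(env["RANK"], env["LOCAL_RANK"], env["WORLD_SIZE"])
```

Because every value is computed locally, the same command can be started on each node by whatever scheduler is already in place, differing only in the node rank.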
Describe the solution you'd like
In a private fork of DeepSpeed, we were able to get training working without using SSH. To do this, we added a flag to the launcher-runner called --no_ssh, which in turn requires a --node_rank flag to be provided.
Then, in the runner, the command is run as if multi_node_exec is disabled. We have verified that this method works.
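The private fork is not shown here, so the following is only a sketch of how the two flags might be parsed and validated, assuming an argparse-based runner. The function names `parse_launcher_args` and `use_ssh_fanout`, the `-1` default, and the help strings are hypothetical; only the flag names `--no_ssh` and `--node_rank` come from the description above.

```python
import argparse
from typing import List, Optional


def parse_launcher_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the two proposed flags (illustrative, not DeepSpeed's parser)."""
    parser = argparse.ArgumentParser(prog="launcher-sketch")
    parser.add_argument("--no_ssh", action="store_true",
                        help="Run the command locally on this node instead "
                             "of sending it to the workers over SSH.")
    parser.add_argument("--node_rank", type=int, default=-1,
                        help="Rank of this node; required with --no_ssh.")
    args = parser.parse_args(argv)

    # Without SSH the launcher cannot assign ranks itself, so each node
    # must be told which rank it holds.
    if args.no_ssh and args.node_rank < 0:
        parser.error("--no_ssh requires --node_rank to be set")
    return args


def use_ssh_fanout(args: argparse.Namespace, num_nodes: int) -> bool:
    """Decide whether to take the multi-node SSH path.

    With --no_ssh the runner behaves as in the single-node case and simply
    executes the command on the local machine.
    """
    return num_nodes > 1 and not args.no_ssh


if __name__ == "__main__":
    args = parse_launcher_args(["--no_ssh", "--node_rank", "1"])
    print(args.no_ssh, args.node_rank, use_ssh_fanout(args, num_nodes=2))
```

Validating the dependency between the flags at parse time means a missing node rank fails immediately on that node rather than surfacing later as a rendezvous hang.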
Describe alternatives you've considered
As mentioned, we considered setting up two topologies based on the framework used. For example, GPT-NeoX uses the deepspeed launcher, so it would need the SSH setup. MosaicML's llm-foundry, however, works by independently running the command on each worker (similar to torchrun). We didn't want to maintain two architectures depending on which framework was being used for training.
Additional context
If deemed useful by the project maintainers, we can make a PR, with S&P Global/Kensho Technologies as the contributing entity.
Do I need to be given permissions? I'm trying to push my local branch dogacancolak/no-ssh-launcher
$ git push
ERROR: Permission to microsoft/DeepSpeed.git denied to dogacancolak-kensho.
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.