Torch 2.3 breaks DDP? #2527
I'm unable to reproduce your issue. I tried on a two-node setup with the following example:

```python
import os
import time

import speechbrain as sb

if os.environ["RANK"] == "0":
    # wait for 5 s before hitting the barrier
    print("Rank 0 is sleeping")
    time.sleep(5)
    print("Rank 0 finished sleeping")
    print("Rank 0 is ready for barrier")
    sb.utils.distributed.ddp_barrier()
    print("Rank 0 is done")
else:
    print(f"Rank {os.environ['RANK']} is ready for barrier")
    sb.utils.distributed.ddp_barrier()
    print(f"Rank {os.environ['RANK']} is done")
```

And got the following output:
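The semantics being checked here can be sketched with the standard library alone: a working barrier holds the fast ranks until the slow one arrives. Below is a minimal single-machine analogy using `multiprocessing.Barrier` in place of `ddp_barrier` (no `torch` or `speechbrain` involved), a sketch of the expected behaviour rather than the actual DDP code path:

```python
import multiprocessing as mp
import time

def worker(rank, barrier, queue, delay):
    # rank 0 is slow, standing in for a long data-preparation step
    if rank == 0:
        time.sleep(delay)
    barrier.wait()  # no rank passes this point until all ranks have arrived
    queue.put(time.monotonic())

def demo(n_ranks=3, delay=0.5):
    barrier = mp.Barrier(n_ranks)
    queue = mp.Queue()
    start = time.monotonic()
    procs = [
        mp.Process(target=worker, args=(rank, barrier, queue, delay))
        for rank in range(n_ranks)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    release_times = [queue.get() for _ in range(n_ranks)]
    return start, release_times

if __name__ == "__main__":
    start, release_times = demo()
    # even the fast ranks were held at the barrier until rank 0 woke up
    assert all(t - start >= 0.5 for t in release_times)
    print("all ranks released together")
```

If `barrier.wait()` behaved like the reported bug, the non-zero ranks would race past the barrier while rank 0 was still sleeping, and the assertion would fail.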
I also tried with a recipe:

```python
from librispeech_prepare import prepare_librispeech  # noqa

print("BEFORE")
print("RANK", os.environ["RANK"])
# multi-gpu (ddp) save data preparation
run_on_main(
    prepare_librispeech,
    kwargs={
        "data_folder": hparams["data_folder"],
        "tr_splits": hparams["train_splits"],
        "dev_splits": hparams["dev_splits"],
        "te_splits": hparams["test_splits"],
        "save_folder": hparams["output_folder"],
        "merge_lst": hparams["train_splits"],
        "merge_name": "train.csv",
        "skip_prep": hparams["skip_prep"],
    },
)
print("AFTER")
print("RANK", os.environ["RANK"])

sb.utils.distributed.ddp_barrier()
if os.environ["RANK"] == "0":
    print("*" * 10)

print("BEFORE")
print("RANK", os.environ["RANK"])
# multi-gpu (ddp) save data preparation, second pass after the barrier
run_on_main(
    prepare_librispeech,
    kwargs={
        "data_folder": hparams["data_folder"],
        "tr_splits": hparams["train_splits"],
        "dev_splits": hparams["dev_splits"],
        "te_splits": hparams["test_splits"],
        "save_folder": hparams["output_folder"],
        "merge_lst": hparams["train_splits"],
        "merge_name": "train.csv",
        "skip_prep": hparams["skip_prep"],
    },
)
print("AFTER")
print("RANK", os.environ["RANK"])
```

And got:
I also tried inserting multiple barriers and also got the expected results.

Mystery / 20, I'll investigate once done with my current duties at SAIC.

Can't repro on Jean Zay with 4x A100 on DDP either, with PyTorch 2.3.0.
Describe the bug
Starting any recipe with two or more processes under DDP fails, because the run_on_main function does not hold back (no barrier?) the other processes.
The PyTorch 2.3 release notes state: "ProcessGroupNCCL now relies on stream synchronization instead of device synchronization to block the CPU. Thus, please do not assume that barrier() would perform a device synchronization." This is scary.
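The distinction the release note draws can be sketched in plain Python: enqueuing work returns immediately (as NCCL now enqueues the barrier on a CUDA stream without blocking the host), and only an explicit wait blocks the CPU (the role `torch.cuda.synchronize()` would play). This is a stdlib analogy of that asynchrony, not the actual NCCL behaviour:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def device_work():
    time.sleep(0.2)  # stands in for a kernel running on the device
    return "done"

with ThreadPoolExecutor(max_workers=1) as pool:
    t0 = time.monotonic()
    future = pool.submit(device_work)  # enqueue: returns at once, CPU not blocked
    enqueue_elapsed = time.monotonic() - t0
    result = future.result()           # explicit wait: this is what blocks the CPU
    total_elapsed = time.monotonic() - t0

assert enqueue_elapsed < total_elapsed  # submitting did not wait; result() did
print(result)
```

If code assumed that `submit` (like the new `barrier()`) waited for the work itself, anything reading the result right after the enqueue would observe an unfinished state.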
Expected behaviour
The barrier should stop the other processes.
To Reproduce
No response
Environment Details
No response
Relevant Log Output
No response
Additional Context
No response