
Torch 2.3 breaks DDP? #2527

Open
TParcollet opened this issue Apr 26, 2024 · 3 comments
Labels: bug, help wanted, important

TParcollet (Collaborator) commented Apr 26, 2024

Describe the bug

Starting any recipe with two or more processes using DDP will fail, because the run_on_main function does not hold back (no barrier?) the other processes.

From the PyTorch 2.3 documentation, we can read: "ProcessGroupNCCL now relies on stream synchronization instead of device synchronization to block the CPU. Thus, please do not assume that barrier() would perform a device synchronization." This is scary.
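
For context, here is a minimal sketch of the behaviour change described in that note (assuming a NCCL process group has already been initialised, e.g. via torchrun):

    import torch
    import torch.distributed as dist

    dist.barrier()            # still blocks the CPU, but now by synchronizing the
                              # NCCL stream rather than the whole device
    torch.cuda.synchronize()  # an explicit device synchronization, if one is
                              # actually required after the barrier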

Expected behaviour

Well, the barrier should hold back the other processes, as sketched below.
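
For reference, a rough sketch of the pattern in question (a hypothetical simplification of run_on_main, not the actual SpeechBrain implementation):

    import torch.distributed as dist

    def run_on_main(func, kwargs=None):
        """Run func only on rank 0, then make every rank wait at a barrier."""
        kwargs = kwargs or {}
        if dist.get_rank() == 0:
            func(**kwargs)  # only the main process does the work
        dist.barrier()      # every rank is expected to wait here until rank 0 is done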

To Reproduce

No response

Environment Details

No response

Relevant Log Output

No response

Additional Context

No response

Adel-Moumen (Collaborator) commented:

I'm unable to reproduce your issue.

I tried in a two-node setting with the following example:

    import os
    import time

    import speechbrain as sb

    # Assumes the DDP process group has already been initialised
    # (e.g. launched with torchrun and sb.utils.distributed.ddp_init_group).
    if os.environ["RANK"] == "0":
        print("Rank 0 is sleeping")
        time.sleep(5)  # wait for 5 s before reaching the barrier
        print("Rank 0 finished sleeping")
        print("Rank 0 is ready for barrier")
        sb.utils.distributed.ddp_barrier()
        print("Rank 0 is done")
    else:
        print(f"Rank {os.environ['RANK']} is ready for barrier")
        sb.utils.distributed.ddp_barrier()
        print(f"Rank {os.environ['RANK']} is done")

And got the following output:

    Rank 1 is ready for barrier
    Rank 0 is sleeping
    Rank 0 finished sleeping
    Rank 0 is ready for barrier
    Rank 0 is done
    Rank 1 is done
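
(For reference, a single-node version of a snippet like this would typically be launched with something like torchrun --nproc_per_node=2 test_barrier.py, where test_barrier.py is a hypothetical file name; the exact two-node launch command used here is not shown.)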

I also tried with a recipe:

    # Snippet placed in a LibriSpeech recipe training script, where hparams,
    # run_on_main, and sb are assumed to be already defined/imported.
    from librispeech_prepare import prepare_librispeech  # noqa

    print("BEFORE")
    print("RANK", os.environ["RANK"])

    # multi-gpu (ddp) save data preparation
    run_on_main(
        prepare_librispeech,
        kwargs={
            "data_folder": hparams["data_folder"],
            "tr_splits": hparams["train_splits"],
            "dev_splits": hparams["dev_splits"],
            "te_splits": hparams["test_splits"],
            "save_folder": hparams["output_folder"],
            "merge_lst": hparams["train_splits"],
            "merge_name": "train.csv",
            "skip_prep": hparams["skip_prep"],
        },
    )
    print("AFTER")
    print("RANK", os.environ["RANK"])

    sb.utils.distributed.ddp_barrier() 

    if os.environ['RANK'] == '0':
        print("*" * 10)
    print("BEFORE")
    print("RANK", os.environ["RANK"])

    # multi-gpu (ddp) save data preparation
    run_on_main(
        prepare_librispeech,
        kwargs={
            "data_folder": hparams["data_folder"],
            "tr_splits": hparams["train_splits"],
            "dev_splits": hparams["dev_splits"],
            "te_splits": hparams["test_splits"],
            "save_folder": hparams["output_folder"],
            "merge_lst": hparams["train_splits"],
            "merge_name": "train.csv",
            "skip_prep": hparams["skip_prep"],
        },
    )
    print("AFTER")
    print("RANK", os.environ["RANK"])

And got:

    BEFORE
    RANK 0
    BEFORE
    RANK 1
    librispeech_prepare - Skipping preparation, completed in previous run.
    AFTER
    RANK 0
    AFTER
    RANK 1
    **********
    BEFORE
    RANK 0
    BEFORE
    RANK 1
    librispeech_prepare - Skipping preparation, completed in previous run.
    AFTER
    RANK 0
    AFTER
    RANK 1

I also tried putting multiple barriers and also got the expected results.

TParcollet (Collaborator, Author) commented:

Mystery / 20, I'll investigate once done with my current duties at SAIC.

asumagic (Collaborator) commented:

Can't repro on Jean Zay with 4x A100 on DDP either, with PyTorch 2.3.0.
