
Torch 2.3 breaks DDP? #2527

Open
TParcollet opened this issue Apr 26, 2024 · 3 comments
Labels: bug, help wanted, important

TParcollet (Collaborator) commented Apr 26, 2024

Describe the bug

Starting any recipe with two or more processes using DDP will fail, because the run_on_main function does not hold back (no barrier?) the other processes.

From the PyTorch 2.3 documentation, we can read: "ProcessGroupNCCL now relies on stream synchronization instead of device synchronization to block the CPU. Thus, please do not assume that barrier() would perform a device synchronization." This is scary.
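
For context, here is a minimal sketch of the behaviour change described in that note (assuming a NCCL process group has already been initialised, e.g. via torchrun):

    import torch
    import torch.distributed as dist

    dist.barrier()            # still blocks the CPU, but now by synchronizing the
                              # NCCL stream rather than the whole device
    torch.cuda.synchronize()  # an explicit device synchronization, if one is
                              # actually required after the barrier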

Expected behaviour

Well, the barrier should hold back the other processes, as sketched below.
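
For reference, a rough sketch of the pattern in question (a hypothetical simplification of run_on_main, not the actual SpeechBrain implementation):

    import torch.distributed as dist

    def run_on_main(func, kwargs=None):
        """Run func only on rank 0, then make every rank wait at a barrier."""
        kwargs = kwargs or {}
        if dist.get_rank() == 0:
            func(**kwargs)  # only the main process does the work
        dist.barrier()      # every rank is expected to wait here until rank 0 is done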

To Reproduce

No response

Environment Details

No response

Relevant Log Output

No response

Additional Context

No response

Adel-Moumen (Collaborator) commented:

I'm unable to reproduce your issue.

I tried in a two-node setting with the following example:

    import os
    import time

    import speechbrain as sb

    # Assumes the DDP process group has already been initialised
    # (e.g. launched with torchrun and sb.utils.distributed.ddp_init_group).
    if os.environ["RANK"] == "0":
        print("Rank 0 is sleeping")
        time.sleep(5)  # wait for 5 s before reaching the barrier
        print("Rank 0 finished sleeping")
        print("Rank 0 is ready for barrier")
        sb.utils.distributed.ddp_barrier()
        print("Rank 0 is done")
    else:
        print(f"Rank {os.environ['RANK']} is ready for barrier")
        sb.utils.distributed.ddp_barrier()
        print(f"Rank {os.environ['RANK']} is done")

And got the following output:

    Rank 1 is ready for barrier
    Rank 0 is sleeping
    Rank 0 finished sleeping
    Rank 0 is ready for barrier
    Rank 0 is done
    Rank 1 is done
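
(For reference, a single-node version of a snippet like this would typically be launched with something like torchrun --nproc_per_node=2 test_barrier.py, where test_barrier.py is a hypothetical file name; the exact two-node launch command used here is not shown.)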

I also tried with a recipe:

    # Snippet placed in a LibriSpeech recipe training script, where hparams,
    # run_on_main, and sb are assumed to be already defined/imported.
    from librispeech_prepare import prepare_librispeech  # noqa

    print("BEFORE")
    print("RANK", os.environ["RANK"])

    # multi-gpu (ddp) save data preparation
    run_on_main(
        prepare_librispeech,
        kwargs={
            "data_folder": hparams["data_folder"],
            "tr_splits": hparams["train_splits"],
            "dev_splits": hparams["dev_splits"],
            "te_splits": hparams["test_splits"],
            "save_folder": hparams["output_folder"],
            "merge_lst": hparams["train_splits"],
            "merge_name": "train.csv",
            "skip_prep": hparams["skip_prep"],
        },
    )
    print("AFTER")
    print("RANK", os.environ["RANK"])

    sb.utils.distributed.ddp_barrier() 

    if os.environ['RANK'] == '0':
        print("*" * 10)
    print("BEFORE")
    print("RANK", os.environ["RANK"])

    # multi-gpu (ddp) save data preparation
    run_on_main(
        prepare_librispeech,
        kwargs={
            "data_folder": hparams["data_folder"],
            "tr_splits": hparams["train_splits"],
            "dev_splits": hparams["dev_splits"],
            "te_splits": hparams["test_splits"],
            "save_folder": hparams["output_folder"],
            "merge_lst": hparams["train_splits"],
            "merge_name": "train.csv",
            "skip_prep": hparams["skip_prep"],
        },
    )
    print("AFTER")
    print("RANK", os.environ["RANK"])

And got:

    BEFORE
    RANK 0
    BEFORE
    RANK 1
    librispeech_prepare - Skipping preparation, completed in previous run.
    AFTER
    RANK 0
    AFTER
    RANK 1
    **********
    BEFORE
    RANK 0
    BEFORE
    RANK 1
    librispeech_prepare - Skipping preparation, completed in previous run.
    AFTER
    RANK 0
    AFTER
    RANK 1

I also tried putting multiple barriers and also got the expected results.

TParcollet (Collaborator, Author) commented:

Mystery / 20, I'll investigate once done with my current duties at SAIC.

asumagic (Collaborator) commented:

Can't repro on Jean Zay with 4x A100 on DDP either, with PyTorch 2.3.0.
