Significant performance degradation with multi-GPU training on newer torch/transformers #30840
Comments
Accelerate isn't the issue. Timings on my 2x4090 across accelerate versions (0.x) were all in the same ballpark.
So this issue might involve the Trainer, although I didn't actually see any relevant changes there. In a last-ditch effort:
Now we are seeing issues from transformers instead. Narrowing it down further (assuming same
So the issue stems from transformers 4.32.1 + torch 2.0.1.
I'm not sure it's worth us fixing, since updating your torch version will solve this problem.
Is there a specific use case for needing torch 2.0.1 where you can't use a later version?
Also: one thing I found that could affect it by about an hour was the temperature my GPU was at. If it was cool / a cold start, it could be an hour slower. There are lots of variables at play here, and I'm not sure exactly what is causing your issue, even after looking thoroughly.
@muellerzr Thanks a lot for checking. All your tests seem to be in the same ballpark, so I don't think this really reproduces the issue. Also note that the performance seems to degrade further with a larger number of GPUs, so 2x4090 may not be enough to reproduce it. I can run some more tests on my end if you have suggestions.

Regarding the torch version: unfortunately, the problem (as I have described above) is that recent torch/transformers versions are actually the ones that are slow, so I cannot just upgrade them to fix the problem. In fact, I actually upgraded the libraries when I noticed the problem.

Regarding temperature: I don't think that's the issue. I have tested this on multiple machines and multiple times, and switching the env changes the runtime significantly, so I doubt temperature is to blame. Also, the issue here is not ~1 hr of worse performance but a factor of ~2 in many cases (~6 hrs vs ~12 hrs, or 66 hrs vs 82 hrs as in my example above).
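A quick way to compare the two setups more precisely than the progress-bar ETA is to time the optimization steps directly. The sketch below is not from this thread; it is a minimal example meant for the test script in this issue, and the StepTimerCallback name and log_every default are made up for illustration.

import time

from transformers import TrainerCallback


class StepTimerCallback(TrainerCallback):
    """Logs the average wall-clock time per optimization step on the main process."""

    def __init__(self, log_every: int = 50):
        self.log_every = log_every
        self._start = None
        self._durations = []

    def on_step_begin(self, args, state, control, **kwargs):
        self._start = time.perf_counter()

    def on_step_end(self, args, state, control, **kwargs):
        self._durations.append(time.perf_counter() - self._start)
        if state.global_step % self.log_every == 0 and state.is_world_process_zero:
            avg = sum(self._durations) / len(self._durations)
            print(f"step {state.global_step}: {avg:.3f}s/step on average")
            self._durations.clear()

Passing it via Trainer(..., callbacks=[StepTimerCallback()]) in both environments gives a per-step number that can be compared directly across torch/transformers versions.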
I'll see if I can get access to an 8-node system to debug.
BUT that would mean we're hitting a ton of unnecessary distributed communications somewhere along the line (since it was working before).
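One way to test the unnecessary-communication hypothesis would be to profile a few training steps and see how much CUDA time lands in NCCL/collective kernels versus compute. This is only a sketch, not something run in this thread; run_one_step is a placeholder for whatever executes a single forward/backward/optimizer step.

from torch.profiler import ProfilerActivity, profile


def profile_training_steps(run_one_step, num_steps: int = 10):
    # Profile a handful of steps; in a communication-bound run, NCCL kernels
    # (e.g. all-reduce) dominate the top of the sorted table.
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        for _ in range(num_steps):
            run_one_step()
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))

Comparing that table between the fast and slow environments would show whether the extra time is spent in communication kernels or elsewhere.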
I ran some tests again (all done in fresh envs):
It looks like recent transformers/accelerate versions are only slightly worse when used with
Let me dig today and see if we have any torch 2.2+ import checks that could differ.
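For context, an "import check" here means a torch-version gate that selects a code path at import time, so different torch versions can silently take different branches. The snippet below is only a generic illustration of such a gate, not actual transformers code.

import torch
from packaging import version

# Generic version gate: behavior differs depending on the installed torch.
IS_TORCH_2_2_PLUS = version.parse(torch.__version__) >= version.parse("2.2.0")

if IS_TORCH_2_2_PLUS:
    ...  # newer code path
else:
    ...  # fallback for older torch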
Is there any update on this, @muellerzr?
I thought maybe this has something to do with iterable vs. map-style datasets, so I did the following test, but it's the same story.

from typing import Iterator

import torch
from transformers import T5ForConditionalGeneration, Trainer, TrainingArguments
from torch.utils.data import IterableDataset, Dataset


class IterableDummyDataset(IterableDataset):
    def __iter__(self) -> Iterator:
        while True:
            yield {
                "input_ids": torch.randint(4000, size=(512,)),
                "labels": torch.randint(4000, size=(64,)),
            }


class MapDummyDataset(Dataset):
    def __len__(self):
        return 1000

    def __getitem__(self, i):
        return {
            "input_ids": torch.randint(4000, size=(512,)),
            "labels": torch.randint(4000, size=(64,)),
        }


if __name__ == "__main__":
    model = T5ForConditionalGeneration.from_pretrained("google/t5-efficient-small")
    dataset = MapDummyDataset()
    training_args = TrainingArguments(
        output_dir="./output/",
        max_steps=1_000_000,
        per_device_train_batch_size=16,
    )
    trainer = Trainer(model=model, train_dataset=dataset, args=training_args)
    trainer.train()
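For the multi-GPU comparison, this script can be launched the same way as in the reproduction steps below, e.g. `torchrun --nproc-per-node=4 test.py` in each environment.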
This really looks like a torch issue. I have opened an issue here: pytorch/pytorch#127077
Thanks for finding and reproducing in torch! Will keep a close eye on it 🤗
*Description of changes:* This PR relaxes `torch` and `transformers` versions to allow for older versions that were used during original training. This is needed in light of recent `torch`/`transformers` versions being slower with DDP. Relevant issues (but the problem may be deeper than these):

- huggingface/transformers#30840
- pytorch/pytorch#127077
- NVIDIA/nccl#1298

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Co-authored-by: Abdul Fatir Ansari <ansarnd@amazon.com>
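For context, "relaxing" the versions means widening the allowed ranges in the package's dependency metadata rather than pinning to the latest releases. The fragment below is a hypothetical illustration only; the package name, file, and exact bounds are made up and not taken from this PR.

from setuptools import setup

setup(
    name="example-forecasting-package",  # hypothetical package
    install_requires=[
        # Widened lower bounds so the older, faster-with-DDP versions remain installable.
        "torch>=2.0",
        "transformers>=4.30",
        "accelerate>=0.20",
    ],
)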
System Info
Information
Tasks
- `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)

Reproduction
I am using a `g5.12xlarge` EC2 instance for this test, but I observed this issue on other machines as well. This is just a minimal example to demonstrate the issue; in my actual usage, the degradation is even worse.

1. Create `env1` and install: `pip install transformers torch accelerate`.
2. Create `env2` and install: `pip install transformers==4.30.2 torch==2.0.1 accelerate==0.20.3`.
3. Run `torchrun --nproc-per-node=4 test.py` in each environment.
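When comparing the two environments, it can help to confirm which versions (including the NCCL build that torch links against) each env actually resolved. The check below is not part of the original report, just a convenience snippet to run in both envs.

import accelerate
import torch
import transformers

print("torch        :", torch.__version__)
print("transformers :", transformers.__version__)
print("accelerate   :", accelerate.__version__)
print("CUDA         :", torch.version.cuda)
print("NCCL         :", torch.cuda.nccl.version())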
Observations

- In `env1`, GPU0 utilization keeps fluctuating and the estimated training time is shown as ~82 hrs.
- In `env2`, all GPUs have utilization maxed out and the estimated training time is shown as ~66 hrs.

Expected behavior
Both environments should have similar training time.