Significant performance degradation with multi-GPU training on newer torch/transformers #30840

abdulfatir opened this issue May 15, 2024 · 13 comments

abdulfatir commented May 15, 2024

System Info

# Env 1
- `Accelerate` version: 0.30.1
- Platform: Linux-5.15.0-1058-aws-x86_64-with-glibc2.31
- `accelerate` bash location: /home/ubuntu/miniconda3/envs/train/bin/accelerate
- Python version: 3.10.14
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.3.0+cu121 (True)
- Transformers version: 4.40.2
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 186.70 GB
- GPU type: NVIDIA A10G
- `Accelerate` default config:
        Not found

# Env 2
- `Accelerate` version: 0.20.3
- Platform: Linux-5.15.0-1058-aws-x86_64-with-glibc2.31
- Python version: 3.10.14
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.0.1+cu117 (True)
- Transformers version: 4.30.2
- PyTorch XPU available: False
- System RAM: 186.70 GB
- GPU type: NVIDIA A10G
- `Accelerate` default config:
        Not found

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

I am using a g5.12xlarge EC2 instance for this test, but I have observed this issue on other machines as well. This is just a minimal example to demonstrate the issue; in my actual usage, the degradation is even worse.

  1. Create env1 and install: pip install transformers torch accelerate.
  2. Create env2 and install: pip install transformers==4.30.2 torch==2.0.1 accelerate==0.20.3.
  3. Run the following script using torchrun --nproc-per-node=4 test.py.
from typing import Iterator
import torch
from transformers import T5ForConditionalGeneration, Trainer, TrainingArguments
from torch.utils.data import IterableDataset


class DummyDataset(IterableDataset):
    def __iter__(self) -> Iterator:
        while True:
            yield {
                "input_ids": torch.randint(4000, size=(512,)),
                "labels": torch.randint(4000, size=(64,)),
            }


if __name__ == "__main__":
    model = T5ForConditionalGeneration.from_pretrained("google/t5-efficient-small")
    dataset = DummyDataset()
    training_args = TrainingArguments(
        output_dir="./output/",
        max_steps=1_000_000,
        per_device_train_batch_size=16,
    )
    trainer = Trainer(model=model, train_dataset=dataset, args=training_args)
    trainer.train()

Observations

  1. On env1, GPU0 utilization keeps fluctuating and the estimated training time is shown as ~82hrs.
  2. On env2, all GPUs stay at maximum utilization and the estimated training time is shown as ~66hrs.
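
For anyone trying to quantify the gap beyond the progress-bar ETA, a minimal per-step timing callback could help (a hypothetical helper, not part of the original report; pass it to the Trainer via callbacks=[StepTimerCallback()]):

import time

from transformers import TrainerCallback


class StepTimerCallback(TrainerCallback):
    # Hypothetical helper: print the wall-clock duration of every 50th optimizer step on rank 0.
    def on_step_begin(self, args, state, control, **kwargs):
        self._step_start = time.perf_counter()

    def on_step_end(self, args, state, control, **kwargs):
        elapsed = time.perf_counter() - self._step_start
        if state.is_world_process_zero and state.global_step % 50 == 0:
            print(f"step {state.global_step}: {elapsed:.3f}s")

Comparing per-step times between env1 and env2 gives a more stable signal than the fluctuating ETA.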

Expected behavior

Both environments should have similar training times.


muellerzr commented May 15, 2024

Accelerate isn't the issue.

Timings based on my 2x4090:

Assume the 0.x entries below are accelerate versions.
On transformers v4.30.2:

  • 0.30.1: ~28.5hrs
  • 0.29.3: ~29hrs <- We fixed this
  • 0.28.0: ~28.5hrs
  • 0.21.0 (minimum for transformers): ~28.5hrs

On transformers 4.40.2:

  • 0.30.1: ~29.5hrs
  • 0.29.3: ~29.5hrs
  • 0.28.0: ~29.5hrs

On transformers 4.30.2:

  • 0.20.3: ~28.5hrs
  • 0.30.1: ~28.5hrs

So you can see that this issue might involve the Trainer; however, as you can tell, I didn't actually see any major changes here.

In a last-ditch effort:

  • torch==2.0.1, accelerate==0.30.1, transformers==4.30.2: 28.5hrs
  • torch==2.0.1, accelerate==0.20.3, transformers==4.30.2: 28.5hrs
  • torch==2.0.1, accelerate==0.30.1, transformers==4.40.2: 29.5hrs

Now we are seeing issues from transformers instead.

Narrowing it down further (assuming same torch and accelerate):

  • transformers==4.39.3: 29.5hrs
  • ...
  • transformers==4.34.1: 29.5hrs
  • ...
  • transformers==4.32.1: 29.5hrs
  • transformers==4.31.0: 28.5hrs

So the issue stems from transformers 4.32.1 + torch 2.0.1.
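
A hypothetical helper (not part of the original comment) that automates this kind of narrowing, assuming torch and accelerate are held fixed and max_steps in test.py has been lowered to a few hundred so each run finishes quickly:

import subprocess
import time

# Candidate transformers versions to time, from the fast 4.31.0 up to 4.40.2.
CANDIDATES = ["4.31.0", "4.32.1", "4.34.1", "4.39.3", "4.40.2"]

for version in CANDIDATES:
    # Reinstall only transformers, leaving torch and accelerate untouched.
    subprocess.run(["pip", "install", "--quiet", f"transformers=={version}"], check=True)
    start = time.perf_counter()
    subprocess.run(["torchrun", "--nproc-per-node=4", "test.py"], check=True)
    print(f"transformers=={version}: {time.perf_counter() - start:.1f}s")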

@muellerzr

I'm not sure it's worth us fixing, since updating your torch version will solve this problem.


muellerzr commented May 15, 2024

Is there a specific use case that requires torch 2.0.1, such that you can't use a later version?

@muellerzr muellerzr transferred this issue from huggingface/accelerate May 15, 2024
@muellerzr muellerzr changed the title Significant performance degradation on recent accelerate versions with multi-GPU training Significant performance degradation on recent transformers versions with multi-GPU training May 15, 2024
@muellerzr muellerzr changed the title Significant performance degradation on recent transformers versions with multi-GPU training Significant performance degradation transformers version 4.32.1 with multi-GPU training May 15, 2024
@muellerzr muellerzr changed the title Significant performance degradation transformers version 4.32.1 with multi-GPU training Significant performance degradation transformers version 4.32.1 with multi-GPU training and older torch version May 15, 2024
@muellerzr

Also: one thing I found that could affect the estimate by about an hour was the temperature my GPU was at. If it was cool / a cold start, it could be an hour slower. There are lots of variables at play here, and I'm unsure what exactly is causing your issue, even after looking thoroughly.


abdulfatir commented May 15, 2024

@muellerzr Thanks a lot for checking. All your tests seem to be in the same ballpark, so I don't think this really reproduces the issue. Also note that the performance seems to degrade more as the number of GPUs increases, so 2x 4090s may not be enough to reproduce it. I can run some more tests on my end if you have suggestions.

Regarding the torch version: unfortunately, the problem (as I have described above) is that the recent torch/transformers versions are the ones that are slow, so I cannot just upgrade them to fix the problem. In fact, it was after upgrading the libraries that I noticed the problem.

Regarding temperature: I don't think that's the issue. I have tested this on multiple machines and multiple times; switching the env changes the runtime significantly, so I doubt that temperature is to blame here. Also, the issue here is not ~1 hr of worse performance but a factor of 2 in many cases (~6hrs vs ~12hrs, or 66hrs vs 82hrs as in my example above).

@abdulfatir abdulfatir changed the title Significant performance degradation transformers version 4.32.1 with multi-GPU training and older torch version Significant performance degradation with multi-GPU training on newer torch/transformers May 15, 2024
@muellerzr

I’ll see if I can get access to an 8-node system to debug.

@muellerzr

BUT that would mean we’re hitting a ton of unnecessary distributed communications somewhere along the line (since it was working before).
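
One way to test that theory (a hypothetical sketch, not from this comment) is to profile a short run and see how much CUDA time lands in NCCL kernels; this reuses DummyDataset from the repro script and keeps max_steps small so the trace stays manageable:

from torch.profiler import ProfilerActivity, profile
from transformers import T5ForConditionalGeneration, Trainer, TrainingArguments

# Build the same setup as the repro script, but only run 20 steps under the profiler.
model = T5ForConditionalGeneration.from_pretrained("google/t5-efficient-small")
args = TrainingArguments(output_dir="./output/", max_steps=20, per_device_train_batch_size=16)
trainer = Trainer(model=model, train_dataset=DummyDataset(), args=args)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    trainer.train()

# ncclKernel / nccl:all_reduce rows show how much of each step is spent in collectives.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))

Setting NCCL_DEBUG=INFO in the environment is another low-effort way to compare NCCL initialization between the two stacks.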


abdulfatir commented May 16, 2024

I ran some tests again (all done in fresh envs):

  1. torch==2.0.1 transformers==4.30.2 accelerate==0.20.3: shows ~66hrs.
  2. torch==2.0.1 transformers==4.40.2 accelerate==0.30.1: shows ~67hrs.
  3. torch==2.1.2 transformers==4.40.2 accelerate==0.30.1: shows ~67hrs.
  4. torch==2.2.2 transformers==4.40.2 accelerate==0.30.1: shows ~82hrs.
  5. torch==2.3.0 transformers==4.40.2 accelerate==0.30.1: shows ~82hrs.

It looks like recent transformers/accelerate versions are only slightly worse when used with torch==2.0.1/torch==2.1.2, but significantly worse on torch==2.2.2 and later. Any idea what could be going on? Is this more of a torch issue?
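
For what it's worth, a quick sanity check (hypothetical, not part of the original comment) to confirm which builds are actually active in each env, since the bundled NCCL version also differs across these torch releases:

import torch
import transformers
import accelerate

# Print the versions that are actually active in the current environment.
print("torch        :", torch.__version__)
print("transformers :", transformers.__version__)
print("accelerate   :", accelerate.__version__)
print("NCCL         :", ".".join(str(v) for v in torch.cuda.nccl.version()))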

@muellerzr

Let me dig today and see if we have any torch 2.2+ import checks that could differ.

@abdulfatir

Do we have any update on this, @muellerzr?

@abdulfatir

I thought maybe this had something to do with iterable- vs. map-style datasets, so I ran the following test, but it's the same story.

from typing import Iterator
import torch
from transformers import T5ForConditionalGeneration, Trainer, TrainingArguments
from torch.utils.data import IterableDataset, Dataset


class IterableDummyDataset(IterableDataset):
    def __iter__(self) -> Iterator:
        while True:
            yield {
                "input_ids": torch.randint(4000, size=(512,)),
                "labels": torch.randint(4000, size=(64,)),
            }


class MapDummyDataset(Dataset):
    def __len__(self):
        return 1000

    def __getitem__(self, i):
        return {
            "input_ids": torch.randint(4000, size=(512,)),
            "labels": torch.randint(4000, size=(64,)),
        }


if __name__ == "__main__":
    model = T5ForConditionalGeneration.from_pretrained("google/t5-efficient-small")
    dataset = MapDummyDataset()
    training_args = TrainingArguments(
        output_dir="./output/",
        max_steps=1_000_000,
        per_device_train_batch_size=16,
    )
    trainer = Trainer(model=model, train_dataset=dataset, args=training_args)
    trainer.train()
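
To rule out the data pipeline entirely, here is a hypothetical isolation test (not part of the original comment) that iterates the DataLoader with no model or DDP involved, reusing the MapDummyDataset defined above:

import time

from torch.utils.data import DataLoader

# num_workers=0 matches the Trainer's default dataloader_num_workers.
loader = DataLoader(MapDummyDataset(), batch_size=16, num_workers=0)
start = time.perf_counter()
n_batches = sum(1 for _ in loader)  # 63 batches for the 1000-item dataset
print(f"{n_batches} batches in {time.perf_counter() - start:.2f}s")

If this is equally fast in both envs, the slowdown lies in the training/communication path rather than in data loading.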

@abdulfatir

This really looks like a torch issue. I have opened an issue here: pytorch/pytorch#127077

@muellerzr

Thanks for finding and reproducing in torch! Will keep a close eye on it 🤗

abdulfatir added a commit to amazon-science/chronos-forecasting that referenced this issue May 27, 2024
*Description of changes:* This PR relaxes the `torch` and `transformers`
version requirements to allow the older versions that were used during the
original training. This is needed in light of recent `torch`/`transformers`
versions being slower with DDP.

Relevant issues (but the problem may be deeper than these):

- huggingface/transformers#30840
- pytorch/pytorch#127077
- NVIDIA/nccl#1298


By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.

Co-authored-by: Abdul Fatir Ansari <ansarnd@amazon.com>