
[BUG] ZeRO-3: Gather the params for inference (huggingface_language_model.generate) at the end of one epoch and re-partition them for next-epoch training #5539

Open
Coobiw opened this issue May 15, 2024 · 9 comments
Labels: bug, training

Comments

Coobiw commented May 15, 2024

Describe the bug
Hi, I use ZeRO-3 for MLLM training. After one epoch of training, I want to evaluate the model (using model.generate()). However, the model's parameters are partitioned across multiple GPUs and have not been gathered.

If the parameters are not gathered, an error is raised during evaluation (generation) in the forward pass, for example at:

image_embeds += self.pos_embed

RuntimeError: The size of tensor a (1152) must match the size of tensor b (0) at non-singleton dimension 2

How can I gather the parameters on every GPU for parallel evaluation (inference/generation), e.g. using deepspeed.zero.GatheredParameters? And after evaluation, how can I re-partition the model parameters for the next training epoch?

Thanks for your reply!


Coobiw added the bug and training labels on May 15, 2024
tjruwase (Contributor) commented

@Coobiw, you can use the GatheredParameters context manager, which automatically gathers the parameters within the context and releases them on exit. You can see a simple example of using it to compute a moving average of parameters here.
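
A minimal sketch of that pattern (illustrative, not the linked example; the engine/tokenizer/prompt names are placeholders): gather the ZeRO-3 partitioned parameters read-only inside the context, run generate(), and let DeepSpeed re-partition them automatically when the context exits. Note that gathering every parameter materializes the full model on each GPU, so this only works if the unpartitioned model fits in a single GPU's memory.

import deepspeed
import torch

def evaluate_with_gathered_params(ds_engine, tokenizer, prompts, device):
    # ds_engine.module is the underlying HuggingFace model (names are illustrative).
    model = ds_engine.module
    model.eval()
    # modifier_rank=None -> read-only gather on every rank; parameters are
    # re-partitioned automatically when the context exits.
    with deepspeed.zero.GatheredParameters(list(model.parameters()), modifier_rank=None):
        with torch.no_grad():
            inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(device)
            outputs = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)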

Coobiw (Author) commented May 17, 2024

Hi, I've tried this before, but the program hangs. How can I debug this?

Also, I'd like to know whether this could be because I'm using a 30B+ LLM and ZeRO-3 inference is simply very slow?

if self.zero_stage == 3:
    params_to_fetch = [
        p for p in self.model.parameters()
        if hasattr(p, 'ds_id') and p.ds_status == deepspeed.zero.partition_parameters.ZeroParamStatus.NOT_AVAILABLE
    ]
    should_gather_param = len(params_to_fetch) > 0
    with deepspeed.zero.GatheredParameters(params_to_fetch, enabled=should_gather_param):
        self.model.eval()
        evaluation()  # contains model.generate()

tjruwase (Contributor) commented

@Coobiw, can you share your full script to help us repro on our side?

Is this a dense or MoE model?

In terms of debugging, can you use prints to pinpoint where it hangs?

Also, can you try to repro on a single GPU so that you can use pdb for debugging? You can try two options for this (see the config sketch after this list):

  1. Enable CPU/NVMe offloading to fit the model, or
  2. Use a smaller model
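
For the offloading option, a hedged config sketch (these are standard DeepSpeed ZeRO-3 config keys; the values are placeholders to adapt):

import deepspeed

# ZeRO-3 with parameter/optimizer offload to CPU, to fit a large model on a
# single GPU for pdb-style debugging.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "bf16": {"enabled": True},
}
# engine, _, _, _ = deepspeed.initialize(model=model, config=ds_config,
#                                        model_parameters=model.parameters())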

Coobiw (Author) commented May 17, 2024

Sorry, it's inconvenient to share the whole code, but I'll try my best to provide more information. It is a dense model. I've also tried the script with a ~9B model on A100 80GB GPUs, and a similar hang appeared.

I think it may be a multi-GPU communication problem? There is no explicit error, only an NCCL-related warning during model.generate:

/root/miniconda3/lib/python3.9/site-packages/transformers/generation/configuration_utils.py:497: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
  warnings.warn(
t-20240517175036-k966t-worker-0:5136:5282 [7] ib_plugin.c:798 NCCL WARN NET/IB : req 0/1 tag 7 peer 172.25.40.117<36987> collective mismatch error, local size 897024 remote size 614400
t-20240517175036-k966t-worker-0:5136:5282 [7] NCCL INFO transport/net.cc:990 -> 5
t-20240517175036-k966t-worker-0:5136:5282 [7] NCCL INFO proxy.cc:679 -> 5
t-20240517175036-k966t-worker-0:5136:5282 [7] NCCL INFO proxy.cc:858 -> 5 [Proxy Thread]

I guess the collective mismatch error, local size 897024 remote size 614400 is what causes the hang.

Additionally, my environment is as follows:

deepspeed == 0.14.0
cuda: 11.8

The output of nvcc -V is:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

Coobiw (Author) commented May 17, 2024

After double-checking, I found another error message on one worker, as follows (probably a timeout error):

[E ProcessGroupNCCL.cpp:475] [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96383, OpType=_ALLGATHER_BASE, NumelIn=88200, NumelOut=5644800, Timeout(ms)=7200000) ran for 7200520 milliseconds before timing out.
t-20240517230118-grg2t-worker-1:5123:5271 [0] NCCL INFO [Service thread] Connection closed by localRank 7
t-20240517230118-grg2t-worker-1:5125:5269 [2] NCCL INFO [Service thread] Connection closed by localRank 7
t-20240517230118-grg2t-worker-1:5127:5272 [4] NCCL INFO [Service thread] Connection closed by localRank 7
t-20240517230118-grg2t-worker-1:5129:5270 [6] NCCL INFO [Service thread] Connection closed by localRank 7
t-20240517230118-grg2t-worker-1:5130:5275 [7] NCCL INFO [Service thread] Connection closed by localRank 7
t-20240517230118-grg2t-worker-1:5130:5206 [0] NCCL INFO comm 0x738ea950 rank 15 nranks 64 cudaDev 7 busId e4000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 15] NCCL watchdog thread terminated with exception: [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96383, OpType=_ALLGATHER_BASE, NumelIn=88200, NumelOut=5644800, Timeout(ms)=7200000) ran for 7200520 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'

Coobiw (Author) commented May 18, 2024

Hi, I also tested this on a single node (8 x A100) with a 9B model. The hang still appeared. TAT

tjruwase (Contributor) commented

Another cause of hanging like this is when the prompt length or generation length differs across GPUs, because ZeRO inference is a data-parallel algorithm.
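
If that turns out to be the cause, one hedged workaround sketch (the function name and parameters are illustrative, not from this thread): pad every rank's prompts to a common length and force a fixed number of generated tokens, so all data-parallel ranks issue the same sequence of ZeRO-3 all-gathers.

import torch
import torch.distributed as dist

def generate_in_lockstep(model, tokenizer, prompts, device, new_tokens=64):
    tokenizer.padding_side = "left"
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(device)

    # Agree on a common prompt length across all ranks.
    local_len = torch.tensor([inputs["input_ids"].shape[1]], device=device)
    dist.all_reduce(local_len, op=dist.ReduceOp.MAX)
    pad_to = int(local_len.item())
    inputs = tokenizer(prompts, return_tensors="pt", padding="max_length",
                       max_length=pad_to).to(device)

    # min_new_tokens == max_new_tokens keeps every rank decoding for the same
    # number of steps, even if some ranks hit EOS earlier.
    return model.generate(**inputs, min_new_tokens=new_tokens,
                          max_new_tokens=new_tokens)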

Coobiw (Author) commented May 20, 2024

Oh, thanks, I get it. Do you have any suggestions for this? I think I've already applied left-padding. How can I make sure the output lengths match?

tjruwase (Contributor) commented

@Coobiw, I think we first need to confirm whether different prompt/generation lengths are responsible. Can you force all the ranks to process the exact same prompt?
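
For that check, a small sketch (an illustrative helper, not part of DeepSpeed) that broadcasts rank 0's tokenized prompt to every rank, so all GPUs call generate() on exactly the same input:

import torch
import torch.distributed as dist

def broadcast_prompt_from_rank0(tokenizer, prompt, device):
    if dist.get_rank() == 0:
        ids = tokenizer(prompt, return_tensors="pt")["input_ids"].to(device)
        length = torch.tensor([ids.shape[1]], device=device)
    else:
        length = torch.empty(1, dtype=torch.long, device=device)
    dist.broadcast(length, src=0)
    if dist.get_rank() != 0:
        ids = torch.empty((1, int(length.item())), dtype=torch.long, device=device)
    dist.broadcast(ids, src=0)
    return ids

# usage (illustrative):
# outputs = model.generate(input_ids=broadcast_prompt_from_rank0(tokenizer, "hello", device))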
