DeepSpeed doesn't work #3826

Closed
1 task done
N-Kingsley opened this issue May 20, 2024 · 1 comment
Labels
wontfix This will not be worked on

Comments

@N-Kingsley

Reminder

  • I have read the README and searched the existing issues.

Reproduction

examples/lora_multi_gpu/ds_zero3.sh
#!/bin/bash

NPROC_PER_NODE=4

CUDA_VISIBLE_DEVICES=0,1,5,6 python -m torch.distributed.run \
    --nproc_per_node $NPROC_PER_NODE \
    --nnodes 1 \
    --standalone \
    src/train.py examples/lora_multi_gpu/baichuan2_lora_sft_ds.yaml

examples/lora_multi_gpu/baichuan2_lora_sft_ds.yaml

### model

model_name_or_path: /home/nk/home/nk/eveProject_pytorch/ChatGPT/baichuan-inc/Baichuan2-7B-Chat

### method

stage: sft
do_train: true
finetuning_type: lora
lora_target: W_pack

### ddp

ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z3_config.json

### dataset

dataset: identity,alpaca_gpt4_en
template: baichuan2
cutoff_len: 1024
max_samples: 1000
val_size: 0.1
overwrite_cache: true
preprocessing_num_workers: 16

### output

output_dir: saves/baichuan2-7b-chat/lora/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train

per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 0.0001
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_steps: 0.1
fp16: true

### eval

per_device_eval_batch_size: 1
evaluation_strategy: steps
eval_steps: 500

ds_z3_config.json is unchanged:
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
On 4× 2080 Ti cards this still reports out-of-memory, while the same setup runs fine with PyTorch FSDP.
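
For what it's worth: on 11 GB cards, a common mitigation is ZeRO-3 CPU offload via DeepSpeed's documented offload_param and offload_optimizer options; if your checkout includes examples/deepspeed/ds_z3_offload_config.json, pointing the deepspeed: key in the yaml above at that file is the quickest way to try it.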

Expected behavior

No response

System Info

No response

Others

No response

@hiyouga added the wontfix label on May 20, 2024
@hiyouga closed this as not planned on May 20, 2024
@N-Kingsley
Author

Does LoRA fine-tuning currently support DeepSpeed?
I am using ZeRO stage 3 across multiple cards. In theory, the more cards I use, the less VRAM each card should need.
In practice, however, adding more cards leaves the per-card VRAM usage unchanged.
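
For a rough sense of why this can happen: ZeRO-3 shards parameters, gradients, and optimizer states across ranks, but activations, communication buffers, and the per-process CUDA context are replicated on every GPU and do not shrink as cards are added. With LoRA the trainable optimizer state is tiny, so the main sharded item is the frozen fp16 base weights. A minimal back-of-envelope sketch in Python (the parameter count and dtype are illustrative assumptions, not values from this issue):

def zero3_weight_shard_gb(n_params: float, world_size: int, bytes_per_param: int = 2) -> float:
    """Per-GPU share of the fp16 weights under ZeRO-3 sharding."""
    return n_params * bytes_per_param / world_size / 1024 ** 3

if __name__ == "__main__":
    n_params = 7e9  # Baichuan2-7B, approximate (assumption)
    for world_size in (1, 2, 4, 8):
        shard = zero3_weight_shard_gb(n_params, world_size)
        # Activations, buffers, and the CUDA context (often 1-2 GB per card)
        # add on top of this and are replicated, which can mask the sharding effect.
        print(f"{world_size} GPU(s): ~{shard:.1f} GB of weight shards per card")

On 4 cards this predicts roughly 3.5 GB of weight shards plus the replicated per-card overhead, which is consistent with an 11 GB 2080 Ti sitting near its limit regardless of world size.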
