Reminder
Reproduction
bash examples/lora_multi_gpu/ds_zero3.sh
ds_zero3.sh:
#!/bin/bash
NPROC_PER_NODE=4
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.run \
    --nproc_per_node $NPROC_PER_NODE \
    --nnodes 1 \
    --standalone \
    src/train.py examples/lora_multi_gpu/llama3_lora_sft_ds.yaml
llama3_lora_sft_ds.yaml:
model_name_or_path: Meta-Llama-3-8B-Instruct
stage: pt
do_train: true
finetuning_type: lora
lora_target: all
ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z3_config.json
dataset: wikipedia_zh_local
cutoff_len: 4096
val_size: 0.1
overwrite_cache: true
preprocessing_num_workers: 16
output_dir: saves/llama3-8b-instruct/lora/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 0.0001
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_steps: 0.1
fp16: true
per_device_eval_batch_size: 1
evaluation_strategy: steps
eval_steps: 500
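
A side note on this config (an observation, not necessarily the cause of the crash): in Hugging Face `TrainingArguments`, `warmup_steps` expects an integer step count, while a fractional warmup is expressed with `warmup_ratio`. Also, `stage: pt` selects pre-training even though the file name and `output_dir` refer to SFT. A hedged correction, assuming the intent was supervised fine-tuning with 10% warmup:

```yaml
# Hypothetical fix: stage: sft if supervised fine-tuning was intended,
# and warmup_ratio (float) instead of warmup_steps (int) for 10% warmup.
stage: sft
warmup_ratio: 0.1
```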
Error
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1043353 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1043355 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1043356 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 1 (pid: 1043354) of binary: /home/share/ug9tppkr/home/Cnsunrun/.conda/envs/test-003/bin/python
Traceback (most recent call last):
File "/home/share/ug9tppkr/home/Cnsunrun/.conda/envs/test-003/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/share/ug9tppkr/home/Cnsunrun/.conda/envs/test-003/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/share/ug9tppkr/home/Cnsunrun/.conda/envs/test-003/lib/python3.9/site-packages/torch/distributed/run.py", line 766, in <module>
main()
File "/home/share/ug9tppkr/home/Cnsunrun/.conda/envs/test-003/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/share/ug9tppkr/home/Cnsunrun/.conda/envs/test-003/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/home/share/ug9tppkr/home/Cnsunrun/.conda/envs/test-003/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/home/share/ug9tppkr/home/Cnsunrun/.conda/envs/test-003/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/share/ug9tppkr/home/Cnsunrun/.conda/envs/test-003/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
src/train.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2024-05-15_10:04:22
host : whshare-agent-26
rank : 1 (local_rank: 1)
exitcode : -7 (pid: 1043354)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 1043354
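
Exit code -7 means the worker process was killed by signal 7 (SIGBUS). With multi-process data loading (`preprocessing_num_workers: 16`) and NCCL, one common cause is an exhausted or undersized shared-memory mount (`/dev/shm`), especially inside containers started without `--shm-size`. A minimal diagnostic sketch, assuming a Linux host (the 2 GiB threshold below is only a rule of thumb, not a hard limit):

```python
import shutil

# Inspect the shared-memory tmpfs that PyTorch DataLoader workers
# and NCCL use for inter-process buffers.
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm total: {total / 2**30:.1f} GiB, free: {free / 2**30:.1f} GiB")

# Assumption: a few GiB free is usually needed for multi-GPU training
# with many dataloader workers; adjust the threshold for your setup.
if free < 2 * 2**30:
    print("Warning: /dev/shm looks small; consider enlarging it "
          "(e.g. docker run --shm-size=16g ...)")
```

If `/dev/shm` is not the culprit, reducing `preprocessing_num_workers` or checking host memory limits (cgroups, `ulimit`) are other things worth ruling out.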
GPUs
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02 Driver Version: 510.85.02 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-PCI... On | 00000000:01:00.0 Off | 0 |
| N/A 29C P0 35W / 250W | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-PCI... On | 00000000:02:00.0 Off | 0 |
| N/A 29C P0 34W / 250W | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-PCI... On | 00000000:81:00.0 Off | 0 |
| N/A 30C P0 33W / 250W | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-PCI... On | 00000000:82:00.0 Off | 0 |
| N/A 30C P0 33W / 250W | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
Expected behavior
No response
System Info
transformers version: 4.40.2
Others
What causes this error, and how can it be fixed?