Is your code up to date?
I downloaded it on May 6.
Please update the code.
Reminder
Reproduction
torchrun --nproc_per_node ${num_gpu_per_node} --master_port $MASTER_PORT --master_addr $MASTER_ADDR --node_rank $RANK --nnodes $WORLD_SIZE src/train.py \
    --stage sft \
    --model_name_or_path /mnt/cluster/zhangfan/models/01-ai/Yi-1.5-9B-Chat \
    --do_train \
    --do_eval \
    --dataset user_sft_prompt_0516_train_classfiy \
    --template yi \
    --finetuning_type full \
    --output_dir /mnt/cluster/test \
    --preprocessing_num_workers 60 \
    --dataloader_num_workers 60 \
    --val_size 0.03 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --gradient_checkpointing true \
    --cutoff_len 17000 \
    --max_new_tokens 1000 \
    --max_length 16000 \
    --deepspeed examples/deepspeed/ds_z2_config.json \
    --logging_strategy steps \
    --logging_first_step \
    --logging_steps 1 \
    --save_strategy epoch \
    --evaluation_strategy steps \
    --eval_steps 50 \
    --num_train_epochs 10 \
    --lr_scheduler_type cosine \
    --learning_rate 1e-5 \
    --flash_attn auto \
    --plot_loss \
    --bf16 \
    --rope_scaling linear \
    --save_on_each_node false \
    --neftune_noise_alpha 5
Expected behavior
The training log output is incomplete, and it only covers the first 900 steps.
System Info
{"current_steps": 1, "total_steps": 850, "loss": 3.9585, "learning_rate": 9.999965849158597e-06, "epoch": 0.005871559633027523, "percentage": 0.12, "elapsed_time": "0:03:07", "remaining_time": "1 day, 20:18:20"}
{"current_steps": 10, "total_steps": 850, "loss": 0.7958, "learning_rate": 9.996585300715117e-06, "epoch": 0.05871559633027523, "percentage": 1.18, "elapsed_time": "0:28:37", "remaining_time": "1 day, 16:04:13"}
{"current_steps": 20, "total_steps": 850, "loss": 0.2891, "learning_rate": 9.98634586692894e-06, "epoch": 0.11743119266055047, "percentage": 2.35, "elapsed_time": "0:59:04", "remaining_time": "1 day, 16:51:17"}
{"current_steps": 30, "total_steps": 850, "loss": 0.2419, "learning_rate": 9.96929568447637e-06, "epoch": 0.1761467889908257, "percentage": 3.53, "elapsed_time": "1:27:57", "remaining_time": "1 day, 16:03:59"}
{"current_steps": 40, "total_steps": 850, "loss": 0.2254, "learning_rate": 9.945458041855732e-06, "epoch": 0.23486238532110093, "percentage": 4.71, "elapsed_time": "1:57:32", "remaining_time": "1 day, 15:40:04"}
{"current_steps": 50, "total_steps": 850, "loss": 0.213, "learning_rate": 9.91486549841951e-06, "epoch": 0.29357798165137616, "percentage": 5.88, "elapsed_time": "2:26:05", "remaining_time": "1 day, 14:57:27"}
{"current_steps": 50, "total_steps": 850, "eval_loss": 0.20849353075027466, "epoch": 0.29357798165137616, "percentage": 5.88, "elapsed_time": "2:29:16", "remaining_time": "1 day, 15:48:20"}
{"current_steps": 860, "total_steps": 1700, "loss": 0.1414, "learning_rate": 4.907605475204352e-06, "epoch": 5.058715596330275, "percentage": 50.59, "elapsed_time": "0:28:50", "remaining_time": "0:28:10"}
{"current_steps": 870, "total_steps": 1700, "loss": 0.144, "learning_rate": 4.815242503054277e-06, "epoch": 5.1174311926605505, "percentage": 51.18, "elapsed_time": "0:57:42", "remaining_time": "0:55:02"}
{"current_steps": 880, "total_steps": 1700, "loss": 0.1466, "learning_rate": 4.7229426254201504e-06, "epoch": 5.176146788990826, "percentage": 51.76, "elapsed_time": "1:26:22", "remaining_time": "1:20:28"}
{"current_steps": 890, "total_steps": 1700, "loss": 0.1455, "learning_rate": 4.630737362625631e-06, "epoch": 5.234862385321101, "percentage": 52.35, "elapsed_time": "1:54:46", "remaining_time": "1:44:27"}
{"current_steps": 900, "total_steps": 1700, "loss": 0.1445, "learning_rate": 4.53865820268349e-06, "epoch": 5.293577981651376, "percentage": 52.94, "elapsed_time": "2:23:34", "remaining_time": "2:07:37"}
{"current_steps": 900, "total_steps": 1700, "eval_loss": 0.14977410435676575, "epoch": 5.293577981651376, "percentage": 52.94, "elapsed_time": "2:26:45", "remaining_time": "2:10:27"}
Others
No response