I believe the key error happens because the compute_metrics function computes the eval_f1 metric on only one process, while the trainer's _save_checkpoint() method checks for the metric on all processes. A process that reaches the check ahead of process 0 therefore doesn't find the key, which leads to this error.
System Info
transformers version: 4.41.0.dev0
Who can help?
@muellerzr @pacman100 @ArthurZucker @younesbelkada
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
pip install -e . in the cloned transformers folder.
torchrun --nproc_per_node 2 run_qa.py --model_name_or_path google-bert/bert-base-uncased --dataset_name squad --do_train --do_eval --per_device_train_batch_size 12 --learning_rate 3e-5 --num_train_epochs 2 --max_seq_length 384 --doc_stride 128 --output_dir /tmp/debug_squad --max_steps 20 --eval_steps 2 --save_steps 2 --save_total_limit 2 --load_best_model_at_end True --metric_for_best_model eval_f1 --max_eval_samples 20 --eval_strategy steps --save_strategy steps 2>&1 | tee scratch.log
The check in question is in the trainer's _save_checkpoint() method (transformers/src/transformers/trainer.py, line 2820 in 1360801).
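The suspected failure mode can be illustrated with a small, self-contained simulation. This is only a sketch of the reasoning above, not the Trainer's actual code: the per-rank dictionaries, the placeholder metric values, and the helper names lookup_best_metric and ranks_that_fail are all hypothetical and made up for illustration.

```python
# Hypothetical per-process metrics at checkpoint time.
# Assumption: only rank 0 ran compute_metrics, so only rank 0 has "eval_f1".
# The numeric values are placeholders, not real results.
metrics_by_rank = {
    0: {"eval_loss": 1.2, "eval_f1": 71.4},
    1: {"eval_loss": 1.2},
}


def lookup_best_metric(metrics, metric_for_best_model="eval_f1"):
    """Mimic an unguarded dictionary lookup of the configured metric."""
    return metrics[metric_for_best_model]


def ranks_that_fail(metrics_by_rank, metric_for_best_model="eval_f1"):
    """Return the ranks on which the unguarded lookup raises KeyError."""
    failed = []
    for rank, metrics in sorted(metrics_by_rank.items()):
        try:
            lookup_best_metric(metrics, metric_for_best_model)
        except KeyError:
            failed.append(rank)
    return failed


print(ranks_that_fail(metrics_by_rank))  # prints [1]
```

If this picture is right, any rank other than 0 that performs the lookup would raise the key error, regardless of which process gets there first.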
Expected behavior
Ideally, the code should run seamlessly under torchrun, with no key error. The trainer should be able to handle eval_f1 being computed on a single process, as well as the multi-process metric computation done in other workloads such as summarization.
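One possible way to get this behavior, sketched here purely as an assumption and not as the library's fix, would be to share the metrics computed on process 0 with every other process before any of them looks up the metric. The helper name sync_metrics is hypothetical; it relies on torch.distributed.broadcast_object_list and falls back to returning the input unchanged when no process group is initialized.

```python
def sync_metrics(metrics):
    """Return rank 0's metrics on every process (hypothetical helper).

    Outside a distributed run, or if torch is not installed, the input
    dictionary is returned unchanged.
    """
    try:
        import torch.distributed as dist
    except ImportError:
        return metrics

    # Nothing to synchronize when there is no initialized process group.
    if not (dist.is_available() and dist.is_initialized()):
        return metrics

    # Rank 0 supplies the payload; other ranks supply a placeholder
    # that the broadcast overwrites in place.
    payload = [metrics if dist.get_rank() == 0 else None]
    dist.broadcast_object_list(payload, src=0)
    return payload[0]


# Single-process usage: the metrics pass through untouched.
print(sync_metrics({"eval_f1": 1.0}))
```

After such a call, every rank would hold the same dictionary, so a lookup of eval_f1 would succeed on all of them. Whether this belongs in the trainer or in the example script is a design question this sketch does not settle.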