Now all that's left is a trainer to start/continue training:

trainer = Trainer(
    model=model,
    data_collator=CustomDataCollator(),  # This only creates training batches (also tried a default collator)
    args=training_args,
    train_dataset=ds["training"],
    eval_dataset=ds["validation"],
)
trainer.train(resume_from_checkpoint=True)
Whenever I start training from scratch (not resuming from a checkpoint), everything works fine and I can train for days. But as soon as I try to resume from a checkpoint saved during training, I get an OutOfMemory error. The GPU is not occupied by any other task, and I know for sure there are no leaks from other processes. At the same time, the OOM error reports a failed allocation of only 120 MiB of GPU memory, while nvidia-smi shows more than 7 GiB still free.
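The symptom described above (a small 120 MiB allocation failing while gigabytes appear free) is the classic signature of fragmentation inside PyTorch's caching allocator rather than true memory exhaustion. A minimal diagnostic sketch (the function name is mine, not from the thread; the `torch.cuda` calls are standard API):

```python
import torch

def cuda_memory_report(device: int = 0) -> dict:
    """Snapshot of CUDA allocator state. A large gap between reserved and
    allocated bytes suggests fragmentation inside the caching allocator."""
    if not torch.cuda.is_available():
        return {}
    free, total = torch.cuda.mem_get_info(device)    # driver view (what nvidia-smi reports)
    allocated = torch.cuda.memory_allocated(device)  # bytes held by live tensors
    reserved = torch.cuda.memory_reserved(device)    # bytes held by the caching allocator
    return {
        "free_mib": free / 2**20,
        "total_mib": total / 2**20,
        "allocated_mib": allocated / 2**20,
        "reserved_mib": reserved / 2**20,
        "cached_but_unused_mib": (reserved - allocated) / 2**20,
    }
```

Calling this right before the failing step (or printing `torch.cuda.memory_summary()`) would show whether the allocator is holding large, unusable cached blocks when resuming.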
Expected behavior
Resuming from a checkpoint should not run into OOM problems if the model trained successfully before. The expected behavior can be achieved by setting os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128", but this is (a) only a workaround and (b) results in much longer training times.
A quick update: if I run a forward/backward pass myself using native torch with the same batch size, everything works fine as long as I stick to mixed precision (like in the TrainingArguments above):

import torch
from torch import autocast

with autocast(device_type='cuda', dtype=torch.float16):
    y = model(**batch)
y["loss"].backward()
I only run into an OOM if I omit mixed precision. Maybe it is related to that.
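For completeness, a manual float16 autocast loop is normally paired with a gradient scaler to avoid underflow in the backward pass. A self-contained sketch of such a step; `ToyModel` and `amp_step` are illustrative stand-ins, not the reporter's model (the dict-with-"loss" return mirrors the `y["loss"]` access in the snippet above):

```python
import torch
from torch import nn

class ToyModel(nn.Module):
    """Hypothetical stand-in: returns a dict with a 'loss' key."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 1)

    def forward(self, x):
        return {"loss": self.linear(x).pow(2).mean()}

def amp_step(model, batch, optimizer, scaler, device_type="cuda"):
    # float16 autocast is CUDA-only; bfloat16 is used for the CPU fallback
    dtype = torch.float16 if device_type == "cuda" else torch.bfloat16
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type=device_type, dtype=dtype):
        loss = model(batch)["loss"]      # forward under autocast
    scaler.scale(loss).backward()        # backward outside autocast, gradients scaled
    scaler.step(optimizer)               # unscales, skips step if inf/nan found
    scaler.update()
    return float(loss)
```

This mirrors what Trainer does internally when fp16=True, which is consistent with the observation that the manual autocast path behaves the same as Trainer-managed mixed precision.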
I have the same issue, but in my case it also involves DeepSpeed. The OOM is not on the GPU but in system memory, due to CPU offloading of activations. However, just like in this issue, the OOM happens only when resuming, not when starting training from scratch. I've been digging into this on the DeepSpeed side for a while, so any further avenues for potential fixes would be helpful.
I think it might have something to do with this stale DeepSpeed issue, but none of the proposed fixes work.
> A quick update: if I run a forward/backward pass myself using native torch with the same batch size, everything works fine as long as I stick to mixed precision (like in the TrainingArguments above):
>
>     with autocast(device_type='cuda', dtype=torch.float16):
>         y = model(**batch)
>     y["loss"].backward()
>
> I only run into an OOM if I omit mixed precision. Maybe it is related to that.
If I were to use autocast with the HF Trainer, where would I put it? Like this?
System Info
Using GPU in script: A100 80 GB; Driver Version: 550.54.15; CUDA Version: 12.4
Using distributed or parallel setup: No
Who can help?
@ArthurZucker @muellerz @pacman100
Information

Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
This is the definition of my custom model:
I have a dataset with long texts which are chunked to samples of 1024 tokens (padded to said length if required).
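The chunk-and-pad step described above can be sketched without any tokenizer; `chunk_ids`, `PAD_ID`, and the pad value are hypothetical names for illustration (with Hugging Face tokenizers, `return_overflowing_tokens=True` together with `padding="max_length"` achieves the same effect):

```python
# Illustrative only: split a long token-id sequence into fixed-size samples,
# padding the final short chunk. PAD_ID is a hypothetical placeholder value.
MAX_LEN = 1024
PAD_ID = 0

def chunk_ids(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    chunks = []
    for start in range(0, len(token_ids), max_len):
        chunk = token_ids[start:start + max_len]
        if len(chunk) < max_len:  # pad the trailing chunk up to max_len
            chunk = chunk + [pad_id] * (max_len - len(chunk))
        chunks.append(chunk)
    return chunks
```

A 2,500-token text would yield three samples of 1,024 tokens each, the last one padded.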
These are my training arguments:
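The actual training arguments are not reproduced in the scrape. Given the later reference to mixed precision "in the TrainingArguments above", they presumably looked roughly like the following sketch; every value here is a placeholder, not the reporter's setting, except that fp16=True is implied by the thread:

```python
from transformers import TrainingArguments

# Placeholder sketch; only fp16=True is supported by the report itself.
training_args = TrainingArguments(
    output_dir="out",                # hypothetical
    per_device_train_batch_size=8,   # hypothetical
    fp16=True,                       # mixed precision, as referenced later in the thread
    save_strategy="steps",           # hypothetical; checkpoints are clearly being saved
)
```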