
Autotrain CLI suddenly crashes #643

Open · 2 tasks done
dejankocic opened this issue May 16, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@dejankocic

Prerequisites

  • I have read the documentation.
  • I have checked other issues for similar problems.

Backend

Local

Interface Used

CLI

CLI Command

autotrain --config configs/llm_finetuning/llama3-70b-sft.yml

UI Screenshots & Parameters

No response

Error Logs

INFO | 2024-05-16 10:42:17 | autotrain.cli.autotrain:main:52 - Using AutoTrain configuration: configs/llm_finetuning/llama3-70b-sft.yml
INFO | 2024-05-16 10:42:17 | autotrain.parser:post_init:92 - Running task: lm_training
INFO | 2024-05-16 10:42:17 | autotrain.parser:post_init:93 - Using backend: local
WARNING | 2024-05-16 10:42:17 | autotrain.trainers.common:init:174 - Parameters not supplied by user and set to default: logging_steps, max_grad_norm, model_ref, evaluation_strategy, lora_dropout, max_completion_length, lora_r, lora_alpha, use_flash_attention_2, disable_gradient_checkpointing, max_prompt_length, warmup_ratio, dpo_beta, auto_find_batch_size, weight_decay, save_total_limit, merge_adapter, prompt_text_column, rejected_text_column, seed, add_eos_token
INFO | 2024-05-16 10:42:17 | autotrain.parser:run:144 - {'model': 'meta-llama/Meta-Llama-3-70B-Instruct', 'project_name': 'autotrain-llama3-70b-math-v1-TEST', 'data_path': 'dejankocic/HelloWorldDataSet', 'train_split': 'train', 'valid_split': None, 'add_eos_token': True, 'block_size': 1024, 'model_max_length': 4096, 'padding': 'right', 'trainer': 'sft', 'use_flash_attention_2': False, 'log': 'tensorboard', 'disable_gradient_checkpointing': False, 'logging_steps': -1, 'evaluation_strategy': 'epoch', 'save_total_limit': 1, 'auto_find_batch_size': False, 'mixed_precision': 'bf16', 'lr': 1e-05, 'epochs': 2, 'batch_size': 1, 'warmup_ratio': 0.1, 'gradient_accumulation': 8, 'optimizer': 'paged_adamw_8bit', 'scheduler': 'cosine', 'weight_decay': 0.0, 'max_grad_norm': 1.0, 'seed': 42, 'chat_template': None, 'quantization': None, 'target_modules': 'all-linear', 'merge_adapter': False, 'peft': True, 'lora_r': 16, 'lora_alpha': 32, 'lora_dropout': 0.05, 'model_ref': None, 'dpo_beta': 0.1, 'max_prompt_length': 128, 'max_completion_length': None, 'prompt_text_column': None, 'text_column': 'text', 'rejected_text_column': None, 'push_to_hub': True, 'username': 'dejankocic', 'token': ''}
INFO | 2024-05-16 10:42:17 | autotrain.backends.local:create:8 - Starting local training...
INFO | 2024-05-16 10:42:17 | autotrain.commands:launch_command:349 - ['accelerate', 'launch', '--num_machines', '1', '--num_processes', '1', '--mixed_precision', 'bf16', '-m', 'autotrain.trainers.clm', '--training_config', 'autotrain-llama3-70b-math-v1-TEST/training_params.json']
INFO | 2024-05-16 10:42:17 | autotrain.commands:launch_command:350 - {'model': 'meta-llama/Meta-Llama-3-70B-Instruct', 'project_name': 'autotrain-llama3-70b-math-v1-TEST', 'data_path': 'dejankocic/HelloWorldDataSet', 'train_split': 'train', 'valid_split': None, 'add_eos_token': True, 'block_size': 1024, 'model_max_length': 4096, 'padding': 'right', 'trainer': 'sft', 'use_flash_attention_2': False, 'log': 'tensorboard', 'disable_gradient_checkpointing': False, 'logging_steps': -1, 'evaluation_strategy': 'epoch', 'save_total_limit': 1, 'auto_find_batch_size': False, 'mixed_precision': 'bf16', 'lr': 1e-05, 'epochs': 2, 'batch_size': 1, 'warmup_ratio': 0.1, 'gradient_accumulation': 8, 'optimizer': 'paged_adamw_8bit', 'scheduler': 'cosine', 'weight_decay': 0.0, 'max_grad_norm': 1.0, 'seed': 42, 'chat_template': None, 'quantization': None, 'target_modules': 'all-linear', 'merge_adapter': False, 'peft': True, 'lora_r': 16, 'lora_alpha': 32, 'lora_dropout': 0.05, 'model_ref': None, 'dpo_beta': 0.1, 'max_prompt_length': 128, 'max_completion_length': None, 'prompt_text_column': None, 'text_column': 'text', 'rejected_text_column': None, 'push_to_hub': True, 'username': 'dejankocic', 'token': ''}
The following values were not passed to accelerate launch and had defaults used instead:
--dynamo_backend was set to a value of 'no'
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
INFO | 2024-05-16 10:42:21 | autotrain.trainers.clm.train_clm_sft:train:14 - Starting SFT training...
INFO | 2024-05-16 10:42:25 | autotrain.trainers.clm.utils:process_input_data:352 - Train data: Dataset({
features: ['text'],
num_rows: 321
})
INFO | 2024-05-16 10:42:25 | autotrain.trainers.clm.utils:process_input_data:353 - Valid data: None
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO | 2024-05-16 10:42:26 | autotrain.trainers.clm.utils:configure_logging_steps:423 - configuring logging steps
INFO | 2024-05-16 10:42:26 | autotrain.trainers.clm.utils:configure_logging_steps:436 - Logging steps: 25
INFO | 2024-05-16 10:42:26 | autotrain.trainers.clm.utils:configure_training_args:441 - configuring training args
INFO | 2024-05-16 10:42:26 | autotrain.trainers.clm.utils:configure_block_size:504 - Using block size 1024
INFO | 2024-05-16 10:42:26 | autotrain.trainers.clm.train_clm_sft:train:27 - loading model config...
INFO | 2024-05-16 10:42:26 | autotrain.trainers.clm.train_clm_sft:train:35 - loading model...
Loading checkpoint shards: 27%|███████████████▏ | 8/30 [00:48<02:59, 8.14s/it]
Traceback (most recent call last):
File "/home/dejan/python39venv/bin/accelerate", line 8, in
sys.exit(main())
File "/home/dejan/python39venv/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
args.func(args)
File "/home/dejan/python39venv/lib/python3.9/site-packages/accelerate/commands/launch.py", line 1075, in launch_command
simple_launcher(args)
File "/home/dejan/python39venv/lib/python3.9/site-packages/accelerate/commands/launch.py", line 681, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/dejan/python39venv/bin/python3.9', '-m', 'autotrain.trainers.clm', '--training_config', 'autotrain-llama3-70b-math-v1-TEST/training_params.json']' died with <Signals.SIGKILL: 9>.
INFO | 2024-05-16 10:43:27 | autotrain.parser:run:149 - Job ID: 398903

Additional Information

The workstation I am running the script on has 128GB of RAM, so I don't think that is the problem. BTW, RAM utilization goes up to about 60GB.
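
A quick back-of-envelope actually points at RAM: with 'quantization': None and bf16 weights, the 70B parameters alone need roughly 140GB to load, more than the 128GB available, so the kernel's OOM killer SIGKILLs the process partway through the checkpoint shards. A minimal sketch of the arithmetic (assuming the nominal 70B parameter count):

  # Back-of-envelope estimate for loading the model weights.
  # Assumptions: ~70e9 parameters (nominal), bf16 (2 bytes/param), no
  # quantization, matching 'mixed_precision': 'bf16' and 'quantization': None
  # from the log above.
  n_params = 70e9
  bytes_per_param = 2  # bf16
  weights_gb = n_params * bytes_per_param / 1e9
  print(f"Weights alone: ~{weights_gb:.0f} GB")  # ~140 GB > 128 GB system RAM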

@dejankocic added the bug label on May 16, 2024
@abhishekkrthakur (Member)

It seems like you are using a single GPU. The config you are using was tested on 8x H100.

@dejankocic (Author)

Indeed, it is a single-GPU machine with an RTX 2080 Ti. Is there any configuration I could change so that it runs on this setup?
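
One possible direction, sketched against the schema of the example configs in the repo (the 8B base model and the exact values here are assumptions, not an officially tested config): even in int4, 70B weights come to roughly 35GB, far beyond the 2080 Ti's 11GB, so a realistic single-GPU setup means a smaller model plus 4-bit quantization, e.g.:

  task: llm-sft
  base_model: meta-llama/Meta-Llama-3-8B-Instruct  # hypothetical swap; 70B will not fit in 11GB
  project_name: autotrain-llama3-8b-sft-single-gpu
  log: tensorboard
  backend: local

  data:
    path: dejankocic/HelloWorldDataSet
    train_split: train
    column_mapping:
      text_column: text

  params:
    block_size: 1024
    epochs: 2
    batch_size: 1
    lr: 1e-5
    peft: true
    quantization: int4        # was None in the failing run
    target_modules: all-linear
    optimizer: paged_adamw_8bit
    scheduler: cosine
    gradient_accumulation: 8
    mixed_precision: fp16     # the 2080 Ti (Turing) has no native bf16 support

  hub:
    username: ${HF_USERNAME}
    token: ${HF_TOKEN}
    push_to_hub: true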

@hichambht32

I don't think so. Try checking a memory usage estimation tool, like this one, to find out how much VRAM you actually need.
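
For a quick number, recent accelerate releases also ship a model-memory estimator; a sketch (assumes your Hugging Face token has access to the gated Meta-Llama weights):

  accelerate estimate-memory meta-llama/Meta-Llama-3-70B-Instruct --library_name transformers

It prints the footprint per dtype, which gives a lower bound on the VRAM you would need.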
