
Autotrain CLI suddenly crashes #643

Open · 2 tasks done
dejankocic opened this issue May 16, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@dejankocic

Prerequisites

  • I have read the documentation.
  • I have checked other issues for similar problems.

Backend

Local

Interface Used

CLI

CLI Command

autotrain --config configs/llm_finetuning/llama3-70b-sft.yml

UI Screenshots & Parameters

No response

Error Logs

INFO | 2024-05-16 10:42:17 | autotrain.cli.autotrain:main:52 - Using AutoTrain configuration: configs/llm_finetuning/llama3-70b-sft.yml
INFO | 2024-05-16 10:42:17 | autotrain.parser:post_init:92 - Running task: lm_training
INFO | 2024-05-16 10:42:17 | autotrain.parser:post_init:93 - Using backend: local
WARNING | 2024-05-16 10:42:17 | autotrain.trainers.common:init:174 - Parameters not supplied by user and set to default: logging_steps, max_grad_norm, model_ref, evaluation_strategy, lora_dropout, max_completion_length, lora_r, lora_alpha, use_flash_attention_2, disable_gradient_checkpointing, max_prompt_length, warmup_ratio, dpo_beta, auto_find_batch_size, weight_decay, save_total_limit, merge_adapter, prompt_text_column, rejected_text_column, seed, add_eos_token
INFO | 2024-05-16 10:42:17 | autotrain.parser:run:144 - {'model': 'meta-llama/Meta-Llama-3-70B-Instruct', 'project_name': 'autotrain-llama3-70b-math-v1-TEST', 'data_path': 'dejankocic/HelloWorldDataSet', 'train_split': 'train', 'valid_split': None, 'add_eos_token': True, 'block_size': 1024, 'model_max_length': 4096, 'padding': 'right', 'trainer': 'sft', 'use_flash_attention_2': False, 'log': 'tensorboard', 'disable_gradient_checkpointing': False, 'logging_steps': -1, 'evaluation_strategy': 'epoch', 'save_total_limit': 1, 'auto_find_batch_size': False, 'mixed_precision': 'bf16', 'lr': 1e-05, 'epochs': 2, 'batch_size': 1, 'warmup_ratio': 0.1, 'gradient_accumulation': 8, 'optimizer': 'paged_adamw_8bit', 'scheduler': 'cosine', 'weight_decay': 0.0, 'max_grad_norm': 1.0, 'seed': 42, 'chat_template': None, 'quantization': None, 'target_modules': 'all-linear', 'merge_adapter': False, 'peft': True, 'lora_r': 16, 'lora_alpha': 32, 'lora_dropout': 0.05, 'model_ref': None, 'dpo_beta': 0.1, 'max_prompt_length': 128, 'max_completion_length': None, 'prompt_text_column': None, 'text_column': 'text', 'rejected_text_column': None, 'push_to_hub': True, 'username': 'dejankocic', 'token': ''}
INFO | 2024-05-16 10:42:17 | autotrain.backends.local:create:8 - Starting local training...
INFO | 2024-05-16 10:42:17 | autotrain.commands:launch_command:349 - ['accelerate', 'launch', '--num_machines', '1', '--num_processes', '1', '--mixed_precision', 'bf16', '-m', 'autotrain.trainers.clm', '--training_config', 'autotrain-llama3-70b-math-v1-TEST/training_params.json']
INFO | 2024-05-16 10:42:17 | autotrain.commands:launch_command:350 - {'model': 'meta-llama/Meta-Llama-3-70B-Instruct', 'project_name': 'autotrain-llama3-70b-math-v1-TEST', 'data_path': 'dejankocic/HelloWorldDataSet', 'train_split': 'train', 'valid_split': None, 'add_eos_token': True, 'block_size': 1024, 'model_max_length': 4096, 'padding': 'right', 'trainer': 'sft', 'use_flash_attention_2': False, 'log': 'tensorboard', 'disable_gradient_checkpointing': False, 'logging_steps': -1, 'evaluation_strategy': 'epoch', 'save_total_limit': 1, 'auto_find_batch_size': False, 'mixed_precision': 'bf16', 'lr': 1e-05, 'epochs': 2, 'batch_size': 1, 'warmup_ratio': 0.1, 'gradient_accumulation': 8, 'optimizer': 'paged_adamw_8bit', 'scheduler': 'cosine', 'weight_decay': 0.0, 'max_grad_norm': 1.0, 'seed': 42, 'chat_template': None, 'quantization': None, 'target_modules': 'all-linear', 'merge_adapter': False, 'peft': True, 'lora_r': 16, 'lora_alpha': 32, 'lora_dropout': 0.05, 'model_ref': None, 'dpo_beta': 0.1, 'max_prompt_length': 128, 'max_completion_length': None, 'prompt_text_column': None, 'text_column': 'text', 'rejected_text_column': None, 'push_to_hub': True, 'username': 'dejankocic', 'token': ''}
The following values were not passed to accelerate launch and had defaults used instead:
--dynamo_backend was set to a value of 'no'
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
INFO | 2024-05-16 10:42:21 | autotrain.trainers.clm.train_clm_sft:train:14 - Starting SFT training...
INFO | 2024-05-16 10:42:25 | autotrain.trainers.clm.utils:process_input_data:352 - Train data: Dataset({
features: ['text'],
num_rows: 321
})
INFO | 2024-05-16 10:42:25 | autotrain.trainers.clm.utils:process_input_data:353 - Valid data: None
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO | 2024-05-16 10:42:26 | autotrain.trainers.clm.utils:configure_logging_steps:423 - configuring logging steps
INFO | 2024-05-16 10:42:26 | autotrain.trainers.clm.utils:configure_logging_steps:436 - Logging steps: 25
INFO | 2024-05-16 10:42:26 | autotrain.trainers.clm.utils:configure_training_args:441 - configuring training args
INFO | 2024-05-16 10:42:26 | autotrain.trainers.clm.utils:configure_block_size:504 - Using block size 1024
INFO | 2024-05-16 10:42:26 | autotrain.trainers.clm.train_clm_sft:train:27 - loading model config...
INFO | 2024-05-16 10:42:26 | autotrain.trainers.clm.train_clm_sft:train:35 - loading model...
Loading checkpoint shards: 27%|███████████████▏ | 8/30 [00:48<02:59, 8.14s/it]
Traceback (most recent call last):
File "/home/dejan/python39venv/bin/accelerate", line 8, in
sys.exit(main())
File "/home/dejan/python39venv/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
args.func(args)
File "/home/dejan/python39venv/lib/python3.9/site-packages/accelerate/commands/launch.py", line 1075, in launch_command
simple_launcher(args)
File "/home/dejan/python39venv/lib/python3.9/site-packages/accelerate/commands/launch.py", line 681, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/dejan/python39venv/bin/python3.9', '-m', 'autotrain.trainers.clm', '--training_config', 'autotrain-llama3-70b-math-v1-TEST/training_params.json']' died with <Signals.SIGKILL: 9>.
INFO | 2024-05-16 10:43:27 | autotrain.parser:run:149 - Job ID: 398903

Additional Information

The workstation I am running the script on has 128GB of RAM, so I don't think that is the problem. BTW, RAM utilization goes up to about 60GB.
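
A quick back-of-envelope actually points at RAM: with 'quantization': None and bf16 weights, the 70B parameters alone need roughly 140GB to load, more than the 128GB available, so the kernel's OOM killer SIGKILLs the process partway through the checkpoint shards. A minimal sketch of the arithmetic (assuming the nominal 70B parameter count):

  # Back-of-envelope estimate for loading the model weights.
  # Assumptions: ~70e9 parameters (nominal), bf16 (2 bytes/param), no
  # quantization, matching 'mixed_precision': 'bf16' and 'quantization': None
  # from the log above.
  n_params = 70e9
  bytes_per_param = 2  # bf16
  weights_gb = n_params * bytes_per_param / 1e9
  print(f"Weights alone: ~{weights_gb:.0f} GB")  # ~140 GB > 128 GB system RAM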

@dejankocic added the bug label on May 16, 2024
@abhishekkrthakur (Member)

It seems like you are using a single GPU. The config you are using was tested on 8x H100.

@dejankocic (Author)

Indeed, it is a single-GPU machine with an RTX 2080 Ti. Is there any configuration I could change so that it runs on this setup?
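
One possible direction, sketched against the schema of the example configs in the repo (the 8B base model and the exact values here are assumptions, not an officially tested config): even in int4, 70B weights come to roughly 35GB, far beyond the 2080 Ti's 11GB, so a realistic single-GPU setup means a smaller model plus 4-bit quantization, e.g.:

  task: llm-sft
  base_model: meta-llama/Meta-Llama-3-8B-Instruct  # hypothetical swap; 70B will not fit in 11GB
  project_name: autotrain-llama3-8b-sft-single-gpu
  log: tensorboard
  backend: local

  data:
    path: dejankocic/HelloWorldDataSet
    train_split: train
    column_mapping:
      text_column: text

  params:
    block_size: 1024
    epochs: 2
    batch_size: 1
    lr: 1e-5
    peft: true
    quantization: int4        # was None in the failing run
    target_modules: all-linear
    optimizer: paged_adamw_8bit
    scheduler: cosine
    gradient_accumulation: 8
    mixed_precision: fp16     # the 2080 Ti (Turing) has no native bf16 support

  hub:
    username: ${HF_USERNAME}
    token: ${HF_TOKEN}
    push_to_hub: true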

@hichambht32

I don't think so. Try checking a memory usage estimation tool, like this one, to find out how much VRAM you actually need.
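
For a quick number, recent accelerate releases also ship a model-memory estimator; a sketch (assumes your Hugging Face token has access to the gated Meta-Llama weights):

  accelerate estimate-memory meta-llama/Meta-Llama-3-70B-Instruct --library_name transformers

It prints the footprint per dtype, which gives a lower bound on the VRAM you would need.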
