Prerequisites
Backend
Local
Interface Used
CLI
CLI Command
autotrain --config configs/llm_finetuning/llama3-70b-sft.yml
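
For context, the referenced config looks roughly like this. This is a sketch reconstructed from the parameter dump in the error logs below, following the layout of the llm_finetuning examples in the autotrain-advanced repo; field names and the shipped file may differ slightly, and the hub credentials are shown as environment placeholders:

task: llm-sft
base_model: meta-llama/Meta-Llama-3-70B-Instruct
project_name: autotrain-llama3-70b-math-v1-TEST
log: tensorboard
backend: local

data:
  path: dejankocic/HelloWorldDataSet
  train_split: train
  valid_split: null
  chat_template: null
  column_mapping:
    text_column: text

params:
  block_size: 1024
  model_max_length: 4096
  epochs: 2
  batch_size: 1
  lr: 1e-5
  padding: right
  optimizer: paged_adamw_8bit
  scheduler: cosine
  gradient_accumulation: 8
  mixed_precision: bf16
  peft: true
  quantization: null
  target_modules: all-linear

hub:
  username: ${HF_USERNAME}
  token: ${HF_TOKEN}
  push_to_hub: true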
UI Screenshots & Parameters
No response
Error Logs
INFO | 2024-05-16 10:42:17 | autotrain.cli.autotrain:main:52 - Using AutoTrain configuration: configs/llm_finetuning/llama3-70b-sft.yml
INFO | 2024-05-16 10:42:17 | autotrain.parser:post_init:92 - Running task: lm_training
INFO | 2024-05-16 10:42:17 | autotrain.parser:post_init:93 - Using backend: local
WARNING | 2024-05-16 10:42:17 | autotrain.trainers.common:init:174 - Parameters not supplied by user and set to default: logging_steps, max_grad_norm, model_ref, evaluation_strategy, lora_dropout, max_completion_length, lora_r, lora_alpha, use_flash_attention_2, disable_gradient_checkpointing, max_prompt_length, warmup_ratio, dpo_beta, auto_find_batch_size, weight_decay, save_total_limit, merge_adapter, prompt_text_column, rejected_text_column, seed, add_eos_token
INFO | 2024-05-16 10:42:17 | autotrain.parser:run:144 - {'model': 'meta-llama/Meta-Llama-3-70B-Instruct', 'project_name': 'autotrain-llama3-70b-math-v1-TEST', 'data_path': 'dejankocic/HelloWorldDataSet', 'train_split': 'train', 'valid_split': None, 'add_eos_token': True, 'block_size': 1024, 'model_max_length': 4096, 'padding': 'right', 'trainer': 'sft', 'use_flash_attention_2': False, 'log': 'tensorboard', 'disable_gradient_checkpointing': False, 'logging_steps': -1, 'evaluation_strategy': 'epoch', 'save_total_limit': 1, 'auto_find_batch_size': False, 'mixed_precision': 'bf16', 'lr': 1e-05, 'epochs': 2, 'batch_size': 1, 'warmup_ratio': 0.1, 'gradient_accumulation': 8, 'optimizer': 'paged_adamw_8bit', 'scheduler': 'cosine', 'weight_decay': 0.0, 'max_grad_norm': 1.0, 'seed': 42, 'chat_template': None, 'quantization': None, 'target_modules': 'all-linear', 'merge_adapter': False, 'peft': True, 'lora_r': 16, 'lora_alpha': 32, 'lora_dropout': 0.05, 'model_ref': None, 'dpo_beta': 0.1, 'max_prompt_length': 128, 'max_completion_length': None, 'prompt_text_column': None, 'text_column': 'text', 'rejected_text_column': None, 'push_to_hub': True, 'username': 'dejankocic', 'token': ''}
INFO | 2024-05-16 10:42:17 | autotrain.backends.local:create:8 - Starting local training...
INFO | 2024-05-16 10:42:17 | autotrain.commands:launch_command:349 - ['accelerate', 'launch', '--num_machines', '1', '--num_processes', '1', '--mixed_precision', 'bf16', '-m', 'autotrain.trainers.clm', '--training_config', 'autotrain-llama3-70b-math-v1-TEST/training_params.json']
INFO | 2024-05-16 10:42:17 | autotrain.commands:launch_command:350 - {'model': 'meta-llama/Meta-Llama-3-70B-Instruct', 'project_name': 'autotrain-llama3-70b-math-v1-TEST', 'data_path': 'dejankocic/HelloWorldDataSet', 'train_split': 'train', 'valid_split': None, 'add_eos_token': True, 'block_size': 1024, 'model_max_length': 4096, 'padding': 'right', 'trainer': 'sft', 'use_flash_attention_2': False, 'log': 'tensorboard', 'disable_gradient_checkpointing': False, 'logging_steps': -1, 'evaluation_strategy': 'epoch', 'save_total_limit': 1, 'auto_find_batch_size': False, 'mixed_precision': 'bf16', 'lr': 1e-05, 'epochs': 2, 'batch_size': 1, 'warmup_ratio': 0.1, 'gradient_accumulation': 8, 'optimizer': 'paged_adamw_8bit', 'scheduler': 'cosine', 'weight_decay': 0.0, 'max_grad_norm': 1.0, 'seed': 42, 'chat_template': None, 'quantization': None, 'target_modules': 'all-linear', 'merge_adapter': False, 'peft': True, 'lora_r': 16, 'lora_alpha': 32, 'lora_dropout': 0.05, 'model_ref': None, 'dpo_beta': 0.1, 'max_prompt_length': 128, 'max_completion_length': None, 'prompt_text_column': None, 'text_column': 'text', 'rejected_text_column': None, 'push_to_hub': True, 'username': 'dejankocic', 'token': ''}
The following values were not passed to accelerate launch and had defaults used instead:
	--dynamo_backend was set to a value of 'no'
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
INFO | 2024-05-16 10:42:21 | autotrain.trainers.clm.train_clm_sft:train:14 - Starting SFT training...
INFO | 2024-05-16 10:42:25 | autotrain.trainers.clm.utils:process_input_data:352 - Train data: Dataset({
features: ['text'],
num_rows: 321
})
INFO | 2024-05-16 10:42:25 | autotrain.trainers.clm.utils:process_input_data:353 - Valid data: None
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO | 2024-05-16 10:42:26 | autotrain.trainers.clm.utils:configure_logging_steps:423 - configuring logging steps
INFO | 2024-05-16 10:42:26 | autotrain.trainers.clm.utils:configure_logging_steps:436 - Logging steps: 25
INFO | 2024-05-16 10:42:26 | autotrain.trainers.clm.utils:configure_training_args:441 - configuring training args
INFO | 2024-05-16 10:42:26 | autotrain.trainers.clm.utils:configure_block_size:504 - Using block size 1024
INFO | 2024-05-16 10:42:26 | autotrain.trainers.clm.train_clm_sft:train:27 - loading model config...
INFO | 2024-05-16 10:42:26 | autotrain.trainers.clm.train_clm_sft:train:35 - loading model...
Loading checkpoint shards: 27%|███████████████▏ | 8/30 [00:48<02:59, 8.14s/it]
Traceback (most recent call last):
File "/home/dejan/python39venv/bin/accelerate", line 8, in
sys.exit(main())
File "/home/dejan/python39venv/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
args.func(args)
File "/home/dejan/python39venv/lib/python3.9/site-packages/accelerate/commands/launch.py", line 1075, in launch_command
simple_launcher(args)
File "/home/dejan/python39venv/lib/python3.9/site-packages/accelerate/commands/launch.py", line 681, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/dejan/python39venv/bin/python3.9', '-m', 'autotrain.trainers.clm', '--training_config', 'autotrain-llama3-70b-math-v1-TEST/training_params.json']' died with <Signals.SIGKILL: 9>.
INFO | 2024-05-16 10:43:27 | autotrain.parser:run:149 - Job ID: 398903
Additional Information
The workstation I am running the script on has 128 GB of RAM, so I don't think this is the case. BTW, RAM utilization goes up to 60 GB.
it seems like you are using a single gpu. the config that you are using was tested on 8xH100.

Indeed it is a single GPU machine, RTX 2080 Ti. Is there any available configuration I could change so it runs on this hardware?

i don't think so, try to check some memory usage estimation tools like this one in order to know how much VRAM you actually need.
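
A rough back-of-the-envelope for the memory involved, assuming ~70B parameters and standard dtype widths (weights only; optimizer states, gradients, activations, and CUDA overhead come on top):

def weight_memory_gib(n_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the model weights, in GiB."""
    return n_params * bytes_per_param / 1024**3

N_PARAMS = 70e9  # Meta-Llama-3-70B-Instruct, roughly

for dtype, nbytes in [("fp32", 4), ("bf16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{dtype}: ~{weight_memory_gib(N_PARAMS, nbytes):.0f} GiB")

# bf16, the precision of this run (with quantization: None), needs ~130 GiB
# for the weights alone; an RTX 2080 Ti has 11 GiB of VRAM. The SIGKILL in
# the traceback above is consistent with the OS OOM killer stopping the
# process while the checkpoint shards were still loading.

If your accelerate version ships it, accelerate estimate-memory meta-llama/Meta-Llama-3-70B-Instruct prints a similar per-dtype estimate without downloading the weights.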