Adam is slower than Adafactor #1920

Open
shizhediao opened this issue Nov 26, 2022 · 0 comments

Hi,
I found that training the Transformer with Adam is three times slower than with Adafactor.
Here is the command I am using for Adam:

t2t-trainer \
  --data_dir=./t2t/t2t_data \
  --problem=translate_ende_wmt32k \
  --model=transformer \
  --hparams_set=transformer_base \
  --hparams="batch_size=1024,learning_rate_schedule=constant*linear_warmup*rsqrt_decay, learning_rate_constant=0.1,optimizer_adam_beta2=0.999" \
  --schedule=continuous_train_and_eval \
  --output_dir=./t2t/t2t_train/translate_ende_wmt32k_adam_lineB \
  --train_steps=300000 \
  --worker_gpu=10 \
  --eval_steps=5000

Here is the command I am using for Adafactor:

t2t-trainer \
  --data_dir=./t2t/t2t_data \
  --problem=translate_ende_wmt32k \
  --model=transformer \
  --hparams_set=transformer_base \
  --hparams="optimizer_adafactor_factored=False,batch_size=1024,optimizer=Adafactor,learning_rate_schedule=constant*linear_warmup*rsqrt_decay, learning_rate_constant=0.1,optimizer_adafactor_multiply_by_parameter_scale=False" \
  --schedule=continuous_train_and_eval \
  --output_dir=./t2t/t2t_train/translate_ende_wmt32k_adafactor_lineN \
  --train_steps=300000 \
  --worker_gpu=10 \
  --eval_steps=5000

I found that 100 training steps take 240 seconds with Adam, but only 80 seconds with Adafactor.
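
For context, here is a rough numpy sketch of how I understand the two update rules (textbook formulas from the Adam and Adafactor papers, not the t2t code; the helper names are mine, and since I set optimizer_adafactor_factored=False above, the factored variant is shown only for reference). Adam keeps two full-size accumulators per weight matrix, while factored Adafactor keeps only per-row and per-column statistics.

import numpy as np

# Rough sketch of one optimizer step for a single weight matrix.
# Textbook formulas only; Adafactor's update clipping, parameter
# scaling, and decaying-beta2 schedule are omitted.

def adam_step(w, g, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, t=1):
    # Adam keeps two accumulators (m, v), each the same shape as w.
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def adafactor_step(w, g, r, c, lr=1e-3, b2=0.999, eps=1e-30):
    # Factored Adafactor (no momentum) keeps only row and column
    # statistics, so the state is O(rows + cols) instead of O(rows * cols).
    g2 = g**2 + eps
    r = b2 * r + (1 - b2) * g2.mean(axis=1)   # per-row second moment
    c = b2 * c + (1 - b2) * g2.mean(axis=0)   # per-column second moment
    v_hat = np.outer(r, c) / r.mean()         # rank-1 reconstruction of v
    return w - lr * g / np.sqrt(v_hat), r, c

n_rows, n_cols = 1024, 1024
w = np.zeros((n_rows, n_cols))
g = np.random.randn(n_rows, n_cols)

# Adam carries 2 * n_rows * n_cols extra floats per matrix;
# factored Adafactor carries n_rows + n_cols.
w1, m_acc, v_acc = adam_step(w, g, np.zeros((n_rows, n_cols)), np.zeros((n_rows, n_cols)))
w2, r_acc, c_acc = adafactor_step(w, g, np.zeros(n_rows), np.zeros(n_cols))

This is only meant to show where the optimizer state and per-element work differ between the two; it is not an explanation of the 3x gap I am seeing.
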
Could someone take a look?

Thanks very much!
