
Can't reproduce pretraining results for Wav2vec2 using LibriSpeech recipe #2512

Open
GasserElbanna opened this issue Apr 16, 2024 · 9 comments

@GasserElbanna

Describe the bug

Hello, I am pretraining Wav2vec2 following the instructions on this page. The pretraining itself went very smoothly (thank you for that!!). However, when I compared my training logs with the ones published here, I found that my model finished the 400K steps in only 25 epochs (using 8 A100 GPUs) and with a lower accuracy (~60%), as opposed to 700 epochs and an accuracy of around 68% for the example checkpoint. My training also finished within only 2 days, which is confusing.

Expected behaviour

I expect similar training performance when looking at the training logs.

To Reproduce

Below is the training log for my model, which differs from the example one here:

epoch: 1, steps: 18223, lr: 3.04e-04 - train loss: 2.13e+04 - valid loss: 2.37e+03, valid accuracy: 0.35966309905052185
epoch: 2, steps: 36446, lr: 4.51e-04 - train loss: 1.66e+04 - valid loss: 2.07e+03, valid accuracy: 0.4265761077404022
epoch: 3, steps: 54669, lr: 4.28e-04 - train loss: 1.53e+04 - valid loss: 1.91e+03, valid accuracy: 0.46357864141464233
epoch: 4, steps: 72892, lr: 4.05e-04 - train loss: 1.47e+04 - valid loss: 1.83e+03, valid accuracy: 0.485844224691391
epoch: 5, steps: 91115, lr: 3.83e-04 - train loss: 1.43e+04 - valid loss: 1.77e+03, valid accuracy: 0.4983205795288086
epoch: 6, steps: 109338, lr: 3.60e-04 - train loss: 1.39e+04 - valid loss: 1.73e+03, valid accuracy: 0.5098890066146851
epoch: 7, steps: 127561, lr: 3.38e-04 - train loss: 1.36e+04 - valid loss: 1.68e+03, valid accuracy: 0.5208209753036499
epoch: 8, steps: 145784, lr: 3.15e-04 - train loss: 1.33e+04 - valid loss: 1.63e+03, valid accuracy: 0.5316717028617859
epoch: 9, steps: 164007, lr: 2.93e-04 - train loss: 1.30e+04 - valid loss: 1.59e+03, valid accuracy: 0.5391503572463989
epoch: 10, steps: 182230, lr: 2.70e-04 - train loss: 1.26e+04 - valid loss: 1.56e+03, valid accuracy: 0.5474251508712769
epoch: 11, steps: 200453, lr: 2.47e-04 - train loss: 1.23e+04 - valid loss: 1.52e+03, valid accuracy: 0.5530011653900146
epoch: 12, steps: 218676, lr: 2.25e-04 - train loss: 1.21e+04 - valid loss: 1.49e+03, valid accuracy: 0.5636028051376343
epoch: 13, steps: 236899, lr: 2.02e-04 - train loss: 1.18e+04 - valid loss: 1.46e+03, valid accuracy: 0.5687620639801025
epoch: 14, steps: 255122, lr: 1.80e-04 - train loss: 1.17e+04 - valid loss: 1.45e+03, valid accuracy: 0.5734390020370483
epoch: 15, steps: 273345, lr: 1.57e-04 - train loss: 1.15e+04 - valid loss: 1.43e+03, valid accuracy: 0.5788331031799316
epoch: 16, steps: 291568, lr: 1.34e-04 - train loss: 1.13e+04 - valid loss: 1.42e+03, valid accuracy: 0.5822480916976929
epoch: 17, steps: 309791, lr: 1.12e-04 - train loss: 1.12e+04 - valid loss: 1.40e+03, valid accuracy: 0.586252748966217
epoch: 18, steps: 328014, lr: 8.92e-05 - train loss: 1.11e+04 - valid loss: 1.39e+03, valid accuracy: 0.5907050967216492
epoch: 19, steps: 346237, lr: 6.66e-05 - train loss: 1.10e+04 - valid loss: 1.37e+03, valid accuracy: 0.596407413482666
epoch: 20, steps: 364460, lr: 4.41e-05 - train loss: 1.08e+04 - valid loss: 1.36e+03, valid accuracy: 0.5983026623725891
epoch: 21, steps: 382683, lr: 2.15e-05 - train loss: 1.08e+04 - valid loss: 1.34e+03, valid accuracy: 0.6026105880737305
epoch: 22, steps: 400000, lr: 0.00e+00 - train loss: 1.07e+04 - valid loss: 1.34e+03, valid accuracy: 0.6060941815376282
epoch: 23, steps: 400000, lr: 0.00e+00 - train loss: 0.00e+00 - valid loss: 1.33e+03, valid accuracy: 0.6060227155685425
epoch: 24, steps: 400000, lr: 0.00e+00 - train loss: 0.00e+00 - valid loss: 1.33e+03, valid accuracy: 0.6051703691482544
epoch: 25, steps: 400000, lr: 0.00e+00 - train loss: 0.00e+00 - valid loss: 1.33e+03, valid accuracy: 0.6063333749771118

Environment Details

I am using Python 3.11 and SpeechBrain 1.0.

Relevant Log Output

No response

Additional Context

No response

@GasserElbanna GasserElbanna added the bug Something isn't working label Apr 16, 2024
@Adel-Moumen Adel-Moumen self-assigned this Apr 18, 2024
@Adel-Moumen
Collaborator

Hello @GasserElbanna, thanks a lot for opening this issue!

Could @TParcollet and/or @salah-zaiem please have a look? Thanks a lot :)

@TParcollet
Collaborator

Hi, it's important that the total batch size corresponds to roughly 1.6h of speech. You can adjust this by changing the gradient accumulation factor.

@GasserElbanna
Author

Hello, thank you for the quick response. I used the default config file for pre-training, so I am assuming these are the parameters below that I need to adjust?

Dynamic Batching parameters:
max_batch_length: 200 # in seconds of audio; fits on a 32GB GPU (V100)
num_buckets: 70
shuffle: True # if True, re-creates batches at each epoch, shuffling examples
batch_ordering: random

@TParcollet
Collaborator

@Adel-Moumen I see that the gradient accumulation factor is missing from this recipe. Could you add it? (No need for a PR imho, push directly to develop.)

@GasserElbanna have a look at any other ASR yaml in the libri folder; you will find the gradient accumulation factor param. Just copy and paste it anywhere in this yaml. Then play with grad accum / max batch len to make sure that you have 1.2-1.6h of speech per batch: grad_accum * max_batch_len * nb_gpus ≈ 1.6h.

Also, your A100s can certainly accommodate more than 200s.
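
For concreteness, here is a back-of-the-envelope check of that rule, plugging in the numbers from this thread (a sketch only; it assumes max_batch_length is expressed in seconds of audio, as in this recipe):

```python
# Sanity check of the rule: grad_accum * max_batch_length * n_gpus ~= 1.2-1.6h.
# Assumes max_batch_length is in seconds of audio, as in this recipe.
def effective_batch_hours(grad_accum: int, max_batch_length: float, n_gpus: int) -> float:
    """Hours of speech seen per optimizer step."""
    return grad_accum * max_batch_length * n_gpus / 3600.0

# Original run: no accumulation, 200 s per GPU, 8 GPUs -> ~0.44h (too small).
print(effective_batch_hours(1, 200, 8))  # 0.444...
# Adjusted run discussed below: 2 * 400 s * 8 GPUs = 6400 s -> ~1.78h.
print(effective_batch_hours(2, 400, 8))  # 1.777...
```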

@Adel-Moumen
Collaborator

> @Adel-Moumen I see that the gradient accumulation factor is missing from this recipe. Could you add it? (No need for a PR imho, push directly to develop.)

Why would it be missing? By default, grad_accumulation_factor is set to 1 (see: https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/core.py#L84). The var is used in each fit_batch call (see: https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/core.py#L1199). As grad_accumulation_factor can also be set through a flag (see: https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/core.py#L422-L426), the recipe is technically not missing this feature. You just need to play with --grad_accumulation_factor=N, where N is the number of gradient accumulation steps.
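
For readers unfamiliar with the flag: gradient accumulation sums gradients over N mini-batches before each optimizer step, multiplying the effective batch size by N. A minimal, self-contained PyTorch sketch of the idea (illustrative only, not SpeechBrain's actual fit_batch, which also handles autocast, grad scaling, etc.):

```python
import torch

# Illustrative gradient accumulation loop (not SpeechBrain's code).
# With factor N, one optimizer step sees the gradients of N mini-batches,
# so the effective batch size is N times larger.
N = 2  # corresponds to --grad_accumulation_factor=2
model = torch.nn.Linear(10, 1)                      # stand-in for the wav2vec2 model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
batches = [torch.randn(4, 10) for _ in range(8)]    # stand-in mini-batches

for step, x in enumerate(batches):
    loss = model(x).pow(2).mean()  # stand-in loss
    (loss / N).backward()          # divide by N so accumulated grads are averaged
    if (step + 1) % N == 0:        # update weights only every N batches
        optimizer.step()
        optimizer.zero_grad()
```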

@GasserElbanna
Author

Hi, thanks @TParcollet for the explanation, it's clearer now.
Thanks @Adel-Moumen for pointing out the flag.

I am currently pretraining with --grad_accumulation_factor=2 and max_batch_length=400 on 8 GPUs, yielding 2 * 400 * 8 = 6400 seconds (~1.8h).

Here is the log for the first epoch:
epoch: 1, steps: 4611, lr: 7.68e-05 - train loss: 4.84e+04 - valid loss: 2.86e+03, valid accuracy: 0.26230588555336

@Adel-Moumen
Collaborator

> Hi, thanks @TParcollet for the explanation, it's clearer now. Thanks @Adel-Moumen for pointing out the flag.
>
> I am currently pretraining with --grad_accumulation_factor=2 and max_batch_length=400 on 8 GPUs, yielding 2 * 400 * 8 = 6400 seconds (~1.8h).
>
> Here is the log for the first epoch: epoch: 1, steps: 4611, lr: 7.68e-05 - train loss: 4.84e+04 - valid loss: 2.86e+03, valid accuracy: 0.26230588555336

This looks similar to our model checkpoint. Note that you have now done "only" 4611 steps during your first epoch, meaning the training will go on for much longer. I do expect that you'll get better results.

BTW, are you using --precision=fp16 for the pre-training?

@GasserElbanna
Author

> BTW, are you using --precision=fp16 for the pre-training?

I am using fp32 now.

@TParcollet
Collaborator

fp16 or bf16 would make the training much faster if you have a compatible GPU.
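
For context, here is a minimal sketch of what bf16 mixed precision looks like in plain PyTorch. This illustrates the general autocast mechanism that a --precision flag typically wraps; it is not SpeechBrain's exact implementation:

```python
import torch

# Illustrative bf16 mixed-precision training step (plain PyTorch autocast);
# a sketch of the general mechanism, not SpeechBrain's exact code.
# bf16 keeps fp32's dynamic range, so unlike fp16 it needs no GradScaler,
# but it requires a compatible GPU (e.g. Ampere-class like the A100).
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(10, 1).to(device)                   # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

x = torch.randn(4, 10, device=device)
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    loss = model(x).pow(2).mean()  # matmuls run in bf16; sensitive ops stay fp32
loss.backward()                    # backward runs outside the autocast context
optimizer.step()
optimizer.zero_grad()
```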
