Reproducibility issue with transformers (BERT) and tf2.2 #19

MFreidank opened this issue Jun 16, 2020 · 21 comments

MFreidank commented Jun 16, 2020

Dear @duncanriach,
Thank you for your contributions, work, and guidance towards making TensorFlow deterministic in the recent releases.
Unfortunately, for popular Keras NLP models (BERT), some problems seem to remain (see also the related issue in this repository, #14).

In spite of combining learnings from:

... I am still arriving at the following short, non-deterministic colab notebook example.

My results for the sum of model weights (as computed with a function you had suggested) after training for only 5 steps are shown below; note the difference between the two GPU runs after training:

| Run | Device | Before training | After training |
| --- | --- | --- | --- |
| Run 1 | GPU | -641227.5609667897224 | -641237.4425159916282 |
| Run 2 | GPU | -641227.5609667897224 | -641237.4423093758523 |
| Run 1 | CPU | -641227.5609667301178 | -641238.1506845243275 |
| Run 2 | CPU | -641227.5609667301178 | -641238.1506845243275 |

This variance becomes increasingly pronounced the longer the model is trained.

Could you please help identify the source of non-determinism and provide guidance on how we can resolve it?

As transformers is a very popular package (29.1k GitHub stars), I expect that many other people are silently impacted by this phenomenon.

Note: As shown above, I have observed that the same code becomes fully deterministic when running on the colab CPU runtime.


duncanriach commented Jun 16, 2020

Beautifully presented. Thanks, @MFreidank. I made a copy of your colab code and have been looking at it. The primary issue right now is that the trainable variables are not matching between runs:

### Before training: ###
Summary of weights: -641227.5609667897224
### Before training: ###
Summary of weights: -641227.7293046712875

I can see that you have them matching, and I don't understand why that would be different for me. Have you changed the colab code in some way since you ran it?

The second issue I see is that you're setting from_logits=True in the constructor of tf.keras.losses.SparseCategoricalCrossentropy. As your notes suggest, this argument should be excluded (or set to False).
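
For clarity, a minimal sketch of the suggested change (the `loss_fn` variable name is illustrative, not taken from the notebook):

```python
import tensorflow as tf

# What the notebook currently constructs:
# loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# As the notes suggest, exclude from_logits (it defaults to False):
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
```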


duncanriach commented Jun 16, 2020

Oh, I see. You have to restart the runtime to get the same initial trainable variables. I can hopefully provide a work-around for that too.


duncanriach commented Jun 16, 2020

So, the solution for getting the same initial trainable variables every time you run the block of code that starts with the definition of summarize_keras_weights is to call tf.random.set_seed at the beginning of that block. This will reset the pseudorandom number generator that is used to initialize the trainable variables of the model.
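
A minimal sketch of that, assuming an arbitrary seed value and with the rest of the block continuing as in the notebook:

```python
import tensorflow as tf

# Re-seed TensorFlow's PRNG at the top of the block, before the model is
# constructed, so that variable initialization is identical on every run.
tf.random.set_seed(42)  # the specific seed value is arbitrary

# def summarize_keras_weights(...):   # the notebook's block continues here
#     ...
```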


duncanriach commented Jun 16, 2020

And ... solved. By removing from_logits=True from the constructor of tf.keras.losses.SparseCategoricalCrossentropy() I was able to get the same trainable variables after both runs.

### Before training: ###
Summary of weights: -641227.5609667897224
5/5 [==============================] - 7s 1s/step - loss: 0.7225 - accuracy: 0.4000
### After training: ###
Summary of weights: -641238.1517347339541
### Before training: ###
Summary of weights: -641227.5609667897224
5/5 [==============================] - 7s 1s/step - loss: 0.7225 - accuracy: 0.4000
### After training: ###
Summary of weights: -641238.1517347339541

You were so close. If only you had coded exactly what your notes required. :-)


duncanriach commented Jun 16, 2020

Please confirm that your issue has been solved. Train your model for much longer, at least one whole epoch, and confirm that it's getting the accuracy you expect while also achieving perfect, bit-exact reproducibility.

duncanriach changed the title from "Reproducibility issue with tf2.2" to "Reproducibility issue with transformers (BERT) and tf2.2" on Jun 16, 2020

MFreidank commented Jun 17, 2020

@duncanriach Thank you! I can reproduce the resolution and things are now deterministic in the scenario above; I should have taken my own advice from the notes, based on your workaround in the tensorflow issue thread ;)

There is one issue remaining, though: changing epochs=1 to epochs=2 reintroduces non-determinism (even when keeping steps_per_epoch at only 5).
Note that training for the same 10 steps using epochs=1, steps_per_epoch=10 is deterministic.
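
For reference, a minimal sketch of the two configurations being compared (`model` and `train_dataset` stand in for the objects defined in the notebook):

```python
# Deterministic on GPU in my runs: 10 steps in a single epoch.
model.fit(train_dataset, epochs=1, steps_per_epoch=10)

# Non-deterministic on GPU in my runs: the same total number of steps,
# but split across two epochs.
model.fit(train_dataset, epochs=2, steps_per_epoch=5)
```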

Could you have a look at this? I updated my colab notebook to reflect the current state and expose the issue mentioned above.

It almost looks like Keras is doing some non-deterministic operations between epochs.
For my purposes, I may be able to simply stretch the epoch I am training for artificially (to multiple passes over the dataset) and get things running deterministically that way; I'll investigate this.
Nevertheless, I believe this warrants further investigation, and I'm happy to help in any way I can.

Update: for epochs=2, steps_per_epoch=10, I found it to be reproducible on the CPU.
So the issue must be related to something GPU-specific.

duncanriach commented:

> Could you have a look at this?

Will do.

> Almost looks like Keras is doing some non-deterministic operations in between epochs.

These between-epoch issues are common, and there are several possible sources. Let's see if we can get determinism without you needing to limit the training to one epoch ...

duncanriach commented:

Running in Colab with my old copy of your code (with the fixes), I'm now no longer seeing reproducibility over 5 steps in one epoch on GPU. This is very concerning, and I have not yet figured out what the issue is. Also, looking at your updated colab code and notes, it seems that one epoch with 10 steps on the GPU is not operating reproducibly, which does not match what you wrote above.


duncanriach commented Jun 18, 2020

Just to recap where we're at and the solutions we have:

  1. Using tf.random.set_seed, reset TensorFlow's PRNG before initializing trainable variables.
  2. Set TF_DETERMINISTIC_OPS=1 to enable all deterministic ops in the model.
  3. Replace non-deterministic fused softmax/cross-entropy with a deterministic version. You and I are also working on adding a fix for this, which will also be under the control of TF_DETERMINISTIC_OPS.
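
Concretely, the first two adjustments might look something like this in the colab code (the seed value is arbitrary; the third adjustment, the deterministic softmax/cross-entropy replacement, is a separate work in progress and is not shown):

```python
import os

# (2) Enable the deterministic op implementations in TF 2.2; set this
#     before TensorFlow does any GPU work.
os.environ["TF_DETERMINISTIC_OPS"] = "1"

import tensorflow as tf

# (1) Reset TensorFlow's PRNG before the trainable variables are initialized.
tf.random.set_seed(42)

# (3) The deterministic fused softmax/cross-entropy replacement is applied
#     separately and is not part of this sketch.
```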

With these three adjustments, there is still some non-determinism. However, rather than just being totally different on every run, the final state of the trainable variables is now one of a discrete set of values. The number of possible values seems to increase with the number of steps per epoch.

With steps_per_epoch=1 and steps_per_epoch=2, I got the same final value after several runs.

With steps_per_epoch=5, over seven runs, I got only four different results. One result was repeated three times and another was repeated twice.

What this suggests to me is that there may be some non-determinism in the interaction between the data loader (based on tf.data) and model.fit. I've seen things like this before, though not exactly like this, and nothing jumps out at me from your code that could be causing it (such as multiple data-loader workers or an unseeded shuffle).

I'll investigate more tomorrow.

MFreidank commented:

@duncanriach Thank you so much for your work and drive on this, and for the clear summary of where we stand and what we know.
I agree with all your points, but was not able to pinpoint the exact source of the problem (I tried setting workers=0 to make it run on the same thread as the main training loop, but to no avail).
Looking forward to your further investigation and happy to help from my side in any way I can.


duncanriach commented Jun 20, 2020

I've been trying different batch sizes and numbers of steps. There seems to be a non-determinism effect that kicks in with larger batch sizes, and a seemingly independent effect related to the number of steps (and/or perhaps the number of examples trained). This is reminding me of the unresolved aspects of issue 9 (for OpenNMT).

I have not yet gotten the non-determinism debug tool working with this model; that will enable me to dig in more deeply to isolate the remaining source, or sources, of non-determinism. I'm also learning more about BERT and about transformers in general.

I presume that each step of training this model runs the complete sentence example (or batch of sentence examples) through the encoder and then the decoder, then calculates the loss, and then back-propagates the gradients to the trainable variables. If we see non-determinism in the trainable variables appear on any given step, it will have been caused by that example (or the examples in that batch) interacting with the trainable variables, as they have been trained during the previous steps, via a non-deterministic op or process.

Since this is an RNN model, and a relatively complex one, there is extensive iterative munging happening (although I believe that it will be unrolled), unlike with a non-recurrent DNN. There may be different opportunities for non-determinism to be injected. There may also be the use of sparse operations (for things like sparse embeddings), some of which have been suspect for a while (but have not yet been fully investigated).

I intend to keep investigating this issue.

BTW, in a comment in the code, you mention that the data loading and preparation is deterministic. Did you confirm that? If so, how?


duncanriach commented Oct 23, 2020

@MFreidank, we (@wenscarl and I) have isolated the remaining source of nondeterminism in this model. See this comment on TensorFlow Issue 39751 for more information about the source.

We have also confirmed that this was the only remaining source of nondeterminism in the model by temporarily replacing the use of tf.gather in the huggingface/transformers BERT code with a much slower tf.linalg.matmul operation, the dense backprop output of which can be used directly to update the word embedding matrix (without the need for the currently-nondeterministic tf.convert_to_tensor). The model trained reproducibly for thousands of batches, over multiple epochs.
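
For illustration only, the temporary workaround was along these lines (a hedged sketch rather than the actual change to the huggingface/transformers code; `embedding_matrix` and `input_ids` are placeholder names, with `input_ids` assumed to be shaped [batch, seq_len]):

```python
import tensorflow as tf

def dense_embedding_lookup(embedding_matrix, input_ids):
    """Stand-in for tf.gather(embedding_matrix, input_ids) whose gradient
    with respect to embedding_matrix is a dense tensor rather than
    IndexedSlices, avoiding the currently-nondeterministic conversion path
    in backprop. Much slower than tf.gather."""
    vocab_size = tf.shape(embedding_matrix)[0]
    one_hot_ids = tf.one_hot(input_ids, depth=vocab_size,
                             dtype=embedding_matrix.dtype)
    # [batch, seq_len, vocab] @ [vocab, hidden] -> [batch, seq_len, hidden]
    return tf.linalg.matmul(one_hot_ids, embedding_matrix)
```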

We are close to releasing a patch for the TensorFlow segment sum ops which, when applied via fwd9m.tensorflow.enable_determinism, will remove this final source of nondeterminism.

duncanriach commented:

Update: @wenscarl has confirmed that the patch we are about to release (to be enabled via fwd9m.tensorflow.enable_determinism) resolves the final source of nondeterminism in this model, allowing it to train deterministically.

Zminghua commented:

@duncanriach Thank you very much for your contributions towards making TensorFlow deterministic. I am using huggingface/transformers BERT with tf2.2, and I was wondering when the patch will be released.

duncanriach commented:

Hi @Zminghua, I don't have an estimated release date for the patch, but it's relatively high priority for us. The patch will work with TensorFlow version 2.3 and earlier. A recently-discovered problem, which we're attempting to find a solution for, is that from TensorFlow version 2.4 onwards the TensorFlow API no longer exposes the mechanisms that allow for a dynamic patch to be applied from outside the distributed package. This means that we'll have to focus on getting these solutions into upstream stock TensorFlow rather than relying on the theoretically quick triage route provided by patching.


Zminghua commented Jan 6, 2021

Hi @duncanriach,

after putting the "fwd9m" sub-directory in my project directory and importing it as follows:
from fwd9m.tensorflow import enable_determinism
enable_determinism()

my code has become fully deterministic when running on GPU.

Thank you again!

duncanriach commented:

Oh, you're welcome. Right, you can just clone the code and use it, of course, rather than waiting for the PyPI release.


duncanriach commented Jan 12, 2021

Update: we have confirmed that fwd9m.tensorflow.enable_determinism, which currently includes patching of segment_sum and unsorted_segment_sum, will, in fact, work on TensorFlow 2.4.0. I don't understand why this is; it's not what I expected given what was in the version 2.4.0 release notes and the associated changes in the stock TensorFlow source code.


phqtuyen commented Jun 9, 2021

I cloned the repository and followed the instructions above, i.e.
from framework_determinism.fwd9m.tensorflow import enable_determinism
However, I get this error message:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/mnt/shared_ad2_mt1/thopham/projects/exp-1/oda-cognitive/services/train-pool/train-models/framework_determinism/fwd9m/tensorflow/enable_determinism.py", line 61, in _enable_determinism
    patch_bias_add(_silent=True)
TypeError: 'module' object is not callable

I am using Python 3.6.

Much appreciated.


duncanriach commented Jun 11, 2021

Hi @phqtuyen,

Please pull the master branch and try again.

This was a bug that only showed up with stock TensorFlow versions 1.14 through 2.0. It was fixed in the incomplete and un-merged integration-testing branch. This demonstrates the hazards involved in using unreleased (and non-regression-tested) code.

Let me know how it goes.


duncanriach commented Sep 17, 2021

This should be fixed in TF 2.7 by PR 51861. Could someone please confirm, so that this issue can be closed?
