
[tests] add torch.use_deterministic_algorithms for XPU #30774

Merged · 9 commits · May 23, 2024

Conversation

@faaany (Contributor) commented May 13, 2024

What does this PR do?

Some tests compare the numerical results of two separate model runs for equality. These tests fail on XPU because deterministic execution is not the default mode in oneDNN. Below is an example:

import torch
import intel_extension_for_pytorch
from transformers import set_seed

set_seed(20)
model = torch.nn.Linear(768, 768, bias=True).to("xpu")

hs = torch.randn(768, 768).to("xpu")

query_states = model(hs)
query_states2 = model(hs)

re = (query_states == query_states2).all().item()
print(re)

# on XPU
# False
# on CUDA and CPU
# True

Although the difference in this example is at the 1e-8 precision level, it causes some model tests to fail. This PR fixes these failing tests by adding torch.use_deterministic_algorithms(True).
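For illustration, here is a minimal sketch of the intended fix applied to the example above (it assumes the XPU backend honors the flag for this linear op, which is what the affected tests rely on):

import torch
import intel_extension_for_pytorch
from transformers import set_seed

# force deterministic kernels before the two forward passes
torch.use_deterministic_algorithms(True)

set_seed(20)
model = torch.nn.Linear(768, 768, bias=True).to("xpu")
hs = torch.randn(768, 768).to("xpu")

# with deterministic algorithms enabled, both runs should produce
# bit-identical outputs, so this should print True on XPU as well
print((model(hs) == model(hs)).all().item())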

@ArthurZucker and @amyeroberts

@amyeroberts amyeroberts requested a review from ydshieh May 13, 2024 12:30
@ydshieh (Collaborator) commented May 13, 2024

Thanks @faaany !

It's probably better to set this in a global way IMO, i.e. set it in a single place (only when we launch tests on an XPU device)? WDYT?

@faaany (Author) commented May 13, 2024

Hi @ydshieh, good point! I almost forgot about the spec.py file. I will put the call into spec.py and close this PR. Thanks for the review!

@faaany faaany closed this May 13, 2024
@faaany faaany reopened this May 13, 2024
@faaany (Author) commented May 13, 2024

Hi @ydshieh, I just finished running all tests on XPU with torch.use_deterministic_algorithms(True) in spec.py, and I am afraid I have to add it in individual tests, because adding it in a global place causes a lot of other tests to fail.

@ydshieh (Collaborator) commented May 13, 2024

@faaany

Could you share what kinds of errors we have if we add torch.use_deterministic_algorithms(True) in spec.py?

@faaany (Author) commented May 13, 2024

> @faaany Could you share what kinds of errors we have if we add torch.use_deterministic_algorithms(True) in spec.py?

Sure, there are different kinds of errors, e.g.:

RuntimeError: linearIndex.numel() * sliceSize * nElemBefore == expandedValue.numel() INTERNAL ASSERT FAILED at "/build/intel-pytorch-extension/csrc/gpu/aten/operators/Indexing.cpp":1289, please report a bug to PyTorch. number of flattened indices did not match number of elements in the value tensor: 160 vs 5
RuntimeError: grid_sampler_2d_backward_xpu does not have a deterministic implementation, but you set 'torch.use_deterministic_algorithms(True)'. You can turn off determinism just for this operation, or you can use the 'warn_only=True' option, if that's acceptable for your application. You can also file an issue at https://github.com/pytorch/pytorch/issues to help us prioritize adding deterministic support for this operation.
RuntimeError: scatter_add_dpcpp_kernel does not have a deterministic implementation, but you set 'torch.use_deterministic_algorithms(True)'. You can turn off determinism just for this operation, or you can use the 'warn_only=True' option, if that's acceptable for your application. You can also file an issue at https://github.com/pytorch/pytorch/issues to help us prioritize adding deterministic support for this operation.

@faaany (Author) commented May 13, 2024

Some errors happen in Variable._execution_engine.run_backward(), and the majority of the errors happen because shapes don't match.

@ydshieh ydshieh self-assigned this May 13, 2024
@ydshieh (Collaborator) commented May 14, 2024

Hi @faaany, I would need a bit more context (in particular, I don't have an XPU device to test on) 🙏

So I am wondering why adding torch.use_deterministic_algorithms(True) globally makes other (more) tests fail. Are such failures expected (i.e. are they known to not work with torch.use_deterministic_algorithms(True))?

From the description, IIRC, those extra tests (which fail with torch.use_deterministic_algorithms(True)) would pass if use_deterministic_algorithms is not set to True?

@faaany (Author) commented May 16, 2024

Hi @ydshieh, I tried adding torch.use_deterministic_algorithms(True) in a CUDA environment, and the tests that fail in the XPU environment also fail on CUDA with the following message:

E       RuntimeError: Deterministic behavior was enabled with either `torch.use_deterministic_algorithms(True)` or `at::Context::setDeterministicAlgorithms(true)`, but this operation is not deterministic because it uses CuBLAS and you have CUDA >= 10.2. To enable deterministic behavior in this case, you must set an environment variable before running your PyTorch application: CUBLAS_WORKSPACE_CONFIG=:4096:8 or CUBLAS_WORKSPACE_CONFIG=:16:8. For more information, go to https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility

e.g.

pytest tests/models/tapas/test_modeling_tapas.py::TapasModelTest::test_problem_types -rA
pytest tests/models/longformer/test_modeling_longformer.py::LongformerModelTest::test_for_question_answering -rA
pytest tests/models/encoder_decoder/test_modeling_encoder_decoder.py::RoBertaEncoderDecoderModelTest -rA

And my spec.py is like this:

import torch

DEVICE_NAME = "cuda"
# enable deterministic algorithms globally, for every test in the run
torch.use_deterministic_algorithms(True)
MANUAL_SEED_FN = torch.cuda.manual_seed
EMPTY_CACHE_FN = torch.cuda.empty_cache
DEVICE_COUNT_FN = torch.cuda.device_count
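(As an aside, a hypothetical variant of this spec.py could also set the workspace config that the cuBLAS error above asks for; this is only a sketch and assumes the environment variable is set before any CUDA work starts:)

import os

# per the error message above, cuBLAS needs this set before it is first used
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

import torch

DEVICE_NAME = "cuda"
torch.use_deterministic_algorithms(True)
MANUAL_SEED_FN = torch.cuda.manual_seed
EMPTY_CACHE_FN = torch.cuda.empty_cache
DEVICE_COUNT_FN = torch.cuda.device_count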

The only difference is that the error messages on CUDA consistently say that the operation is not deterministic and that users should not set deterministic mode to True, while on XPU oneDNN also names the concrete operations that are not supported.

Since it is up to the libraries (cuDNN and oneDNN in our case) whether a certain operation is deterministic by default, and it is not OK to add this flag in a global place, I think the fix in this PR should be fine. WDYT?

@ydshieh (Collaborator) commented May 16, 2024

Hi @faaany

Thanks for sharing. I am open to what is proposed in this PR, but I have one more question. If we do

        if "xpu" in torch_device:
            torch.use_deterministic_algorithms(True)

in a test method, it actually changes torch's global state from that point on. All the tests that run after that one will have deterministic mode set to True and should hit the same issues as if we had set it globally in the first place.

We might not see those failures purely out of luck (depending on the order in which tests are run), but there is no guarantee.

Since this is currently only XPU related, it's fine from my side if you want to move forward with this change. The above is just FYI (but I might be wrong).

However, I would prefer a decorator like require_deterministic_for_xpu that does

         if "xpu" in torch_device:
            torch.use_deterministic_algorithms(True)

and we just use the decorator on the methods that require it.
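A minimal sketch of what such a decorator could look like (torch_device comes from transformers.testing_utils; this is not the implementation that was merged, just an illustration of the idea):

import torch
from functools import wraps
from transformers.testing_utils import torch_device

def require_deterministic_for_xpu(test_case):
    # enable deterministic algorithms only when the tests target an XPU device;
    # note this flips a global torch setting, so it also affects any tests that
    # run afterwards in the same process (the concern raised above)
    @wraps(test_case)
    def wrapper(*args, **kwargs):
        if "xpu" in torch_device:
            torch.use_deterministic_algorithms(True)
        return test_case(*args, **kwargs)
    return wrapper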

@faaany (Author) commented May 16, 2024

Hi @ydshieh, you are right! This function has a global impact: if I put it in a test method, the tests that run afterwards will be affected as well.

Furthermore, I found that torch.use_deterministic_algorithms() has a flag called warn_only. So if I use torch.use_deterministic_algorithms(mode=True, warn_only=True) on CUDA, the failing tests I mentioned earlier give a warning instead of failing. Unfortunately, this flag doesn't work well on XPU. I have reported the issue to the corresponding team and hope it will be fixed.
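(For reference, a minimal sketch of the warn_only variant mentioned above; whether a given backend honors it is up to that backend:)

import torch

# report non-deterministic ops as warnings instead of raising RuntimeError
torch.use_deterministic_algorithms(True, warn_only=True)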

However, back to this PR: because of that large global impact, I think require_deterministic_for_xpu is still a better idea than applying the function to all tests. So I updated my fix accordingly. Please have a look.

@faaany (Author) commented May 16, 2024

Some tests failed, but I think this is not related to my fix.

@ydshieh (Collaborator) commented May 16, 2024

Hi, just to double check: do you intend to skip those tests for XPU? If so, I am fine with the changes.

@ydshieh (Collaborator) commented May 16, 2024

I mean, you can still try to set it to deterministic for XPU (in the decorator body) and use it (even though you might get some failures at some point, as I mentioned). You might get lucky 😆

But if skip is an option for you, OK for me too!

@faaany (Author) commented May 17, 2024

Yes, I think skip is the better option here, because fewer than 10 out of 13508 tests are affected by the deterministic algorithms requirement. I can run them separately, so the other tests won't be affected. 😊
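A sketch of the skip-based decorator this implies (names and exact behavior are illustrative; the merged implementation in transformers.testing_utils may differ):

import unittest

import torch
from transformers.testing_utils import torch_device

def require_deterministic_for_xpu(test_case):
    # on XPU, only run the test if deterministic algorithms are already enabled
    # (e.g. in a separate, dedicated run); otherwise skip it; on other devices,
    # run the test unchanged
    if torch_device is not None and "xpu" in torch_device:
        return unittest.skipUnless(
            torch.are_deterministic_algorithms_enabled(),
            "test requires deterministic algorithms on XPU",
        )(test_case)
    return test_case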

@ydshieh (Collaborator) commented May 21, 2024

Hi @faaany

I still have 2 questions.

@ydshieh (Collaborator) left a comment

Thanks again @faaany. LGTM.

@ydshieh ydshieh requested a review from ArthurZucker May 21, 2024 13:45
@faaany (Author) commented May 23, 2024

I think the failing tests are not caused by my fix. Could you retrigger the CI? Thx! @ArthurZucker

@ydshieh (Collaborator) commented May 23, 2024

Try to rebase on main and push again 🤗

@amyeroberts (Collaborator) left a comment

@ydshieh @faaany Thanks both for iterating on this to find a solution, and thanks again @faaany for another improvement to our test suite!

@amyeroberts amyeroberts merged commit 21339a5 into huggingface:main May 23, 2024
20 checks passed
itazap pushed a commit that referenced this pull request May 24, 2024
* add xpu check

* add marker

* add documentation

* update doc

* fix ci

* remove from global init

* fix