
[tests] add torch.use_deterministic_algorithms for XPU #30774

Merged · 9 commits · May 23, 2024

Conversation

@faaany (Contributor) commented May 13, 2024

What does this PR do?

Some tests compare the numerical results of two separate model runs for equality. These tests fail on XPU because deterministic execution is not the default mode in oneDNN. Below is an example:

import torch
import intel_extension_for_pytorch
from transformers import set_seed

set_seed(20)
model = torch.nn.Linear(768, 768, bias=True).to("xpu")

hs = torch.randn(768, 768).to("xpu")

query_states = model(hs)
query_states2 = model(hs)

re = (query_states == query_states2).all().item()
print(re)

# on XPU
# False
# on CUDA and CPU
# True

Although the difference in this example is at the 1e-8 precision level, it causes some model tests to fail. This PR fixes these failing tests by adding torch.use_deterministic_algorithms(True).
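For illustration, here is a minimal sketch of the intended fix applied to the example above (it assumes the XPU backend honors the flag for this linear op, which is what the affected tests rely on):

import torch
import intel_extension_for_pytorch
from transformers import set_seed

# force deterministic kernels before the two forward passes
torch.use_deterministic_algorithms(True)

set_seed(20)
model = torch.nn.Linear(768, 768, bias=True).to("xpu")
hs = torch.randn(768, 768).to("xpu")

# with deterministic algorithms enabled, both runs should produce
# bit-identical outputs, so this should print True on XPU as well
print((model(hs) == model(hs)).all().item())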

@ArthurZucker and @amyeroberts

@amyeroberts amyeroberts requested a review from ydshieh May 13, 2024 12:30
@ydshieh (Collaborator) commented May 13, 2024

Thanks @faaany !

It's probably better to set this in a global way IMO, i.e. set it in a single place (only when we launch tests on an XPU device)? WDYT?

@faaany (Author) commented May 13, 2024

Hi @ydshieh, good point! I almost forgot about the spec.py file. I will put the call into spec.py and close this PR. Thanks for the review!

@faaany faaany closed this May 13, 2024
@faaany faaany reopened this May 13, 2024
@faaany (Author) commented May 13, 2024

Hi @ydshieh, I just finished running all tests on XPU with torch.use_deterministic_algorithms(True) in spec.py, and I am afraid I have to add it in individual tests, because adding it in a global place causes a lot of other tests to fail.

@ydshieh (Collaborator) commented May 13, 2024

@faaany

Could you share what kinds of errors we have if we add torch.use_deterministic_algorithms(True) in spec.py?

@faaany (Author) commented May 13, 2024

> @faaany Could you share what kinds of errors we have if we add torch.use_deterministic_algorithms(True) in spec.py?

Sure, there are different kinds of errors, e.g.:

RuntimeError: linearIndex.numel() * sliceSize * nElemBefore == expandedValue.numel() INTERNAL ASSERT FAILED at "/build/intel-pytorch-extension/csrc/gpu/aten/operators/Indexing.cpp":1289, please report a bug to PyTorch. number of flattened indices did not match number of elements in the value tensor: 160 vs 5
RuntimeError: grid_sampler_2d_backward_xpu does not have a deterministic implementation, but you set 'torch.use_deterministic_algorithms(True)'. You can turn off determinism just for this operation, or you can use the 'warn_only=True' option, if that's acceptable for your application. You can also file an issue at https://github.com/pytorch/pytorch/issues to help us prioritize adding deterministic support for this operation.
RuntimeError: scatter_add_dpcpp_kernel does not have a deterministic implementation, but you set 'torch.use_deterministic_algorithms(True)'. You can turn off determinism just for this operation, or you can use the 'warn_only=True' option, if that's acceptable for your application. You can also file an issue at https://github.com/pytorch/pytorch/issues to help us prioritize adding deterministic support for this operation.

@faaany (Author) commented May 13, 2024

Some errors happen in Variable._execution_engine.run_backward(), and the majority of the errors happen because shapes don't match.

@ydshieh ydshieh self-assigned this May 13, 2024
@ydshieh (Collaborator) commented May 14, 2024

Hi @faaany, I would need a bit more context (in particular, I don't have an XPU device to test on) 🙏

So I am wondering why adding torch.use_deterministic_algorithms(True) globally makes other (more) tests fail. Are such failures expected (i.e. are they known to not work with torch.use_deterministic_algorithms(True))?

From the description, IIRC, those extra tests (which fail with torch.use_deterministic_algorithms(True)) would pass if use_deterministic_algorithms is not set to True?

@faaany (Author) commented May 16, 2024

Hi @ydshieh, I tried adding torch.use_deterministic_algorithms(True) in a CUDA environment, and the tests that fail in the XPU environment also fail on CUDA with the following message:

E       RuntimeError: Deterministic behavior was enabled with either `torch.use_deterministic_algorithms(True)` or `at::Context::setDeterministicAlgorithms(true)`, but this operation is not deterministic because it uses CuBLAS and you have CUDA >= 10.2. To enable deterministic behavior in this case, you must set an environment variable before running your PyTorch application: CUBLAS_WORKSPACE_CONFIG=:4096:8 or CUBLAS_WORKSPACE_CONFIG=:16:8. For more information, go to https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility

e.g.

pytest tests/models/tapas/test_modeling_tapas.py::TapasModelTest::test_problem_types -rA
pytest tests/models/longformer/test_modeling_longformer.py::LongformerModelTest::test_for_question_answering -rA
pytest tests/models/encoder_decoder/test_modeling_encoder_decoder.py::RoBertaEncoderDecoderModelTest -rA

And my spec.py is like this:

import torch

DEVICE_NAME = "cuda"
# enable deterministic algorithms globally, for every test in the run
torch.use_deterministic_algorithms(True)
MANUAL_SEED_FN = torch.cuda.manual_seed
EMPTY_CACHE_FN = torch.cuda.empty_cache
DEVICE_COUNT_FN = torch.cuda.device_count
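(As an aside, a hypothetical variant of this spec.py could also set the workspace config that the cuBLAS error above asks for; this is only a sketch and assumes the environment variable is set before any CUDA work starts:)

import os

# per the error message above, cuBLAS needs this set before it is first used
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

import torch

DEVICE_NAME = "cuda"
torch.use_deterministic_algorithms(True)
MANUAL_SEED_FN = torch.cuda.manual_seed
EMPTY_CACHE_FN = torch.cuda.empty_cache
DEVICE_COUNT_FN = torch.cuda.device_count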

The only difference is that the error messages on CUDA consistently say that the operation is not deterministic and that users should not set deterministic mode to True, while on XPU oneDNN also names the concrete operations that are not supported.

Since it is up to the libraries (cuDNN and oneDNN in our case) whether a certain operation is deterministic by default, and it is not OK to add this flag in a global place, I think the fix in this PR should be fine. WDYT?

@ydshieh (Collaborator) commented May 16, 2024

Hi @faaany

Thanks for sharing. I am open to what is proposed in this PR, but I have one more question. If we do

        if "xpu" in torch_device:
            torch.use_deterministic_algorithms(True)

in a test method, it actually changes torch's global state from that point on. All the tests that run after that one will have deterministic mode set to True and should hit the same issues as if we had set it globally in the first place.

We might not see those failures purely out of luck (depending on the order in which tests are run), but there is no guarantee.

Since this is currently only XPU related, it's fine from my side if you want to move forward with this change. The above is just FYI (but I might be wrong).

However, I would prefer a decorator like require_deterministic_for_xpu that does

         if "xpu" in torch_device:
            torch.use_deterministic_algorithms(True)

and we just use the decorator on the methods that require it.
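A minimal sketch of what such a decorator could look like (torch_device comes from transformers.testing_utils; this is not the implementation that was merged, just an illustration of the idea):

import torch
from functools import wraps
from transformers.testing_utils import torch_device

def require_deterministic_for_xpu(test_case):
    # enable deterministic algorithms only when the tests target an XPU device;
    # note this flips a global torch setting, so it also affects any tests that
    # run afterwards in the same process (the concern raised above)
    @wraps(test_case)
    def wrapper(*args, **kwargs):
        if "xpu" in torch_device:
            torch.use_deterministic_algorithms(True)
        return test_case(*args, **kwargs)
    return wrapper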

@faaany (Author) commented May 16, 2024

Hi @ydshieh, you are right! This function has a global impact: if I put it in a test method, the tests that run afterwards will be affected as well.

Furthermore, I found that torch.use_deterministic_algorithms() has a flag called warn_only. So if I use torch.use_deterministic_algorithms(mode=True, warn_only=True) on CUDA, the failing tests I mentioned earlier give a warning instead of failing. Unfortunately, this flag doesn't work well on XPU. I have reported the issue to the corresponding team and hope it will be fixed.
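(For reference, a minimal sketch of the warn_only variant mentioned above; whether a given backend honors it is up to that backend:)

import torch

# report non-deterministic ops as warnings instead of raising RuntimeError
torch.use_deterministic_algorithms(True, warn_only=True)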

However, back to this PR: because of that large global impact, I think require_deterministic_for_xpu is still a better idea than applying the function to all tests. So I updated my fix accordingly. Please have a look.

@faaany (Author) commented May 16, 2024

Some tests failed, but I think this is not related to my fix.

@ydshieh (Collaborator) commented May 16, 2024

Hi, just to double check: do you intend to skip those tests for XPU? If so, I am fine with the changes.

@ydshieh (Collaborator) commented May 16, 2024

I mean, you can still try to set it to deterministic for XPU (in the decorator body) and use it (even though you might get some failures at some point, as I mentioned). You might get lucky 😆

But if skip is an option for you, OK for me too!

@faaany (Author) commented May 17, 2024

Yes, I think skip is the better option here, because fewer than 10 out of 13508 tests are affected by the deterministic algorithms requirement. I can run them separately, so the other tests won't be affected. 😊
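A sketch of the skip-based decorator this implies (names and exact behavior are illustrative; the merged implementation in transformers.testing_utils may differ):

import unittest

import torch
from transformers.testing_utils import torch_device

def require_deterministic_for_xpu(test_case):
    # on XPU, only run the test if deterministic algorithms are already enabled
    # (e.g. in a separate, dedicated run); otherwise skip it; on other devices,
    # run the test unchanged
    if torch_device is not None and "xpu" in torch_device:
        return unittest.skipUnless(
            torch.are_deterministic_algorithms_enabled(),
            "test requires deterministic algorithms on XPU",
        )(test_case)
    return test_case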

@ydshieh (Collaborator) commented May 21, 2024

Hi @faaany

I still have 2 questions.

@ydshieh (Collaborator) left a comment

Thanks again @faaany. LGTM.

@ydshieh ydshieh requested a review from ArthurZucker May 21, 2024 13:45
@faaany (Author) commented May 23, 2024

I think the failing tests are not caused by my fix. Could you retrigger the CI? Thx! @ArthurZucker

@ydshieh (Collaborator) commented May 23, 2024

Try to rebase on main and push again 🤗

@amyeroberts (Collaborator) left a comment

@ydshieh @faaany Thanks both for iterating on this to find a solution, and thanks again @faaany for another improvement to our test suite!

@amyeroberts amyeroberts merged commit 21339a5 into huggingface:main May 23, 2024
20 checks passed
itazap pushed a commit that referenced this pull request May 24, 2024
* add xpu check

* add marker

* add documentation

* update doc

* fix ci

* remove from global init

* fix