[tests] add torch.use_deterministic_algorithms for XPU #30774
Conversation
Thanks @faaany! It's probably better to set this in a global way IMO, i.e. set it in a single place (only when we launch tests on an XPU device)? WDYT?
Hi @ydshieh, that's correct! I almost forgot the …
Hi @ydshieh, I just finished running all tests on XPU with …
Could you share what kinds of errors we have if we add …?
Sure, different kinds of errors, e.g.:

```
RuntimeError: linearIndex.numel() * sliceSize * nElemBefore == expandedValue.numel() INTERNAL ASSERT FAILED at "/build/intel-pytorch-extension/csrc/gpu/aten/operators/Indexing.cpp":1289, please report a bug to PyTorch. number of flattened indices did not match number of elements in the value tensor: 160 vs 5

RuntimeError: grid_sampler_2d_backward_xpu does not have a deterministic implementation, but you set 'torch.use_deterministic_algorithms(True)'. You can turn off determinism just for this operation, or you can use the 'warn_only=True' option, if that's acceptable for your application. You can also file an issue at https://github.com/pytorch/pytorch/issues to help us prioritize adding deterministic support for this operation.

RuntimeError: scatter_add_dpcpp_kernel does not have a deterministic implementation, but you set 'torch.use_deterministic_algorithms(True)'. You can turn off determinism just for this operation, or you can use the 'warn_only=True' option, if that's acceptable for your application. You can also file an issue at https://github.com/pytorch/pytorch/issues to help us prioritize adding deterministic support for this operation.
```
Some errors happen at `Variable._execution_engine.run_backward()`, and the majority happen because shapes don't match.
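Two of the RuntimeErrors quoted above explicitly point at the `warn_only=True` escape hatch. As a minimal sketch (assuming a reasonably recent PyTorch where `warn_only` is supported), the hard failures on ops without a deterministic implementation can be downgraded to warnings like this:

```python
import torch

# Sketch only: request deterministic algorithms, but ask torch to emit a
# warning (instead of raising a RuntimeError) for ops that have no
# deterministic implementation on the current device.
torch.use_deterministic_algorithms(True, warn_only=True)

# Both pieces of global state can be inspected afterwards.
deterministic = torch.are_deterministic_algorithms_enabled()
warn_only = torch.is_deterministic_algorithms_warn_only_enabled()
```

Note that this is still a global, process-wide setting, so it has the same state-leak caveat discussed later in the thread.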
Hi @faaany, I would need a bit more context (in particular, I don't have an XPU device to test on) 🙏 So I am wondering why adding … From the description, IIRC, those extra tests (failed with …
Hi @ydshieh, I tried to add … and got:

```
E RuntimeError: Deterministic behavior was enabled with either `torch.use_deterministic_algorithms(True)` or `at::Context::setDeterministicAlgorithms(true)`, but this operation is not deterministic because it uses CuBLAS and you have CUDA >= 10.2. To enable deterministic behavior in this case, you must set an environment variable before running your PyTorch application: CUBLAS_WORKSPACE_CONFIG=:4096:8 or CUBLAS_WORKSPACE_CONFIG=:16:8. For more information, go to https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility
```

e.g.

```
pytest tests/models/tapas/test_modeling_tapas.py::TapasModelTest::test_problem_types -rA
pytest tests/models/longformer/test_modeling_longformer.py::LongformerModelTest::test_for_question_answering -rA
pytest tests/models/encoder_decoder/test_modeling_encoder_decoder.py::RoBertaEncoderDecoderModelTest -rA
```

And my … is:

```python
import torch

DEVICE_NAME = "cuda"
torch.use_deterministic_algorithms(True)
MANUAL_SEED_FN = torch.cuda.manual_seed
EMPTY_CACHE_FN = torch.cuda.empty_cache
DEVICE_COUNT_FN = torch.cuda.device_count
```

The only difference is that the error messages on CUDA consistently say that the operation is not deterministic and that users should not set deterministic=True, while on XPU oneDNN also names the concrete operations that are not supported. Since it is up to the libraries (cuDNN and oneDNN in our case) whether certain operations are deterministic by default, and it is not OK to add this flag in a global place, I think the fix in this PR should be fine. WDYT?
Hi @faaany, thanks for sharing. I am open to what is proposed in this PR, but I have one more question. If we do

```python
if "xpu" in torch_device:
    torch.use_deterministic_algorithms(True)
```

in a test method, it actually changes torch's property once and for all: all the subsequent tests run after that one will always have … We might not see those failures due to luck (depending on the order in which the tests are run), but there is no guarantee. Since this is currently only XPU related, it's fine from my side if you want to move forward with this change. The above is just FYI (but I might be wrong). However, I would prefer to use a decorator like …, and to use that decorator only on the methods that require it.
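The save/restore decorator pattern suggested here can be sketched independently of torch. In the sketch below, the module-level flag and the two helper functions are stand-ins that mirror the names of torch's `use_deterministic_algorithms` / `are_deterministic_algorithms_enabled` API, so the example stays runnable without a GPU or torch install:

```python
import functools

# Stand-in for torch's global deterministic-algorithms flag.
_deterministic = False

def use_deterministic_algorithms(mode: bool) -> None:
    """Stand-in for torch.use_deterministic_algorithms."""
    global _deterministic
    _deterministic = mode

def are_deterministic_algorithms_enabled() -> bool:
    """Stand-in for torch.are_deterministic_algorithms_enabled."""
    return _deterministic

def require_deterministic(fn):
    """Enable deterministic mode only while the wrapped test runs,
    then restore the previous global state (even on failure)."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        previous = are_deterministic_algorithms_enabled()
        use_deterministic_algorithms(True)
        try:
            return fn(*args, **kwargs)
        finally:
            use_deterministic_algorithms(previous)
    return wrapper

@require_deterministic
def some_test():
    # Inside the decorated test, deterministic mode is on.
    return are_deterministic_algorithms_enabled()
```

The `try`/`finally` is the important part: it prevents the global flag from leaking into subsequent tests, which is exactly the ordering hazard described above.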
Hi @ydshieh, you are right! This function has a global impact: if I put it in a test method, the following tests will be affected. Furthermore, I found that the function … However, back to our PR, I think …
There are some failing tests, but I don't think they are related to my fix.
Hi. Just to double-check: do you intend to skip those tests for XPU? If so, I am fine with the changes.
I mean, you can still try to set it to deterministic for XPU (in the decorator body) and use it (although you might get some failures at some point, as I mentioned). You might get lucky 😆 But if skip is an option for you, OK for me too!
Yes, I think skip is the better option here, because fewer than 10 out of 13508 tests are affected by the deterministic algorithms. I can run them separately, so other tests won't be affected. 😊
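The skip approach could look like the following sketch. Here `torch_device` and the marker name `skip_nondeterministic_on_xpu` are hypothetical stand-ins, not the actual helpers in `transformers.testing_utils`:

```python
import unittest

# Hypothetical stand-in for transformers.testing_utils.torch_device.
torch_device = "xpu"

def skip_nondeterministic_on_xpu(reason="op has no deterministic XPU implementation"):
    """Hypothetical marker: skip tests that need deterministic ops on XPU."""
    return unittest.skipIf("xpu" in torch_device, reason)

class ExampleTest(unittest.TestCase):
    @skip_nondeterministic_on_xpu()
    def test_needs_determinism(self):
        # Would compare two model runs; never reached on an XPU device.
        self.fail("should have been skipped on XPU")

# Run the suite programmatically to observe the skip.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExampleTest)
result = unittest.TestResult()
suite.run(result)
```

On a non-XPU device the condition is false and the test runs normally, so only the handful of affected tests are excluded, and only on XPU.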
Hi @faaany, I still have 2 questions.
Thanks again @faaany. LGTM.
I think the failing tests are not caused by my fix. Could you retrigger the CI? Thx! @ArthurZucker
Try to rebase on main and push again 🤗
* add xpu check
* add marker
* add documentation
* update doc
* fix ci
* remove from global init
* fix
What does this PR do?
There are tests that compare the numerical difference of inference results between two separate model runs. These tests fail on XPU because deterministic mode is not the default in oneDNN. Below is an example:
Although the difference in this example is at the 1e-8 precision level, it causes some model tests to fail. This PR fixes these failing tests by adding `torch.use_deterministic_algorithms(True)`.

@ArthurZucker and @amyeroberts