[Traceable FSDP][Compiled Autograd] Add queue_callback() support #126366

yf225 · 2024-05-16T01:51:12Z

Adds support for Variable._execution_engine.queue_callback(), which is used in FSDP2.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang

pytorch-bot · 2024-05-16T01:51:15Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126366

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❌ 16 New Failures, 4 Unrelated Failures

As of commit 81aa586 with merge base a0429c0 ():

NEW FAILURES - The following jobs have failed:

inductor-periodic / cuda12.1-py3.10-gcc9-sm86-periodic-dynamo-benchmarks / test (aot_eager_torchbench, 1, 2, linux.g5.4xlarge.nvidia.gpu) (gh)
hf_BigBird
inductor-periodic / cuda12.1-py3.10-gcc9-sm86-periodic-dynamo-benchmarks / test (dynamic_aot_eager_torchbench, 1, 2, linux.g5.4xlarge.nvidia.gpu) (gh)
hf_BigBird
inductor-periodic / cuda12.1-py3.10-gcc9-sm86-periodic-dynamo-benchmarks / test (dynamo_eager_torchbench, 1, 2, linux.g5.4xlarge.nvidia.gpu) (gh)
hf_BigBird
Lint / lintrunner-noclang / linux-job (gh)
>>> Lint for torch/_dynamo/variables/misc.py:
pull / linux-focal-cuda12.1-py3.10-gcc9 / test (default, 1, 5, linux.4xlarge.nvidia.gpu) (gh)
inductor/test_distributed_patterns.py::DistributedPatternTests::test_module_backward_hooks_aot
pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 1, 5, linux.g5.4xlarge.nvidia.gpu) (gh)
inductor/test_distributed_patterns.py::DistributedPatternTests::test_module_backward_hooks_aot
pull / linux-focal-cuda12.4-py3.10-gcc9 / build (gh)
/var/lib/jenkins/workspace/aten/src/ATen/cuda/CUDASparseDescriptors.h:119:68: error: ‘cusparseStatus_t cusparseCreateBsrsm2Info(bsrsm2Info**)’ is deprecated: The routine will be removed in the next major release [-Werror=deprecated-declarations]
pull / linux-focal-py3.11-clang10 / test (default, 1, 3, linux.2xlarge) (gh)
inductor/test_distributed_patterns.py::DistributedPatternTests::test_module_backward_hooks_aot
pull / linux-focal-py3.11-clang10 / test (dynamo, 3, 3, linux.2xlarge) (gh)
test_autograd.py::TestAutograd::test_reentrant_with_callbacks_both_depths
pull / linux-focal-py3.12-clang10 / test (default, 1, 3, linux.2xlarge) (gh)
inductor/test_distributed_patterns.py::DistributedPatternTests::test_module_backward_hooks_aot
pull / linux-focal-py3.12-clang10 / test (dynamo, 3, 3, linux.2xlarge) (gh)
test_autograd.py::TestAutograd::test_reentrant_with_callbacks_both_depths
pull / linux-focal-py3.8-clang10 / test (default, 1, 3, linux.2xlarge) (gh)
inductor/test_distributed_patterns.py::DistributedPatternTests::test_module_backward_hooks_aot
pull / linux-focal-py3.8-clang10 / test (dynamo, 3, 3, linux.2xlarge) (gh)
test_autograd.py::TestAutograd::test_reentrant_with_callbacks_both_depths
pull / linux-jammy-py3.10-clang15-asan / test (default, 1, 6, linux.4xlarge) (gh)
inductor/test_distributed_patterns.py::DistributedPatternTests::test_module_backward_hooks_aot
pull / linux-jammy-py3.8-gcc11 / test (default, 1, 3, linux.2xlarge) (gh)
inductor/test_distributed_patterns.py::DistributedPatternTests::test_module_backward_hooks_aot
pull / win-vs2019-cpu-py3 / build (gh)
sccache: error: couldn't connect to server

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

pull / linux-focal-cuda12.4-py3.10-gcc9-sm86 / build (gh) (#127104)
/var/lib/jenkins/workspace/aten/src/ATen/cuda/CUDASparseDescriptors.h:119:68: error: ‘cusparseStatus_t cusparseCreateBsrsm2Info(bsrsm2Info**)’ is deprecated: The routine will be removed in the next major release [-Werror=deprecated-declarations]

This comment was automatically generated by Dr. CI and updates every 15 minutes.

torch/_dynamo/external_utils.py

xmfan

do we need these final callbacks in the graph? or just need them to execute at the end. In the second case, we can execute them on the c++ side just like eager does

torch/csrc/autograd/python_engine.cpp

yf225 · 2024-05-17T21:12:21Z

do we need these final callbacks in the graph? or just need them to execute at the end. In the second case, we can execute them on the c++ side just like eager does

yes we need Dynamo to trace through those final callbacks, so they need to be in the graph

…fsdp_queue_callback

xmfan

I think we can avoid the globals on OutputGraph if we construct the CompiledAutogradEngine within the graph

Would be good to double check:

Graph break between queue_callback and exec_final_callbacks
Memory check for tensors involved the callback

torch/_dynamo/output_graph.py

torch/_dynamo/external_utils.py

torch/_dynamo/output_graph.py

test/test_autograd.py

torch/_dynamo/external_utils.py

[Traceable FSDP][Compiled Autograd] Add queue_callback() support

6a7dcb2

yf225 requested review from jansel and xmfan May 16, 2024 01:51

pytorch-bot bot added ciflow/inductor module: dynamo module: inductor labels May 16, 2024

yf225 added 2 commits May 15, 2024 19:35

fix

6572c9a

lint

64ecf14

jansel requested changes May 16, 2024

View reviewed changes

torch/_dynamo/external_utils.py Outdated Show resolved Hide resolved

more exp on queue_callback

0f82b63

yf225 requested review from albanD and soulitzer as code owners May 17, 2024 06:52

pytorch-bot bot added the release notes: fx release notes category label May 17, 2024

yf225 added 2 commits May 16, 2024 23:59

clean up

6936fa4

clean

a33b64a

yf225 requested a review from jansel May 17, 2024 07:02

xmfan reviewed May 17, 2024

View reviewed changes

torch/csrc/autograd/python_engine.cpp Outdated Show resolved Hide resolved

yf225 added 7 commits May 17, 2024 13:53

add CompiledAutogradEngine to external_utils.py

e07a665

clean

5c579b1

remove node.py changes:

967d605

simplify

55cbe1b

improve

4e3482e

lint

9213f34

comment

5f92a7a

yf225 requested a review from xmfan May 17, 2024 21:12

Merge branch 'main' of https://github.com/yf225/pytorch into compile_…

686901a

…fsdp_queue_callback

xmfan approved these changes May 18, 2024

View reviewed changes

xmfan reviewed May 18, 2024

View reviewed changes

torch/_dynamo/output_graph.py Outdated Show resolved Hide resolved

xmfan reviewed May 18, 2024

View reviewed changes

torch/_dynamo/external_utils.py Outdated Show resolved Hide resolved

jansel requested changes May 18, 2024

View reviewed changes

torch/_dynamo/output_graph.py Outdated Show resolved Hide resolved

albanD reviewed May 20, 2024

View reviewed changes

test/test_autograd.py Outdated Show resolved Hide resolved

torch/_dynamo/external_utils.py Outdated Show resolved Hide resolved

yf225 mentioned this pull request May 28, 2024

[DO NOT REVIEW][NOT USED] Add queue_callback support #127244

Open

yf225 added 5 commits May 28, 2024 13:04

rename to FakeCompiledAutogradEngine

e6195ef

remove list attr

54be9ba

make adding more callbacks from within a callback work

3d58b24

remove unneeded test

cc1713d

move to side_effects

81aa586

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Traceable FSDP][Compiled Autograd] Add queue_callback() support #126366

[Traceable FSDP][Compiled Autograd] Add queue_callback() support #126366

yf225 commented May 16, 2024 •

edited by pytorch-bot bot

pytorch-bot bot commented May 16, 2024 •

edited

xmfan left a comment

yf225 commented May 17, 2024

xmfan left a comment

[Traceable FSDP][Compiled Autograd] Add queue_callback() support #126366

Are you sure you want to change the base?

[Traceable FSDP][Compiled Autograd] Add queue_callback() support #126366

Conversation

yf225 commented May 16, 2024 • edited by pytorch-bot bot

pytorch-bot bot commented May 16, 2024 • edited

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126366

❌ 16 New Failures, 4 Unrelated Failures

xmfan left a comment

Choose a reason for hiding this comment

yf225 commented May 17, 2024

xmfan left a comment

Choose a reason for hiding this comment

yf225 commented May 16, 2024 •

edited by pytorch-bot bot

pytorch-bot bot commented May 16, 2024 •

edited