
[Traceable FSDP][Compiled Autograd] Add queue_callback() support #126366

Open · wants to merge 19 commits into base: main
Conversation

@yf225 (Contributor) commented · May 16, 2024

@yf225 requested review from jansel and xmfan · May 16, 2024 01:51

@pytorch-bot (bot) commented · May 16, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126366

Note: Links to docs will display an error until the docs builds have been completed.

❌ 16 New Failures, 4 Unrelated Failures

As of commit 81aa586 with merge base a0429c0:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

  • pull / linux-focal-cuda12.4-py3.10-gcc9-sm86 / build (gh) (#127104)
    /var/lib/jenkins/workspace/aten/src/ATen/cuda/CUDASparseDescriptors.h:119:68: error: ‘cusparseStatus_t cusparseCreateBsrsm2Info(bsrsm2Info**)’ is deprecated: The routine will be removed in the next major release [-Werror=deprecated-declarations]

This comment was automatically generated by Dr. CI and updates every 15 minutes.

torch/_dynamo/external_utils.py — review thread (outdated, resolved)
@pytorch-bot added the release notes: fx label (release notes category) · May 17, 2024
@yf225 requested a review from jansel · May 17, 2024 07:02
@xmfan (Member) left a comment

Do we need these final callbacks in the graph, or do we just need them to execute at the end? In the second case, we could execute them on the C++ side just like eager does.

torch/csrc/autograd/python_engine.cpp — review thread (outdated, resolved)
@yf225 (Contributor, Author) commented · May 17, 2024

> Do we need these final callbacks in the graph, or do we just need them to execute at the end? In the second case, we could execute them on the C++ side just like eager does.

Yes, we need Dynamo to trace through those final callbacks, so they need to be in the graph.
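For context, here is a minimal eager-mode sketch of the pattern this PR makes traceable (illustrative only, not code from the diff); under compiled autograd the goal is for Dynamo to trace the queued callback into the graph rather than running it opaquely in C++:

```python
# Minimal eager-mode sketch (not PR code): queue_callback() registers a
# "final callback" that the autograd engine runs once the current backward
# pass finishes. FSDP relies on this to run post-backward work after all
# gradients are ready.
import torch
from torch.autograd import Variable

def hook(grad):
    # queue_callback() must be called while a backward pass is in flight,
    # e.g. from inside a tensor hook.
    Variable._execution_engine.queue_callback(lambda: print("backward done"))
    return grad

x = torch.randn(3, requires_grad=True)
x.register_hook(hook)
x.sum().backward()  # prints "backward done" after the whole pass completes
```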

@yf225 requested a review from xmfan · May 17, 2024 21:12
@xmfan (Member) left a comment

I think we can avoid the globals on OutputGraph if we construct the CompiledAutogradEngine within the graph.

Would be good to double check:

  • Graph break between queue_callback and exec_final_callbacks (see the sketch after this list)
  • Memory check for tensors involved in the callback
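
A hypothetical sketch of the first check, for illustration only: the graph-break trigger, the compiled-autograd entry point, and the test structure are assumptions based on this discussion, not the final PR code.

```python
# Hypothetical test sketch for "graph break between queue_callback and
# exec_final_callbacks". Assumes the eager queue_callback entry point and
# the torch._dynamo.compiled_autograd.enable() context manager; the exact
# helper names in the PR may differ.
import torch
from torch.autograd import Variable

def test_callback_survives_graph_break():
    called = []

    def hook(grad):
        # Queue the final callback mid-backward...
        Variable._execution_engine.queue_callback(lambda: called.append(1))
        # ...then force a graph break before backward finishes.
        torch._dynamo.graph_break()
        return grad

    x = torch.randn(4, requires_grad=True)
    x.register_hook(hook)

    def compiler_fn(gm):
        return torch.compile(gm, backend="eager")

    with torch._dynamo.compiled_autograd.enable(compiler_fn):
        x.sum().backward()

    # The callback must still run exactly once, after backward completes.
    assert called == [1]
```

The memory check in the second bullet would similarly assert, e.g. via a weakref, that tensors captured by the callback's closure are freed once backward finishes.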

torch/_dynamo/output_graph.py — review thread (outdated, resolved)
test/test_autograd.py — review thread (outdated, resolved)
torch/_dynamo/external_utils.py — review thread (outdated, resolved)