[Traceable FSDP][Compiled Autograd] Add queue_callback() support #126366
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126366
Note: Links to docs will display an error until the docs builds have been completed.
❌ 16 New Failures, 4 Unrelated Failures
As of commit 81aa586 with merge base a0429c0:
NEW FAILURES - The following jobs have failed:
BROKEN TRUNK - The following jobs failed but were present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures
UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Do we need these final callbacks in the graph, or do they just need to execute at the end? In the second case, we can execute them on the C++ side, just like eager does.
Yes, we need Dynamo to trace through those final callbacks, so they need to be in the graph.
…fsdp_queue_callback
I think we can avoid the globals on OutputGraph if we construct the CompiledAutogradEngine within the graph.
Would be good to double check:
- Graph break between queue_callback and exec_final_callbacks
- Memory check for tensors involved in the callback
Adds support for `Variable._execution_engine.queue_callback()`, which is used in FSDP2.
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang
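For readers unfamiliar with the API, here is a minimal eager-mode sketch of what `queue_callback()` does: a callback queued while backward is running fires once the backward pass has finished. The `QueueInBackward` function and the `events` list are hypothetical names used only for illustration; this is not the FSDP2 code path and does not exercise Compiled Autograd. Note that `_execution_engine` is a private attribute, so its behavior may differ across PyTorch versions.

```python
import torch
from torch.autograd import Variable

# Hypothetical log used to observe the order of execution.
events = []


class QueueInBackward(torch.autograd.Function):
    """Identity op whose backward queues a final callback (illustrative only)."""

    @staticmethod
    def forward(ctx, x):
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        events.append("backward")
        # queue_callback() must be called while a backward pass is running;
        # the engine invokes the callback after the pass has finished.
        Variable._execution_engine.queue_callback(
            lambda: events.append("final_callback")
        )
        return grad_output


x = torch.ones(3, requires_grad=True)
y = QueueInBackward.apply(x)
y.sum().backward()

print(events)
```

In eager mode the engine runs these callbacks on the C++ side after the graph has executed. Per the discussion above, the compiled path instead needs them inside the graph so that Dynamo can trace through them.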