
Re-implement pin_memory to be device-agnostic by leveraging the Accelerator concept #126376

Open · wants to merge 2 commits into base: main
Conversation

@wizzniu commented May 16, 2024

This PR re-implements pin memory with the goal of removing the optional device argument and making all related APIs device-agnostic. We add two new abstract APIs to AcceleratorHooksInterface and redefine pin memory as: "memory is always pinned for the current accelerator device". Concretely, pin_memory/is_pinned use getAcceleratorHooksInterface to obtain the appropriate device and invoke the corresponding overridden hooks, instead of going through BackendSelect and then dispatching to CUDA or another backend's specific implementation.
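
A minimal sketch of the intended flow, assuming the accessor and method names mentioned above (this is illustrative, not the exact PR code; the function name and includes are made up):

  #include <ATen/Context.h>
  #include <ATen/core/Tensor.h>
  #include <ATen/detail/AcceleratorHooksInterface.h>

  // Sketch: is_pinned no longer redispatches through BackendSelect; it asks the
  // hooks of the current accelerator whether the storage pointer is pinned.
  bool is_pinned_sketch(const at::Tensor& self) {
    // Only dense CPU tensors can be host-pinned.
    if (!self.is_cpu()) {
      return false;
    }
    // Resolve the hooks for the current accelerator (CUDA, MPS, ...);
    // each backend overrides isPinnedPtr.
    const at::AcceleratorHooksInterface& hooks =
        at::globalContext().getAcceleratorHooksInterface();
    return hooks.isPinnedPtr(self.storage().data());
  }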

Note: a new backend that wants to implement and use pin memory only needs to inherit from AcceleratorHooksInterface and override the isPinnedPtr and getPinnedMemoryAllocator methods.
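
For illustration, a hypothetical backend's hooks could look like the sketch below (the backend name and the placeholder bodies are made up; only the two overridden methods come from the note above):

  #include <ATen/detail/AcceleratorHooksInterface.h>
  #include <c10/core/Allocator.h>

  struct MyBackendHooks : public at::AcceleratorHooksInterface {
    // The base interface also asks whether the device is initialized;
    // a trivial placeholder is used here.
    bool hasPrimaryContext(c10::DeviceIndex device_index) const override {
      (void)device_index;
      return true;
    }

    // Return true if `data` was allocated from this backend's pinned host pool.
    bool isPinnedPtr(const void* data) const override {
      (void)data;  // placeholder: a real backend would query its runtime here
      return false;
    }

    // Return the allocator that hands out pinned (page-locked) host memory.
    at::Allocator* getPinnedMemoryAllocator() const override {
      return nullptr;  // placeholder: a real backend returns its pinned host allocator
    }
  };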

Additional context: to avoid breaking backward compatibility, this PR preserves the device arg of the related APIs and throws a deprecation warning if it is passed. A follow-up PR, based on this one, will update all PT callers (Tensor.is_pinned(), Tensor.pin_memory(), ...) to stop passing this arg. The arg will actually be removed in the future.

cc @albanD @ezyang

Relates to #124908
Relates to #14560

pytorch-bot bot commented May 16, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126376

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit 50b1875 with merge base 2cb6f20:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the "release notes: mps" label (release notes category) May 16, 2024
@wizzniu (Author) commented May 16, 2024

@albanD @ezyang Could you help review this? If the approach is reasonable, I will go on to the next step of updating all related APIs and the test cases.

return self;
}
return at::_pin_memory(self, device);
TORCH_CHECK(self.device().is_cpu(), "cannot pin '", self.toString(), "' only dense CPU tensors can be pinned");

I think it's better to use _pin_memory here.

Author:

I noticed that the different backends' implementations of _pin_memory mostly run on the host side and share the same logic; the main difference is how each backend obtains its own Allocator. So I removed the _pin_memory function and avoided going through another redispatch.

Collaborator:

Hey!
I'm afraid you're going to have to keep _pin_memory; the reason is what is happening in native_functions.yaml.

Author:

OK, I have added it back.

@ezyang (Contributor) commented May 17, 2024

@albanD for you

@drisspg drisspg added the "triaged" label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) May 20, 2024
@@ -39,6 +40,15 @@ struct TORCH_API AcceleratorHooksInterface {
TORCH_CHECK(false, "Backend doesn't support maybeExchangeDevice()");
return -1;
}

virtual bool isPinnedPtr(const void* /*data*/) const {
Collaborator:

Why is data not a named argument here and for all subclasses?
We can mark it as unused here if we need to appease compiler warnings.

Author:

I think either is okay. I have switched to a named argument, here and in all subclasses.
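
For reference, the two options being weighed look roughly like this (a standalone sketch with made-up struct names, not the actual header):

  // Option A: leave the parameter unnamed so -Wunused-parameter stays quiet.
  struct HooksWithUnnamedParam {
    virtual ~HooksWithUnnamedParam() = default;
    virtual bool isPinnedPtr(const void* /*data*/) const {
      return false;
    }
  };

  // Option B (the direction taken per the reply above): keep the name `data`;
  // the default body ignores it, while overriding subclasses actually use it.
  struct HooksWithNamedParam {
    virtual ~HooksWithNamedParam() = default;
    virtual bool isPinnedPtr(const void* data) const {
      (void)data;  // silence unused-parameter warnings in the default impl
      return false;
    }
  };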


namespace at::mps {

bool _is_pinned_ptr(const void* data);
Collaborator:

Do we need a new file for this? Could it be placed next to getIMPSAllocator()?

Author:

It's really not necessary. I have removed this file and placed the declaration next to getIMPSAllocator() in aten/src/ATen/mps/MPSAllocatorInterface.h.



# TODO: add a copy kwarg that guarantees that the tensor is put into fresh
# pinned memory
- func: pin_memory(Tensor(a) self, Device? device=None) -> Tensor(a)
- func: pin_memory(Tensor(a) self) -> Tensor(a)
Collaborator:

This function's implementation is NOT composite, so we would have to add a CompositeExplicitAutograd: pin_memory entry here. That in turn means you must add back the derivatives formula, and the aliasing information must be accurate.

Since this function sometimes returns a view and sometimes does not, you won't be able to get the aliasing info right, so you will have to do the same trick as before:
pin_memory remains CompositeImplicitAutograd (no specific dispatch entry for it) and can keep its inaccurate aliasing info, as before.
You have a _pin_memory that never aliases, is CompositeExplicitAutograd, is accurate, and has a derivatives.yaml formula.

tl;dr: I'm afraid you have to restore the _pin_memory trick to get proper autograd/aliasing behavior.
Also, since you have to keep that, you might want to keep _pin_memory_nested as well, on top of the CompositeExplicitAutograd implementation.
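
A rough sketch of the trick being described (illustrative only, not the exact PyTorch source):

  #include <ATen/ATen.h>

  // pin_memory stays CompositeImplicitAutograd: it may return its input
  // unchanged (an alias), so its aliasing annotation cannot be exact.
  at::Tensor pin_memory_sketch(const at::Tensor& self) {
    if (self.is_pinned()) {
      return self;  // already pinned: return the very same tensor (an alias)
    }
    // _pin_memory always allocates fresh pinned memory, never aliases its
    // input, and is CompositeExplicitAutograd with its own derivatives formula.
    return at::_pin_memory(self);
  }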

Author:

Yes, you're right. I overlooked the aliasing information before, and it's difficult to use CompositeExplicitAutograd: pin_memory here.
So I have restored the _pin_memory trick and use CompositeImplicitAutograd: pin_memory to guarantee proper autograd behavior.
For _pin_memory, I also keep _pin_memory_nested on top of the CompositeExplicitAutograd implementation. Here I only keep the NestedTensorCPU key, because the manual BackendSelect for _pin_memory is removed and the input tensor self can only be pinned if it is a CPU tensor. So, unlike before, the NestedTensorCUDA key and the other nested keys are now meaningless for _pin_memory. When a nested CUDA tensor is passed in, the behavior changes from throwing a RuntimeError ("only dense CPU tensors can be pinned") to throwing a NotImplementedError ("Could not run aten::_pin_memory with arguments from the NestedTensorCUDA backend").

@@ -4526,26 +4526,14 @@
CPU: channel_shuffle_cpu
CompositeImplicitAutograd: math_channel_shuffle

- func: is_pinned(Tensor self, Device? device=None) -> bool
- func: is_pinned(Tensor self) -> bool
Collaborator:

This is going to be badly BC-breaking, I'm afraid.
To make this PR easier to land, I would preserve the device arg here and continue to respect it; only change the default to point to the accelerator.

We can then have 3 follow-ups:

  • Update all PT callers not to pass this arg
  • Throw a deprecation warning if this arg is passed
  • After 2 releases, actually remove the arg.

Author:

Agreed.

  • If the low-level implementation of pin_memory is acceptable, I will go on to the next step of updating all PT callers of the related APIs.
  • I have added deprecation warnings on the C++ side (see the sketch below), but I'm not sure whether that's enough. WDYT?
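
A minimal sketch of what such a C++-side deprecation warning could look like (hypothetical; the function name and message are illustrative, not the exact PR code):

  #include <ATen/core/Tensor.h>
  #include <c10/core/Device.h>
  #include <c10/util/Exception.h>
  #include <optional>

  bool is_pinned_with_deprecated_arg(
      const at::Tensor& self,
      std::optional<c10::Device> device = std::nullopt) {
    if (device.has_value()) {
      // Warn once per process if the deprecated argument is passed.
      TORCH_WARN_ONCE(
          "The 'device' argument of is_pinned() is deprecated; "
          "memory is always checked against the current accelerator.");
    }
    // Delegate to the device-agnostic implementation.
    return self.is_pinned();
  }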

@albanD (Collaborator) commented May 24, 2024

Hey,
I will take a look at the code.
What I meant by the 3 follow-ups is that they can be 3 different PRs (they don't have to be).

Author:

Thanks!
As you said, to make this PR easier to land, it will only refactor the implementation of pin memory and add a deprecation warning for the device arg. I will submit another PR on top of this one to update all PT callers.

@albanD (Collaborator) left a comment:

Sounds good!
This needs rebasing on the latest main so that we get CI signal, btw!

@@ -20,7 +22,7 @@ namespace at {
// which we may want to call into from CPU code (and thus must be dynamically
// dispatched, to allow for separate compilation of HIP code). See
// CUDAHooksInterface for more detailed motivation.
struct TORCH_API HIPHooksInterface {
struct TORCH_API HIPHooksInterface : AcceleratorHooksInterface {
Collaborator:

FYI @jeffdaily this makes HIP an "accelerator". I'm still not sure if you use it, but this just enables more device-generic features for the HIP device, so I assume you're happy with it. We can remove it if you are not!

Collaborator:

IIRC we don't use HIPHooksInterface but rather a hipified version of CUDAHooksInterface. In any case, we're okay with being an Accelerator. Is there an RFC or something similar describing PyTorch's move to these generic interfaces?


return false;
}

virtual Allocator* getPinnedMemoryAllocator() const override {
Collaborator:

FYI @egienvalue you can implement this if/when you need to support pinned host-side memory used for faster transfers to the device!

Contributor:

That is nice. Right now we have a hacky way to pin CPU tensors.

@wizzniu (Author) commented May 31, 2024

@pytorchbot merge

pytorch-bot bot commented May 31, 2024

The pull workflow has not been scheduled for this PR yet. This could be because the author doesn't have permission to run the workflows, or because skip-checks keywords were added to the PR/commits; aborting the merge. Please get/give approval for the workflows and/or remove the skip-ci decorators before the next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra.

@wizzniu (Author) commented May 31, 2024

@albanD I have rebased. Can we merge it now?

Comment on lines 4531 to 4535

  variants: method
  dispatch:
    NestedTensorCUDA, CUDA: is_pinned_cuda
    MPS: is_pinned_mps
    CompositeExplicitAutograd: is_pinned_default
Collaborator:

Oh, sorry for the oversight on my end.
The implementation here is not a valid CompositeImplicitAutograd implementation (the default key when nothing is specified), because it calls into non-aten ops (the context).

Please update it to:

  dispatch:
    CompositeExplicitAutograd: is_pinned
