
Re-implement pin_memory to be device-agnostic by leveraging the Accelerator concept #126376

Open · wants to merge 2 commits into base: main
Conversation

@wizzniu commented May 16, 2024

This PR re-implements pin memory with the goal of removing the optional device argument and making all related APIs device-agnostic. We add two new abstract APIs to AcceleratorHooksInterface and redefine pin memory as: "memory is always pinned for the current accelerator device". Concretely, pin_memory/is_pinned use getAcceleratorHooksInterface to obtain the appropriate device and invoke the corresponding overridden hooks, instead of going through BackendSelect and then dispatching to CUDA or another backend's specific implementation.
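
A minimal sketch of the intended flow, assuming the accessor and method names mentioned above (this is illustrative, not the exact PR code; the function name and includes are made up):

  #include <ATen/Context.h>
  #include <ATen/core/Tensor.h>
  #include <ATen/detail/AcceleratorHooksInterface.h>

  // Sketch: is_pinned no longer redispatches through BackendSelect; it asks the
  // hooks of the current accelerator whether the storage pointer is pinned.
  bool is_pinned_sketch(const at::Tensor& self) {
    // Only dense CPU tensors can be host-pinned.
    if (!self.is_cpu()) {
      return false;
    }
    // Resolve the hooks for the current accelerator (CUDA, MPS, ...);
    // each backend overrides isPinnedPtr.
    const at::AcceleratorHooksInterface& hooks =
        at::globalContext().getAcceleratorHooksInterface();
    return hooks.isPinnedPtr(self.storage().data());
  }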

Note: a new backend that wants to implement and use pin memory only needs to inherit from AcceleratorHooksInterface and override the isPinnedPtr and getPinnedMemoryAllocator methods.
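
For illustration, a hypothetical backend's hooks could look like the sketch below (the backend name and the placeholder bodies are made up; only the two overridden methods come from the note above):

  #include <ATen/detail/AcceleratorHooksInterface.h>
  #include <c10/core/Allocator.h>

  struct MyBackendHooks : public at::AcceleratorHooksInterface {
    // The base interface also asks whether the device is initialized;
    // a trivial placeholder is used here.
    bool hasPrimaryContext(c10::DeviceIndex device_index) const override {
      (void)device_index;
      return true;
    }

    // Return true if `data` was allocated from this backend's pinned host pool.
    bool isPinnedPtr(const void* data) const override {
      (void)data;  // placeholder: a real backend would query its runtime here
      return false;
    }

    // Return the allocator that hands out pinned (page-locked) host memory.
    at::Allocator* getPinnedMemoryAllocator() const override {
      return nullptr;  // placeholder: a real backend returns its pinned host allocator
    }
  };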

Additional context: to avoid breaking backward compatibility, this PR preserves the device arg of the related APIs and throws a deprecation warning if it is passed. A follow-up PR, based on this one, will update all PT callers (Tensor.is_pinned(), Tensor.pin_memory(), ...) to stop passing this arg. The arg will actually be removed in the future.

cc @albanD @ezyang

Relates to #124908
Relates to #14560

pytorch-bot bot commented May 16, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126376

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit 50b1875 with merge base 2cb6f20:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the "release notes: mps" label (release notes category) May 16, 2024
@wizzniu (Author) commented May 16, 2024

@albanD @ezyang Could you help review this? If the approach is reasonable, I will go on to the next step of updating all related APIs and the test cases.

return self;
}
return at::_pin_memory(self, device);
TORCH_CHECK(self.device().is_cpu(), "cannot pin '", self.toString(), "' only dense CPU tensors can be pinned");

I think it's better to use _pin_memory here.

Author:

I noticed that the different backends' implementations of _pin_memory mostly run on the host side and share the same logic; the main difference is how each backend obtains its own Allocator. So I removed the _pin_memory function and avoided going through another redispatch.

Collaborator:

Hey!
I'm afraid you're going to have to keep _pin_memory; the reason is what is happening in native_functions.yaml.

Author:

OK, I have added it back.

@ezyang (Contributor) commented May 17, 2024

@albanD for you

@drisspg drisspg added the "triaged" label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) May 20, 2024
@@ -39,6 +40,15 @@ struct TORCH_API AcceleratorHooksInterface {
TORCH_CHECK(false, "Backend doesn't support maybeExchangeDevice()");
return -1;
}

virtual bool isPinnedPtr(const void* /*data*/) const {
Collaborator:

Why is data not a named argument here and for all subclasses?
We can mark it as unused here if we need to appease compiler warnings.

Author:

I think either is okay. I have switched to a named argument, here and in all subclasses.
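
For reference, the two options being weighed look roughly like this (a standalone sketch with made-up struct names, not the actual header):

  // Option A: leave the parameter unnamed so -Wunused-parameter stays quiet.
  struct HooksWithUnnamedParam {
    virtual ~HooksWithUnnamedParam() = default;
    virtual bool isPinnedPtr(const void* /*data*/) const {
      return false;
    }
  };

  // Option B (the direction taken per the reply above): keep the name `data`;
  // the default body ignores it, while overriding subclasses actually use it.
  struct HooksWithNamedParam {
    virtual ~HooksWithNamedParam() = default;
    virtual bool isPinnedPtr(const void* data) const {
      (void)data;  // silence unused-parameter warnings in the default impl
      return false;
    }
  };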


namespace at::mps {

bool _is_pinned_ptr(const void* data);
Collaborator:

Do we need a new file for this? Could it be placed next to getIMPSAllocator()?

Author:

It's really not necessary. I have removed this file and placed the declaration next to getIMPSAllocator() in aten/src/ATen/mps/MPSAllocatorInterface.h.



# TODO: add a copy kwarg that guarantees that the tensor is put into fresh
# pinned memory
- func: pin_memory(Tensor(a) self, Device? device=None) -> Tensor(a)
- func: pin_memory(Tensor(a) self) -> Tensor(a)
Collaborator:

This function's implementation is NOT composite, so we would have to add a CompositeExplicitAutograd: pin_memory entry here. That in turn means you must add back the derivatives formula, and the aliasing information must be accurate.

Since this function sometimes returns a view and sometimes does not, you won't be able to get the aliasing info right, so you will have to do the same trick as before:
pin_memory remains CompositeImplicitAutograd (no specific dispatch entry for it) and can keep its inaccurate aliasing info, as before.
You have a _pin_memory that never aliases, is CompositeExplicitAutograd, is accurate, and has a derivatives.yaml formula.

tl;dr: I'm afraid you have to restore the _pin_memory trick to get proper autograd/aliasing behavior.
Also, since you have to keep that, you might want to keep _pin_memory_nested as well, on top of the CompositeExplicitAutograd implementation.
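
A rough sketch of the trick being described (illustrative only, not the exact PyTorch source):

  #include <ATen/ATen.h>

  // pin_memory stays CompositeImplicitAutograd: it may return its input
  // unchanged (an alias), so its aliasing annotation cannot be exact.
  at::Tensor pin_memory_sketch(const at::Tensor& self) {
    if (self.is_pinned()) {
      return self;  // already pinned: return the very same tensor (an alias)
    }
    // _pin_memory always allocates fresh pinned memory, never aliases its
    // input, and is CompositeExplicitAutograd with its own derivatives formula.
    return at::_pin_memory(self);
  }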

Author:

Yes, you're right. I overlooked the aliasing information before, and it's difficult to use CompositeExplicitAutograd: pin_memory here.
So I have restored the _pin_memory trick and use CompositeImplicitAutograd: pin_memory to guarantee proper autograd behavior.
For _pin_memory, I also keep _pin_memory_nested on top of the CompositeExplicitAutograd implementation. Here I only keep the NestedTensorCPU key, because the manual BackendSelect for _pin_memory is removed and the input tensor self can only be pinned if it is a CPU tensor. So, unlike before, the NestedTensorCUDA key and the other nested keys are now meaningless for _pin_memory. When a nested CUDA tensor is passed in, the behavior changes from throwing a RuntimeError ("only dense CPU tensors can be pinned") to throwing a NotImplementedError ("Could not run aten::_pin_memory with arguments from the NestedTensorCUDA backend").

@@ -4526,26 +4526,14 @@
CPU: channel_shuffle_cpu
CompositeImplicitAutograd: math_channel_shuffle

- func: is_pinned(Tensor self, Device? device=None) -> bool
- func: is_pinned(Tensor self) -> bool
Collaborator:

This is going to be badly BC-breaking, I'm afraid.
To make this PR easier to land, I would preserve the device arg here and continue to respect it; only change the default to point to the accelerator.

We can then have 3 follow-ups:

  • Update all PT callers not to pass this arg
  • Throw a deprecation warning if this arg is passed
  • After 2 releases, actually remove the arg.

Author:

Agreed.

  • If the low-level implementation of pin_memory is acceptable, I will go on to the next step of updating all PT callers of the related APIs.
  • I have added deprecation warnings on the C++ side (see the sketch below), but I'm not sure whether that's enough. WDYT?
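
A minimal sketch of what such a C++-side deprecation warning could look like (hypothetical; the function name and message are illustrative, not the exact PR code):

  #include <ATen/core/Tensor.h>
  #include <c10/core/Device.h>
  #include <c10/util/Exception.h>
  #include <optional>

  bool is_pinned_with_deprecated_arg(
      const at::Tensor& self,
      std::optional<c10::Device> device = std::nullopt) {
    if (device.has_value()) {
      // Warn once per process if the deprecated argument is passed.
      TORCH_WARN_ONCE(
          "The 'device' argument of is_pinned() is deprecated; "
          "memory is always checked against the current accelerator.");
    }
    // Delegate to the device-agnostic implementation.
    return self.is_pinned();
  }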

@albanD (Collaborator) commented May 24, 2024

Hey,
I will take a look at the code.
What I meant by the 3 follow-ups is that they can be 3 different PRs (they don't have to be).

Author:

Thanks!
As you said, to make this PR easier to land, it will only refactor the implementation of pin memory and add a deprecation warning for the device arg. I will submit another PR on top of this one to update all PT callers.

@albanD (Collaborator) left a comment:

Sounds good!
This needs rebasing on the latest main so that we get CI signal, btw!

@@ -20,7 +22,7 @@ namespace at {
// which we may want to call into from CPU code (and thus must be dynamically
// dispatched, to allow for separate compilation of HIP code). See
// CUDAHooksInterface for more detailed motivation.
struct TORCH_API HIPHooksInterface {
struct TORCH_API HIPHooksInterface : AcceleratorHooksInterface {
Collaborator:

FYI @jeffdaily this makes HIP an "accelerator". I'm still not sure if you use it, but this just enables more device-generic features for the HIP device, so I assume you're happy with it. We can remove it if you are not!

Collaborator:

IIRC we don't use HIPHooksInterface but rather a hipified version of CUDAHooksInterface. In any case, we're okay with being an Accelerator. Is there an RFC or something similar describing PyTorch's move to these generic interfaces?


return false;
}

virtual Allocator* getPinnedMemoryAllocator() const override {
Collaborator:

FYI @egienvalue you can implement this if/when you need to support pinned host-side memory used for faster transfers to the device!

Contributor:

That is nice. Right now we have a hacky way to pin CPU tensors.

@wizzniu (Author) commented May 31, 2024

@pytorchbot merge

pytorch-bot bot commented May 31, 2024

The pull workflow has not been scheduled for this PR yet. This could be because the author doesn't have permission to run the workflows, or because skip-checks keywords were added to the PR/commits; aborting the merge. Please get/give approval for the workflows and/or remove the skip-ci decorators before the next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra.

@wizzniu (Author) commented May 31, 2024

@albanD I have rebased. Can we merge it now?

Comment on lines 4531 to 4535

  variants: method
  dispatch:
    NestedTensorCUDA, CUDA: is_pinned_cuda
    MPS: is_pinned_mps
    CompositeExplicitAutograd: is_pinned_default
Collaborator:

Oh, sorry for the oversight on my end.
The implementation here is not a valid CompositeImplicitAutograd implementation (the default key when nothing is specified), because it calls into non-aten ops (the context).

Please update it to:

  dispatch:
    CompositeExplicitAutograd: is_pinned
