
[core][experimental] Support nested and dynamically sized GPU tensors via NCCL #45306

Closed
Tracked by #43830
stephanie-wang opened this issue May 13, 2024 · 2 comments · Fixed by #45473
Assignees: stephanie-wang
Labels: core (Issues that should be addressed in Ray Core), enhancement (Request for new feature and/or capability), P1 (Issue that should be fixed within a few weeks)

Comments

@stephanie-wang
Contributor

Description

#45092 introduces statically sized p2p transfer for GPU tensors via NCCL. We should also support GPU tensors that are stored inside other data on host memory, as well as dynamically sized tensors.
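
For illustration, the two cases might look something like the following hypothetical actor (the actor and method names are invented for this sketch and are not part of Ray's API):

```python
import ray
import torch

# Hypothetical illustration of the two cases requested in this issue:
# a GPU tensor nested inside host-memory data, and a tensor whose size
# is only known at runtime. The actor and method names are invented.

@ray.remote(num_gpus=1)
class Sender:
    def send_nested(self, step):
        # GPU tensor stored inside other data (a dict) on host memory.
        return {"step": step, "activations": torch.ones(16, device="cuda")}

    def send_dynamic(self, n):
        # Tensor whose shape depends on the input and so cannot be
        # declared statically.
        return torch.zeros(n, device="cuda")
```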


@stephanie-wang stephanie-wang added enhancement Request for new feature and/or capability triage Needs triage (eg: priority, bug/not-bug, and owning component) labels May 13, 2024
@stephanie-wang stephanie-wang self-assigned this May 13, 2024
stephanie-wang added a commit that referenced this issue May 20, 2024
…passed via NCCL in accelerated DAG (#45332)

This adds support for dynamically sized torch.Tensors to be passed
between accelerated DAG nodes via NCCL. Specifically, the following code
is now supported, whereas previously `shape` and `dtype` had to be
explicitly passed to `TorchTensorType`.

```python
    with InputNode() as inp:
        dag = sender.send.bind(inp)
        dag = dag.with_type_hint(TorchTensorType(transport="nccl"))
        dag = receiver.recv.bind(dag)

    compiled_dag = dag.experimental_compile()
```

The feature works by creating a shared memory channel to pass the
metadata for the shape and dtype of the tensor. The metadata is then
used to create a buffer of the correct size on the NCCL receiver.
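
As a rough illustration of this scheme (not Ray's actual implementation), the send and receive sides could look like the sketch below, assuming hypothetical `meta_channel` and `nccl_channel` objects with `write`/`read`/`read_into` methods:

```python
import torch

def send_dynamic(tensor, meta_channel, nccl_channel):
    # Pass shape/dtype metadata over the shared-memory (CPU) channel so
    # the receiver knows how large a GPU buffer to allocate.
    meta_channel.write((tuple(tensor.shape), tensor.dtype))
    # Then send the tensor itself over the NCCL channel.
    nccl_channel.write(tensor)

def recv_dynamic(meta_channel, nccl_channel):
    # Read the metadata and allocate a correctly sized buffer on the GPU.
    shape, dtype = meta_channel.read()
    buf = torch.empty(shape, dtype=dtype, device="cuda")
    # Receive the NCCL payload directly into the preallocated buffer.
    nccl_channel.read_into(buf)
    return buf
```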

Initial microbenchmarks show that this adds about 50% throughput overhead
compared to statically declaring the shape and dtype, or about 160us/DAG
call. This seems a bit higher than expected (see also #45319).

This also adds a few other fixes:
- Support for reusing actors to create new NCCL groups, which is needed
if a DAG is torn down and a new one is created.
- A lock around DAG teardown, to prevent the same NCCL group from being
destructed twice.
- A user-defined TorchTensorType shape or dtype is now used as a hint for
the buffer size, instead of a required size. Since buffers are currently
static, an error is thrown if the user tries to return a tensor that is
too large.

Part 1 of #45306; a separate PR will follow for nested tensors.


---------

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: Kai-Hsun Chen <kaihsun@apache.org>
@anyscalesam anyscalesam added the core Issues that should be addressed in Ray Core label May 20, 2024
@rynewang rynewang added P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels May 20, 2024
@anyscalesam
Collaborator

@stephanie-wang is this something that is needed for integration with vLLM or is this a usability item to make it more natural for NCCL users to convert to ADAG?

@stephanie-wang
Contributor Author

It is both!

A lighter-weight integration with vLLM is possible without this, but supporting this would mean that GPU-GPU communication can also be done through DAGs (instead of user code).

stephanie-wang added a commit that referenced this issue May 26, 2024
…ython objects (#45473)

Allows torch.Tensors nested inside Python objects to be transferred via
NCCL using the following syntax:

```python
    with InputNode() as inp:
        dag = sender.send.bind(inp)
        dag = dag.with_type_hint(TorchTensorType(transport="nccl"))
        dag = receiver.recv.bind(dag)
```

We implement this by using an additional shared memory channel to pass
CPU data, with a "nested" NCCL channel to pass the GPU data. Here is the
send procedure for the above code (a sketch follows the list):
1. Serialize the data, extracting all tensors that are on the GPU and
replacing each with a placeholder.
2. Send a list of metadata through the meta_channel.
3. Send the GPU tensors through the NCCL channel.
4. Send the rest of the CPU data through a cpu_data_channel, with the
placeholders for the GPU tensors.
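
A minimal sketch of this procedure, assuming hypothetical channel objects and a made-up `_Placeholder` marker (not Ray's actual classes):

```python
import torch

class _Placeholder:
    """Stands in for an extracted GPU tensor in the CPU-side data."""
    def __init__(self, index):
        self.index = index  # position in the NCCL send order

def send_nested(value, meta_channel, nccl_channel, cpu_data_channel):
    gpu_tensors = []

    def extract(obj):
        # Step 1: walk the structure, pull out GPU tensors, and replace
        # each with a placeholder (only dicts/lists/tensors shown here).
        if isinstance(obj, torch.Tensor) and obj.is_cuda:
            gpu_tensors.append(obj)
            return _Placeholder(len(gpu_tensors) - 1)
        if isinstance(obj, dict):
            return {k: extract(v) for k, v in obj.items()}
        if isinstance(obj, list):
            return [extract(v) for v in obj]
        return obj

    cpu_value = extract(value)
    # Step 2: send a list of (shape, dtype) metadata through the meta_channel.
    meta_channel.write([(tuple(t.shape), t.dtype) for t in gpu_tensors])
    # Step 3: send the GPU tensors through the NCCL channel.
    for t in gpu_tensors:
        nccl_channel.write(t)
    # Step 4: send the rest of the CPU data, with placeholders standing
    # in for the GPU tensors, through the cpu_data_channel.
    cpu_data_channel.write(cpu_value)
```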

Note that if the TorchTensorType doesn't have a shape and dtype
specified, we currently use the separate meta_channel to pass metadata
for the serialized tensors, as introduced in #45332. To elide the
cpu_data_channel, the user should now use
`TorchTensorType(direct_return=True)`, to indicate that no CPU data is
sent along with the GPU data. To elide the meta_channel, the user should
declare the shape and dtype, e.g., `TorchTensorType(shape=(10, ),
dtype=torch.float16)`.
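
For example, building on the snippets above (same `sender`, `receiver`, `InputNode`, and `TorchTensorType` names; the keyword spellings, and whether they can be combined with `transport="nccl"` in a single call, are taken from the text above and may differ across Ray versions):

```python
import torch

# No CPU data accompanies the GPU tensor, so the cpu_data_channel is elided.
with InputNode() as inp:
    dag = sender.send.bind(inp)
    dag = dag.with_type_hint(
        TorchTensorType(transport="nccl", direct_return=True)
    )
    dag = receiver.recv.bind(dag)

# Shape and dtype are declared statically, so the meta_channel is elided.
with InputNode() as inp:
    dag = sender.send.bind(inp)
    dag = dag.with_type_hint(
        TorchTensorType(shape=(10,), dtype=torch.float16, transport="nccl")
    )
    dag = receiver.recv.bind(dag)
```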

## Related issue number

Closes #45306.

---------

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>