[core][experimental] Higher than expected overhead for shared memory channels with NCCL #45319
Labels: accelerated-dag, bug, P1, performance
stephanie-wang added a commit that referenced this issue on May 20, 2024:

…passed via NCCL in accelerated DAG (#45332)

This adds support for dynamically sized torch.Tensors to be passed between accelerated DAG nodes via NCCL. Specifically, the following code is now supported, whereas previously `shape` and `dtype` had to be explicitly passed to `TorchTensorType`.

```python
with InputNode() as inp:
    dag = sender.send.bind(inp)
    dag = dag.with_type_hint(TorchTensorType(transport="nccl"))
    dag = receiver.recv.bind(dag)

compiled_dag = dag.experimental_compile()
```

The feature works by creating a shared memory channel to pass the metadata for the shape and dtype of the tensor. The metadata is then used to create a buffer of the correct size on the NCCL receiver. Initial microbenchmarks show this adds about 50% throughput overhead compared to statically declaring the shape and dtype, or about 160us/DAG call. This seems a bit higher than expected (see also #45319).

This also adds a few other fixes:
- Support for reusing actors to create new NCCL groups, which is needed if a DAG is torn down and a new one is created.
- A lock around DAG teardown, to prevent the same NCCL group from being destructed twice.
- A user-defined TorchTensorType shape or dtype is now used as a hint for the buffer size, instead of a required size. Since buffers are currently static, an error is thrown if the user tries to return a too-large tensor.

Part 1 of #45306; a separate PR will follow for nested tensors.

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: Kai-Hsun Chen <kaihsun@apache.org>
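For comparison, here is a minimal sketch of the static declaration that was previously required. The `shape` and `dtype` parameter names come from the commit message above; the keyword form and the concrete values are assumptions for illustration, and the snippet reuses the `sender`/`receiver` actors from the example:

```python
# Hypothetical sketch of the previously required static declaration.
# `shape`/`dtype` names are from the commit message; the keyword form and
# the concrete values (1024, float16) are illustrative assumptions.
import torch

with InputNode() as inp:
    dag = sender.send.bind(inp)
    dag = dag.with_type_hint(
        TorchTensorType(shape=(1024,), dtype=torch.float16, transport="nccl")
    )
    dag = receiver.recv.bind(dag)

compiled_dag = dag.experimental_compile()
```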
Discussed during standup today:

> The goal is to improve NCCL performance to within 50% overhead, versus the current ~4x. This may or may not directly impact vLLM performance, but it will draw that gap closer as well.
Environment is set up; about to dive into debugging the overhead.
What happened + What you expected to happen
Microbenchmark results for a single-actor accelerated DAG show about 30k calls/s, or about 30us/call. That is consistent with other microbenchmarks that @jackhumphries ran for channel performance, which showed low tens of microseconds per channel op.
However, a microbenchmark for the recently added NCCL transport shows about 5.8k calls/s for NCCL alone and 3.2k calls/s for DAG+NCCL. This translates to about 130us / DAG call, more than 4x what's expected.
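For context on what "NCCL alone" measures, the sketch below shows a raw point-to-point ping loop using `torch.distributed`. This is not the linked microbenchmark; the tensor size, two-GPU setup, and iteration count are illustrative assumptions:

```python
# Minimal NCCL point-to-point ping sketch using torch.distributed.
# NOT the linked microbenchmark; tensor size, two-rank setup, and
# iteration count are illustrative assumptions.
import time

import torch
import torch.distributed as dist

# Expects RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT in the environment
# (e.g., launched with torchrun --nproc_per_node=2).
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)
t = torch.zeros(1024, device="cuda")

NUM_ITERS = 1000
torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(NUM_ITERS):
    if rank == 0:
        dist.send(t, dst=1)
        dist.recv(t, src=1)
    else:
        dist.recv(t, src=0)
        dist.send(t, dst=0)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
if rank == 0:
    print(f"{NUM_ITERS / elapsed:.0f} round trips/s, "
          f"{elapsed / NUM_ITERS * 1e6:.0f} us per round trip")
dist.destroy_process_group()
```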
Versions / Dependencies
3.0dev
Reproduction script
See linked microbenchmarks.
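As a hedged sketch of the general shape such a microbenchmark takes: it assumes the `compiled_dag` built in the commit message above, and that `execute()` returns a reference `ray.get` can resolve (the output-fetching API is an assumption; the linked scripts are the authoritative version):

```python
# Hedged timing-loop sketch, assuming the `compiled_dag` from the commit
# message above. How outputs are fetched (ray.get on a returned ref) is an
# assumption; see the linked microbenchmarks for the actual measurement.
import time

import ray
import torch

NUM_ITERS = 1000
start = time.perf_counter()
for _ in range(NUM_ITERS):
    ref = compiled_dag.execute(torch.zeros(1024, dtype=torch.float16))
    ray.get(ref)  # assumed fetch API; may differ in the actual scripts
elapsed = time.perf_counter() - start
print(f"{NUM_ITERS / elapsed:.0f} calls/s, {elapsed / NUM_ITERS * 1e6:.0f} us/call")
```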
Issue Severity
None