[core][experimental] Support broadcast NCCL ops in accelerated DAG #45308
Labels
accelerated-dag
enhancement
Request for new feature and/or capability
P1
Issue that should be fixed within a few weeks
performance
usability
Description
When the same GPU tensor is sent to multiple readers, we should use ncclBroadcast under the hood to reduce transfer time.
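To make the expected benefit concrete, here is a rough back-of-the-envelope cost model. It is a sketch under stated assumptions, not a measurement and not how the accelerated DAG or NCCL is implemented: it assumes the alternative to a broadcast is one full point-to-point send per reader from the writer, and that a broadcast behaves like a chunked, pipelined ring. The function names, the `num_chunks` parameter, and the single uniform `bandwidth` are all hypothetical simplifications.

```python
# Simplified cost model (an assumption for illustration, not NCCL's actual algorithm).
# All links are assumed to have the same bandwidth and zero latency.

def p2p_time(size_bytes, bandwidth, num_readers):
    """Writer sends the full tensor to each reader, one reader after another."""
    return num_readers * size_bytes / bandwidth


def ring_broadcast_time(size_bytes, bandwidth, num_readers, num_chunks=64):
    """Tensor is split into chunks that are pipelined along a chain of readers.

    Each reader forwards a chunk to the next reader while receiving the
    following chunk, so the last reader holds the final chunk after
    num_chunks + num_readers - 1 chunk-sized steps.
    """
    chunk_time = (size_bytes / num_chunks) / bandwidth
    return (num_chunks + num_readers - 1) * chunk_time


if __name__ == "__main__":
    size = 1 << 30        # 1 GiB tensor (hypothetical)
    bandwidth = 50e9      # 50 GB/s per link (hypothetical)
    for readers in (1, 2, 4, 8):
        print(
            readers,
            p2p_time(size, bandwidth, readers),
            ring_broadcast_time(size, bandwidth, readers),
        )
```

Under this model the per-reader-send cost grows linearly with the number of readers, while the pipelined broadcast cost stays close to the cost of a single transfer as long as the number of chunks is much larger than the number of readers.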
Use case
No response