
Custom cpu and cuda operators support #1081

Open
K024 opened this issue Aug 19, 2023 · 2 comments

Comments


K024 commented Aug 19, 2023

Hi. Currently I'm trying to implement some large language models (LLMs) with TorchSharp and have a working demo (here). But when moving on to more features, I found several capabilities required for LLMs missing:

Custom operators

LLMs heavily depend on custom operators like flash attention, RMS norm, and GPTQ int4 matmul for faster inference and reduced model size through quantization.
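For context on why a fused kernel matters, RMS norm itself is simple to state. A minimal pure-Python reference (illustrative only, not TorchSharp or PyTorch code) of the normalization those custom CUDA kernels compute in a single pass:

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Reference RMSNorm: scale x by the reciprocal of its root-mean-square.

    x, weight: lists of floats of equal length; eps guards against division
    by zero. A fused kernel computes the same result in one pass over memory
    instead of several separate elementwise ops.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * w for v, w in zip(x, weight)]

# Example: mean of squares is (1 + 4 + 9 + 16) / 4 = 7.5, so rms ≈ sqrt(7.5)
out = rms_norm([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0, 1.0])
```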

PyTorch allows defining custom operators with native C++ and CUDA source files in two ways: pybind11 bindings and the torch library registration API. The latter seems to work fine with torch.jit.script and could potentially work with TorchSharp's torch.jit.compile and torch.ops.xxx, but loading such a library requires calling a torch native method. TorchSharp may also want some specialized modules for custom ops.
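As a rough illustration of why the torch library route composes well with other languages: ops are registered into a global table keyed by a "namespace::name" string and later resolved by that string, which is what a torch.ops.xxx-style binding can hook into. A minimal pure-Python sketch of that registry pattern (all names hypothetical, not the actual libtorch or TorchSharp API):

```python
# Sketch of a string-keyed op registry, mimicking the shape of
# TORCH_LIBRARY-style registration. Names here are illustrative only.
_OP_REGISTRY = {}

def register_op(qualified_name, fn):
    """Register an implementation under e.g. "myops::rms_norm"."""
    _OP_REGISTRY[qualified_name] = fn

def dispatch(qualified_name, *args, **kwargs):
    """Resolve an op by string and call it, the way a cross-language
    torch.ops-style binding would: the caller needs only the name."""
    try:
        return _OP_REGISTRY[qualified_name](*args, **kwargs)
    except KeyError:
        raise RuntimeError(f"unknown op: {qualified_name}") from None

register_op("myops::add", lambda a, b: a + b)
result = dispatch("myops::add", 2, 3)  # → 5
```

Because the lookup key is just a string, the registering side (native C++/CUDA) and the calling side (C# via TorchSharp) never need to share headers.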

BTW, openai/triton uses MLIR and LLVM to create custom ops, but it is almost entirely bound to Python.

NCCL ops

I've also tried to implement a thread-based distributed approach with TorchSharp (here). The required communication ops are broadcast, scatter, gather, and all-gather. I'm currently emulating them with the naive _copy operator, but that is very slow. Is it possible to provide these NCCL-related ops?
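To pin down the semantics being requested, here is a pure-Python sketch of the four collectives over in-memory per-device buffers (no NCCL, illustrative only; an NCCL backend would perform the same data movement over GPU interconnects):

```python
def broadcast(bufs, src=0):
    """Copy the source device's buffer to every other device."""
    for i in range(len(bufs)):
        if i != src:
            bufs[i] = list(bufs[src])
    return bufs

def scatter(data, n, src=0):
    """Split src's data into n equal chunks, one per device."""
    k = len(data) // n
    return [data[i * k:(i + 1) * k] for i in range(n)]

def gather(bufs, dst=0):
    """Concatenate every device's buffer onto device dst."""
    return [x for b in bufs for x in b]

def all_gather(bufs):
    """Gather followed by broadcast: every device gets the full concatenation."""
    full = gather(bufs)
    return [list(full) for _ in bufs]
```

Emulating these with pairwise _copy calls serializes what NCCL would do with ring or tree algorithms in parallel, which is where the slowdown comes from.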


dje-dev commented Aug 21, 2023

Wow, this work looks very interesting and potentially very useful! Basic distributed training/inference (one host, multiple GPUs) is currently a gap for TorchSharp, and your implementation could be a step toward addressing that.

Two comments/suggestions:

  • beware that there is currently a serious problem with TorchScript execution that is likely to corrupt the memory in your process if you run for more than a minute or two (Torchscript execution failures (hangs, access violation, Fatal error. Internal CLR fatal error. (0x80131506) ) #1047). I'm hoping this can be fixed soon.
  • in order to increase the chances that your request could be addressed, you might consider laying out specifically what you would need from NCCL. For example, it might be helpful if you could provide a pointer to the minimal specific set of torch APIs (in some header file) needed, and a few lines of sample C# code that would use these APIs and demonstrate that they are working correctly
