
How can I reach the 50,000+ tokens/s throughput reported in the paper? #35

Open
ly19970621 opened this issue May 17, 2024 · 3 comments

Comments

@ly19970621

Hardware: 8x H800 (PCIe)
With vLLM inference I can reach at most 1,500 tokens/s at batch_size 1024. How can I reach the 50,000+ tokens/s reported in the paper?
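
For reference, a minimal vLLM offline-inference sketch matching the setup described above (8-way tensor parallelism, ~1024 concurrent sequences); the checkpoint name, prompts, and sampling parameters are assumptions for illustration, not details from this issue:

```python
# Minimal sketch of the setup described above, assuming vLLM's offline LLM API.
# The checkpoint name, prompts, and sampling parameters are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V2",  # assumed checkpoint name
    tensor_parallel_size=8,           # 8x H800 (PCIe)
    trust_remote_code=True,
    max_num_seqs=1024,                # max concurrent sequences, i.e. the "batch_size" above
)

sampling = SamplingParams(temperature=0.8, max_tokens=256)
prompts = ["Hello, my name is"] * 1024

outputs = llm.generate(prompts, sampling)
total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"generated {total_tokens} tokens")
```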

@haichuan1221

haichuan1221 commented May 19, 2024

Hi, does vLLM run successfully for you? Have you applied any quantization? Also, PCIe bandwidth is relatively low, so tensor parallelism over PCIe can be slow; the H100 machines in the paper are most likely 8-GPU hosts connected with NVLink.

> Hardware: 8x H800 (PCIe). With vLLM inference I can reach at most 1,500 tokens/s at batch_size 1024. How can I reach the 50,000+ tokens/s reported in the paper?

@ly19970621
Author

> Hi, does vLLM run successfully for you? Have you applied any quantization? Also, PCIe bandwidth is relatively low, so tensor parallelism over PCIe can be slow; the H100 machines in the paper are most likely 8-GPU hosts connected with NVLink.
>
> > Hardware: 8x H800 (PCIe). With vLLM inference I can reach at most 1,500 tokens/s at batch_size 1024. How can I reach the 50,000+ tokens/s reported in the paper?
I am already running it with vLLM; do I still need to apply quantization separately?
If quantization is required, could you open-source the quantized model, or at least share the quantization method: is it AWQ or GPTQ?
For the parallelism strategy, should inference use tensor parallelism or pipeline parallelism?
Also, on an 8x A800 SXM (NVLink) machine I get the same 1,500 tokens/s, also with vLLM; the interconnect bandwidth between GPUs is 400 GB/s.
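
If an AWQ- or GPTQ-quantized checkpoint were available, loading it in vLLM would look roughly like the sketch below; the checkpoint path is hypothetical, and this thread does not confirm that such a release exists:

```python
# Hypothetical sketch: loading a pre-quantized checkpoint in vLLM.
# "your-org/DeepSeek-V2-AWQ" is a placeholder path, not a real release.
from vllm import LLM

llm = LLM(
    model="your-org/DeepSeek-V2-AWQ",  # placeholder checkpoint path
    quantization="awq",                # or "gptq", matching how the checkpoint was quantized
    tensor_parallel_size=8,
    trust_remote_code=True,
)
```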

@luofuli
Member

luofuli commented May 27, 2024

In order to efficiently deploy DeepSeek-V2 for service, we first convert its parameters into the precision of FP8. In addition, we also perform KV cache quantization (Hooper et al., 2024; Zhao et al., 2023) for DeepSeek-V2 to further compress each element in its KV cache into 6 bits on average.
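
Note that the FP8 weight conversion and ~6-bit KV-cache quantization quoted above describe the deployment stack in the paper, not anything vLLM ships out of the box; a rough open-source approximation (not the paper's method) would be enabling vLLM's FP8 KV cache, as in this sketch:

```python
# Rough approximation only: vLLM's FP8 KV cache is not the ~6-bit-per-element
# KV-cache quantization described in the DeepSeek-V2 paper.
from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-V2",  # assumed checkpoint name
    tensor_parallel_size=8,
    trust_remote_code=True,
    kv_cache_dtype="fp8",             # store KV-cache elements in 8-bit floating point
)
```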
