In order to efficiently deploy DeepSeek-V2 for service, we first convert its parameters into the precision of FP8. In addition, we also perform KV cache quantization (Hooper et al., 2024; Zhao et al., 2023) for DeepSeek-V2 to further compress each element in its KV cache into 6 bits on average.
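For a rough sense of what the quoted 6-bit average means for memory, here is a minimal arithmetic sketch. The function names, the token count, and the number of cached elements per token are hypothetical placeholders, not DeepSeek-V2's actual configuration, and the 16-bit baseline is an assumption for comparison.

```python
def kv_cache_bytes(num_tokens: int, elems_per_token: int, bits_per_elem: float) -> float:
    """Approximate KV cache size in bytes for a given average bit width."""
    return num_tokens * elems_per_token * bits_per_elem / 8


def compression_ratio(baseline_bits: float, quantized_bits: float) -> float:
    """How many times smaller the cache becomes after quantization."""
    return baseline_bits / quantized_bits


# Hypothetical workload: 1000 cached tokens, 1024 cached elements per token.
baseline = kv_cache_bytes(1000, 1024, 16)   # assumed 16-bit baseline
quantized = kv_cache_bytes(1000, 1024, 6)   # 6 bits on average, as quoted
ratio = compression_ratio(16, 6)
print(f"baseline={baseline:.0f} B, quantized={quantized:.0f} B, ratio={ratio:.2f}x")
```

A smaller cache per token leaves room for more concurrent sequences on the same GPUs, which may account for part of the throughput gap, though the quote alone does not say how much.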
Hardware: 8 × H800 PCIe
Using vLLM for inference, I can only reach about 1,500 tokens/s at most, with a batch_size of 1024. How can I reach the 50,000+ tokens/s reported in the paper?