How to deploy in vLLM? #7
Comments
Thank you for your interest in our work. We are aware of the challenges in implementing KV compression on the current open-source code and are actively working on it. The Hugging Face code is not as efficient as we would like, so we are developing a new open-source implementation on top of vLLM for better performance. The vLLM code, including KV compression, will be released once it is ready.
Can it support quantized deployment, e.g. GPTQ or AWQ?
Hi, we have added vLLM support in this PR (vllm-project/vllm#4650).
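For anyone landing here looking for a concrete starting point, a minimal offline-inference sketch with a vLLM build that includes that PR might look like the following. The model ID, prompt, and sampling settings are placeholders, not taken from this thread:

```python
# Minimal vLLM offline-inference sketch. Assumes a vLLM version that
# includes vllm-project/vllm#4650; "org/model-name" is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="org/model-name",      # placeholder Hugging Face model ID
    trust_remote_code=True,      # the model ships custom modeling code
    tensor_parallel_size=8,      # shard the weights across 8 GPUs
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain KV-cache compression in one sentence."], params)
print(outputs[0].outputs[0].text)
```

For serving rather than offline batch inference, the same versions expose an OpenAI-compatible server via `python -m vllm.entrypoints.openai.api_server --model org/model-name --tensor-parallel-size 8 --trust-remote-code`.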
Thank you for your great work. According to your documentation, the actual deployment on an 8*H800 machine achieves an input throughput of more than 100,000 tokens/s and an output throughput of more than 50,000 tokens/s. Can we achieve such excellent performance with this vLLM support?
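As a side note, those figures come from the authors' internal deployment, so stock vLLM throughput will depend heavily on hardware, batch size, and sequence lengths. If you want to measure output throughput yourself, a rough offline sketch (placeholder model ID; this is not the authors' benchmark harness):

```python
# Rough output-throughput measurement with vLLM's offline API.
# "org/model-name" is a placeholder; adjust batch size and lengths.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="org/model-name", trust_remote_code=True, tensor_parallel_size=8)
params = SamplingParams(temperature=0.0, max_tokens=128, ignore_eos=True)

prompts = ["Write a short story about a robot."] * 256  # one batch of requests
start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"output throughput: {generated / elapsed:.1f} tokens/s")
```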
Hi, thank you for your great work! How much VRAM is needed? I tried with 8*40G but failed with an OOM error.
8x80G; 8*40G only works for a 4-bit model.
Got it, thank you!
A 4-bit model? We don't understand what you mean.
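To unpack the exchange above: a "4-bit model" refers to a weight-quantized checkpoint (e.g., AWQ or GPTQ), which cuts weight memory to roughly a quarter of FP16, which is why 8*40G can be enough. vLLM can load such checkpoints through its `quantization` argument; a sketch, assuming a quantized export of the model exists (the model ID is a placeholder):

```python
# Loading a 4-bit quantized checkpoint in vLLM. Assumes someone has
# published an AWQ export; "org/model-name-awq" is a placeholder.
from vllm import LLM

llm = LLM(
    model="org/model-name-awq",  # placeholder: a 4-bit AWQ export
    quantization="awq",          # use "gptq" for a GPTQ checkpoint
    trust_remote_code=True,
    tensor_parallel_size=8,      # e.g., 8x40G, per the comment above
)
```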
I failed to run inference with vLLM 0.4.2 and got the following error: Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Same problem. Solved by checking the
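For context, the quoted message ("Special tokens have been added in the vocabulary...") is a warning emitted by the Transformers tokenizer loader, not a fatal error by itself; if generation actually fails, the cause is usually elsewhere (e.g., a version mismatch between vLLM and the model's custom code). One way to confirm the tokenizer itself loads cleanly, with a placeholder model ID:

```python
# Load only the tokenizer to check whether the message above is merely
# a warning. "org/model-name" is a placeholder model ID.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("org/model-name", trust_remote_code=True)
print(type(tok).__name__, len(tok))  # tokenizer class and vocabulary size
```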