Use PromptTemplate for custom HuggingFace model #322

Closed · joshpopelka20 opened this issue May 16, 2024 · 3 comments

@joshpopelka20 (Contributor)

I'm trying to use an HF Hub model that allows for function calling. From the docs, it seems that as long as you have an access_token, you can use an HF model. This is the code for the model I want to use:

from mistralrs import Runner, Which

llm = Runner(
    which=Which.GGUF(
        tok_model_id="NousResearch/Meta-Llama-3-8B-Instruct",
        quantized_model_id="NousResearch/Hermes-2-Pro-Llama-3-8B-GGUF",
        quantized_filename="Hermes-2-Pro-Llama-3-8B-Q4_K_M.gguf",
        tokenizer_json=None,
        repeat_last_n=64,
    ),
    token_source=access_token,
)

I want to pass a custom prompt (or prompt template) based on the prompt that the model uses for JSON mode.

This is the code that I've tried, but it just seems to hang:

from mistralrs import CompletionRequest

output = llm.send_completion_request(
    CompletionRequest(
        model="llama",
        prompt=prompt,
        echo_prompt=True,
        presence_penalty=1.0,
        top_p=0.1,
        temperature=0.1,
    )
)

Any idea how to send a custom prompt?

@joshpopelka20 (Contributor, Author)

It loads the model in interactive mode with ./mistralrs_server -i --token-source "literal:hf_..." --port 1234 --log output.log gguf -m NousResearch/Hermes-2-Pro-Llama-3-8B-GGUF -t NousResearch/Meta-Llama-3-8B-Instruct -f Hermes-2-Pro-Llama-3-8B-Q4_K_M.gguf, and it runs pretty quickly when I send a simple prompt.

2024-05-17T15:42:52.898933Z  INFO mistralrs_core::pipeline::gguf: Model config:
general.architecture: llama
general.file_type: 15
general.name: Hermes-2-Pro-Llama-3-8B
general.quantization_version: 2
llama.attention.head_count: 32
llama.attention.head_count_kv: 8
llama.attention.layer_norm_rms_epsilon: 0.00001
llama.block_count: 32
llama.context_length: 8192
llama.embedding_length: 4096
llama.feed_forward_length: 14336
llama.rope.dimension_count: 128
llama.rope.freq_base: 500000
llama.vocab_size: 128288
2024-05-17T15:43:07.255098Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|begin_of_text|>", eos_toks = "<|end_of_text|>", "<|eot_id|>", unk_tok = `None`
2024-05-17T15:43:07.278012Z  INFO mistralrs_server: Model loaded.

I see some documentation about chat_templates in https://github.com/EricLBuehler/mistral.rs/blob/master/docs/CHAT_TOK.md, but the examples it links to at https://github.com/EricLBuehler/mistral.rs/blob/master/docs/chat_templates appear to be missing.

Can you provide some examples of Chat Templates that can be used?

For the HF GGUF model that I'm using, this is the suggested prompt template for JSON mode:

<|im_start|>system
You are a helpful assistant that answers in JSON. Here's the json schema you must adhere to:\n<schema>\n{schema}\n</schema><|im_end|>
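
For reference, a minimal sketch of how the {schema} placeholder might be filled from a Pydantic model (the Answer model is hypothetical and Pydantic v2's model_json_schema is assumed):

import json
from pydantic import BaseModel

class Answer(BaseModel):  # hypothetical schema, for illustration only
    answer: str
    confidence: float

# Substitute the model's JSON schema into the {schema} placeholder above.
pydantic_schema = json.dumps(Answer.model_json_schema())
system_prompt = (
    "You are a helpful assistant that answers in JSON. "
    "Here's the json schema you must adhere to:\n<schema>\n"
    + pydantic_schema
    + "\n</schema>"
)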

@EricLBuehler (Owner)

Hi @joshpopelka20!

#327 added some docs and fixed the broken link.

As you can see in this file: https://github.com/EricLBuehler/mistral.rs/blob/master/chat_templates/chatml.json, all you need to do is specify the full chat template (given the inputs messages, add_generation_prompt, bos_token, eos_token, and unk_token) and pass that file path:

./mistralrs-server --port 1234 --log output.log --chat-template ./chat_templates/chatml.json llama
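
For a rough idea of the file contents, here is a minimal sketch that writes a ChatML-style template to my_chat_template.json; it assumes the file follows the same shape as chat_templates/chatml.json (a single chat_template key holding a Jinja template string) and simplifies the real template, so treat the repo file as authoritative:

import json

# Simplified ChatML-style Jinja template; the real chat_templates/chatml.json is the reference.
chatml_template = (
    "{% for message in messages %}"
    "{{ '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n' }}"
    "{% endfor %}"
    "{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
)

with open("my_chat_template.json", "w") as f:
    json.dump({"chat_template": chatml_template}, f)

The resulting path can then be passed via --chat-template as in the command above.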

@joshpopelka20 (Contributor, Author)

Excellent! I'll test this out.

In the meantime, I found a workaround using ChatCompletionRequest:

from mistralrs import ChatCompletionRequest

messages = [
    {"role": "system", "content": "You are a helpful assistant that only answers in JSON. Here's the json schema you must adhere to:\n<schema>\n{{" + pydantic_schema + "}}\n</schema>\n"},
    {"role": "user", "content": prompt}
]

output = llm.send_chat_completion_request(
    ChatCompletionRequest(
        model="llama",
        messages=messages,
        max_tokens=256,
        presence_penalty=1.0,
        top_p=0.1,
        temperature=0,
    )
)
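
To consume the result, a short sketch (this assumes the response mirrors the OpenAI-style choices[0].message.content layout; check the mistralrs Python API if it differs):

import json

# Assumed OpenAI-style response shape; the system prompt constrains the model to emit JSON.
raw = output.choices[0].message.content
parsed = json.loads(raw)
print(parsed)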

Hope this helps the next dev looking into something similar.

Also, thanks for working on this open-source project; I was able to get an approximately 90% improvement in response time. Looking forward to more optimizations that decrease it further.
