Use PromptTemplate for custom HuggingFace model #322

Closed · joshpopelka20 opened this issue May 16, 2024 · 3 comments

@joshpopelka20 (Contributor)

I'm trying to use an HF Hub model that allows for function calling. From the docs, it seems that as long as you have an access_token, you can use an HF model. This is the code for the model I want to use:

from mistralrs import Runner, Which

llm = Runner(
    which=Which.GGUF(
        tok_model_id="NousResearch/Meta-Llama-3-8B-Instruct",
        quantized_model_id="NousResearch/Hermes-2-Pro-Llama-3-8B-GGUF",
        quantized_filename="Hermes-2-Pro-Llama-3-8B-Q4_K_M.gguf",
        tokenizer_json=None,
        repeat_last_n=64,
    ),
    token_source=access_token,
)

I want to pass a custom prompt (or prompt template) based on the prompt that the model uses for JSON mode.

This is the code that I've tried, but it just seems to hang:

from mistralrs import CompletionRequest

output = llm.send_completion_request(
    CompletionRequest(
        model="llama",
        prompt=prompt,
        echo_prompt=True,
        presence_penalty=1.0,
        top_p=0.1,
        temperature=0.1,
    )
)

Any idea how to send a custom prompt?

@joshpopelka20 (Contributor, Author)

It loads the model in interactive mode with ./mistralrs_server -i --token-source "literal:hf_..." --port 1234 --log output.log gguf -m NousResearch/Hermes-2-Pro-Llama-3-8B-GGUF -t NousResearch/Meta-Llama-3-8B-Instruct -f Hermes-2-Pro-Llama-3-8B-Q4_K_M.gguf, and it runs pretty quickly when I send a simple prompt.

2024-05-17T15:42:52.898933Z  INFO mistralrs_core::pipeline::gguf: Model config:
general.architecture: llama
general.file_type: 15
general.name: Hermes-2-Pro-Llama-3-8B
general.quantization_version: 2
llama.attention.head_count: 32
llama.attention.head_count_kv: 8
llama.attention.layer_norm_rms_epsilon: 0.00001
llama.block_count: 32
llama.context_length: 8192
llama.embedding_length: 4096
llama.feed_forward_length: 14336
llama.rope.dimension_count: 128
llama.rope.freq_base: 500000
llama.vocab_size: 128288
2024-05-17T15:43:07.255098Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|begin_of_text|>", eos_toks = "<|end_of_text|>", "<|eot_id|>", unk_tok = `None`
2024-05-17T15:43:07.278012Z  INFO mistralrs_server: Model loaded.

I see some documentation about chat_templates in https://github.com/EricLBuehler/mistral.rs/blob/master/docs/CHAT_TOK.md, but the examples it links to at https://github.com/EricLBuehler/mistral.rs/blob/master/docs/chat_templates appear to be missing.

Can you provide some examples of Chat Templates that can be used?

For the HF GGUF model that I'm using, this is the suggested prompt template for JSON mode:

<|im_start|>system
You are a helpful assistant that answers in JSON. Here's the json schema you must adhere to:\n<schema>\n{schema}\n</schema><|im_end|>
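
For reference, a minimal sketch of how the {schema} placeholder might be filled from a Pydantic model (the Answer model is hypothetical and Pydantic v2's model_json_schema is assumed):

import json
from pydantic import BaseModel

class Answer(BaseModel):  # hypothetical schema, for illustration only
    answer: str
    confidence: float

# Substitute the model's JSON schema into the {schema} placeholder above.
pydantic_schema = json.dumps(Answer.model_json_schema())
system_prompt = (
    "You are a helpful assistant that answers in JSON. "
    "Here's the json schema you must adhere to:\n<schema>\n"
    + pydantic_schema
    + "\n</schema>"
)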

@EricLBuehler (Owner)

Hi @joshpopelka20!

#327 added some docs and fixed the broken link.

As you can see in this file: https://github.com/EricLBuehler/mistral.rs/blob/master/chat_templates/chatml.json, all you need to do is specify the full chat template (given the inputs messages, add_generation_prompt, bos_token, eos_token, and unk_token) and pass that file path:

./mistralrs-server --port 1234 --log output.log --chat-template ./chat_templates/chatml.json llama
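
For a rough idea of the file contents, here is a minimal sketch that writes a ChatML-style template to my_chat_template.json; it assumes the file follows the same shape as chat_templates/chatml.json (a single chat_template key holding a Jinja template string) and simplifies the real template, so treat the repo file as authoritative:

import json

# Simplified ChatML-style Jinja template; the real chat_templates/chatml.json is the reference.
chatml_template = (
    "{% for message in messages %}"
    "{{ '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n' }}"
    "{% endfor %}"
    "{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
)

with open("my_chat_template.json", "w") as f:
    json.dump({"chat_template": chatml_template}, f)

The resulting path can then be passed via --chat-template as in the command above.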

@joshpopelka20 (Contributor, Author)

Excellent! I'll test this out.

In the meantime, I found a workaround using ChatCompletionRequest:

from mistralrs import ChatCompletionRequest

messages = [
    {"role": "system", "content": "You are a helpful assistant that only answers in JSON. Here's the json schema you must adhere to:\n<schema>\n{{" + pydantic_schema + "}}\n</schema>\n"},
    {"role": "user", "content": prompt}
]

output = llm.send_chat_completion_request(
    ChatCompletionRequest(
        model="llama",
        messages=messages,
        max_tokens=256,
        presence_penalty=1.0,
        top_p=0.1,
        temperature=0,
    )
)
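
To consume the result, a short sketch (this assumes the response mirrors the OpenAI-style choices[0].message.content layout; check the mistralrs Python API if it differs):

import json

# Assumed OpenAI-style response shape; the system prompt constrains the model to emit JSON.
raw = output.choices[0].message.content
parsed = json.loads(raw)
print(parsed)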

Hope this helps the next dev looking into something similar.

Also, thanks for working on this open-source project; I was able to get an approximately 90% improvement in response time. Looking forward to more optimizations that decrease it further.
