[Usage] How to run inference for llava-next-72b? #1503

Open
nomadlx opened this issue May 15, 2024 · 25 comments

@nomadlx

nomadlx commented May 15, 2024

Describe the issue

Issue:
How to run inference for llava-next-72b/llava-next-110b?

There are many versions of LLaVA, the code does not seem to be compatible across them, and there are multiple related repositories, which is confusing.

https://github.com/haotian-liu/LLaVA : does this only support LLaVA-1.5, for both training and inference?
https://github.com/LLaVA-VL/LLaVA-NeXT/tree/inference : does this support only inference, for LLaVA-NeXT and LLaVA-NeXT-Video?
https://github.com/EvolvingLMMs-Lab/lmms-eval : is llava-next-72b supported?
https://github.com/LLaVA-VL/LLaVA-NeXT/blob/inference/docs/LLaVA-NeXT.md : this seems to give a running example; does it support llava-next-72b/llava-next-110b? Are there any code version requirements?

@gyupro

gyupro commented May 16, 2024

Why are there so many repos with the same content? It's really confusing.

@Luodian

Luodian commented May 16, 2024

They use the same code structure but with different content; you can regard llava-next-72b/110b as relying on an upgraded llava repo.

Since it's a team effort, we released it to a team repo.

For llava-next-72b inference, use the provided code, thanks!
https://github.com/LLaVA-VL/LLaVA-NeXT/blob/inference/docs/LLaVA-NeXT.md

QAs:

https://github.com/EvolvingLMMs-Lab/lmms-eval
Q: Is llava-next-72b supported?
A: Yes, supported.

https://github.com/LLaVA-VL/LLaVA-NeXT/blob/inference/docs/LLaVA-NeXT.md
Q: This seems to give a running example; does it support llava-next-72b/llava-next-110b? Are there any code version requirements?
A: Use the repo's provided environment for llava-next-72b/110b. It is also expected to be backward compatible with llava-next-34b and llava-1.5.

@gyupro

gyupro commented May 16, 2024

@Luodian

Thank you for your team's effort.

I was wondering how much GPU RAM (or how many A100s) you need to run llava-next-72b and 110b?

Your team's research helps a lot to the open source community. Thank you !

@Luodian

Luodian commented May 16, 2024

No problem!

We have a model card that shows this info:

https://llava-vl.github.io/blog/2024-05-10-llava-next-stronger-llms/

@nomadlx
Author

nomadlx commented May 16, 2024

https://llava-vl.github.io/blog/2024-05-10-llava-next-stronger-llms/

How should I set model_name and conv_template for llava-next-72b/110b?
Is there anything else I should be aware of?

@Luodian

Luodian commented May 17, 2024

For llava-next-72b/110b in lmms-eval, use conv_template=qwen_1_5.

In sglang it's slightly different; you can check the examples/usage folder in sglang's repo.

@rafaelrdias

For llava-next-72b/110b in lmms-eval, use conv_template=qwen_1_5.

In sglang it's slightly different; you can check the examples/usage folder in sglang's repo.

I've been trying to find this "examples/usage" folder but couldn't! Can you give more details on where to find it, please?

@nomadlx
Author

nomadlx commented May 20, 2024

For llava-next-72b/110b in lmms-eval, use conv_template=qwen_1_5.
In sglang it's slightly different; you can check the examples/usage folder in sglang's repo.

I've been trying to find this "examples/usage" folder but couldn't! Can you give more details on where to find it, please?

It works for me: use the example in https://github.com/LLaVA-VL/LLaVA-NeXT/blob/inference/docs/LLaVA-NeXT.md and set model_name = "llava_qwen72b" and conv_template = "qwen_1_5".
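For reference, here is a minimal single-image sketch adapted from the LLaVA-NeXT.md example linked above, with those settings plugged in. The image path is a placeholder, device_map="auto" assumes you have enough GPUs to shard the 72B weights, and the exact load_pretrained_model signature should be checked against the repo version you install:

import copy
import torch
from PIL import Image

from llava.model.builder import load_pretrained_model
from llava.mm_utils import process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates

pretrained = "lmms-lab/llava-next-72b"
model_name = "llava_qwen72b"
conv_template = "qwen_1_5"
device = "cuda"

# Load tokenizer, model, and image processor; shard the weights across GPUs.
tokenizer, model, image_processor, max_length = load_pretrained_model(
    pretrained, None, model_name, device_map="auto"
)
model.eval()

# Preprocess one image (placeholder path).
image = Image.open("example.jpg")
image_tensor = process_images([image], image_processor, model.config)
image_tensor = [img.to(dtype=torch.float16, device=device) for img in image_tensor]

# Build the prompt with the qwen_1_5 conversation template.
question = DEFAULT_IMAGE_TOKEN + "\nWhat is shown in this image?"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)

output_ids = model.generate(
    input_ids,
    images=image_tensor,
    image_sizes=[image.size],
    do_sample=False,
    max_new_tokens=256,
)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])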

@pseudotensor

pseudotensor commented May 23, 2024

They use the same code structure but with different content; you can regard llava-next-72b/110b as relying on an upgraded llava repo.

Since it's a team effort, we released it to a team repo.

For llava-next-72b inference, use the provided code, thanks! https://github.com/LLaVA-VL/LLaVA-NeXT/blob/inference/docs/LLaVA-NeXT.md

QAs:

https://github.com/EvolvingLMMs-Lab/lmms-eval : Q: Is llava-next-72b supported? A: Yes, supported.

https://github.com/LLaVA-VL/LLaVA-NeXT/blob/inference/docs/LLaVA-NeXT.md : Q: This seems to give a running example; does it support llava-next-72b/llava-next-110b? Are there any code version requirements? A: Use the repo's provided environment for llava-next-72b/110b. It is also expected to be backward compatible with llava-next-34b and llava-1.5.

@Luodian

I'm also confused. I use the llava inference engine stuff (server-worker etc.), but the new repo has none of that. Is only non-server-based inference supported for the new models?

By new models, I mean the llama-3-based, Qwen-based, etc. ones. Can only llava 1.5/1.6 be run on the server-client platform?

@Luodian

Luodian commented May 23, 2024

@pseudotensor Let me try to explain and clarify your confusion.

In LLaVA's original server demo, it basically uses sglang (in stream mode) as the endpoint model. Although sglang_worker.py is a little complicated, the logic of that file is to post a request to the sglang endpoint model, get the response back in stream mode, and then feed it back to gradio_web_server.py to display in the frontend textbox.

For llava-next-72b, we provide the inference code with sglang and evaluation with lmms-eval.
You can refer here to see the usage of sglang's http/srt_runtime:

https://github.com/sgl-project/sglang/tree/main/examples/usage/llava

In sglang, you don't need to install llava since we have already written the llava code into it.

In lmms-eval, you need to install the LLaVA-VL/LLaVA-NeXT repo's llava code to make sure you can correctly execute "from llava.xxx import xxx".

"Install" here means you need to run pip install -e . so that your current environment has the llava package.

There may also be some glitches; if it's a naming issue or a small bug, you can try to hack around it and solve it. If there's a really weird issue, you can email me or ping me in this thread.
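(Not from the docs, just a generic Python sanity check after pip install -e . to confirm which llava package your environment resolves to:)

import llava

print(llava.__file__)  # should point into your LLaVA-VL/LLaVA-NeXT checkout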

@Luodian

Luodian commented May 23, 2024

If you want to execute batch inference with sglang (around a 5x speedup over vanilla PyTorch inference): roughly, it finishes 2000 QAs within 30 min on 8x A100-80G.

Here is the code I am constantly using on my side.

import argparse
import json
import time
import os
import requests

import sglang as sgl
import tqdm
from sglang.test.test_utils import select_sglang_backend
from sglang.utils import dump_state_text
from sglang.lang.chat_template import get_chat_template

import warnings

warnings.filterwarnings("ignore")


@sgl.function
def image_description(s, image_url, prompt, stop_words=[]):
    # prompt = "Please generate detailed descriptions of the given image."
    s += sgl.user(sgl.image(image_url) + prompt)
    s += sgl.assistant(sgl.gen("answer", max_tokens=args.max_tokens, temperature=0.7, top_p=1.0, stop=stop_words))


def main(args):
    import multiprocessing as mp

    mp.set_start_method("spawn", force=True)
    model_name = "lmms-lab/llava-next-72b"
    tokenizer_name = "lmms-lab/llavanext-qwen-tokenizer"
    template_name = "chatml-llava"
    stop_words = ["<|im_end|>"]
    runtime = sgl.Runtime(
        model_path=model_name,
        tokenizer_path=tokenizer_name,
        tp_size=8,
    )
    runtime.endpoint.chat_template = get_chat_template(template_name)
    # Select backend
    sgl.set_default_backend(runtime)
    print(f"chat template: {runtime.endpoint.chat_template.name}")

    file_list = [
        "xxxxxxxxxxxxxxxx.json"
    ]

    image_path_root_list = [
        "xxxxxxxxxxxxxxxx"
    ]

    total_annotations = []
    
    for file, image_path_root in zip(file_list, image_path_root_list):
        with open(file, "r") as f:
            queries = json.load(f)

        tic = time.time()
        batch_size = 16
        annotations = []
        idx = 0
        print(f"Start processing {file}, {idx} / {len(file_list)}")
        for batch_start in tqdm.tqdm(range(0, len(queries), batch_size)):
            idx += 1

            batch_end = min(batch_start + batch_size, len(queries))
            batch_queries = queries[batch_start:batch_end]
            # check if each image exists
            actual_batch_queries = []
            for query in batch_queries:
                image_path = os.path.join(image_path_root, query["image"])
                if os.path.exists(image_path):
                    actual_batch_queries.append(query)

            batch_arguments = []

            for query in actual_batch_queries:
                image_path = os.path.join(image_path_root, query["image"])
                question_id = query["question_id"]
                question = query["question"]
                if "<image>" not in question:
                    question = f"<image>\n{question}. Please carefully inspect this image and answer with a mid-length response."
                batch_arguments.append({"image_url": image_path, "prompt": question, "stop_words": stop_words})

            batch_results = image_description.run_batch(batch_arguments, temperature=0, num_threads=batch_size, progress_bar=False)

            for result, query in zip(batch_results, actual_batch_queries):
                model_response = result.text().split("assistant")[-1].replace(stop_words[0], "").replace("<|end_header_id|>", "").strip()
                # print(f"############## Model response ################\n{model_response}\n############## End ################")
                annotations.append(
                    {
                        "image_path": query["image"],
                        "question_id": query["question_id"],
                        "question": query["question"],
                        "model_response": model_response,
                        "model_name": model_name,
                        "tokenizer_name": tokenizer_name,
                        "template": runtime.endpoint.chat_template.name,
                    }
                )

        latency = time.time() - tic
        print(f"Latency: {latency:.3f}")

        task_name = file.split("/")[-2]
        result_file = file.replace(".json", f"_{model_name.split('/')[-1]}_{task_name}_response.json")
        print(f"Write output to {result_file}")
        with open(result_file, "w") as fout:
            json.dump(annotations, fout, indent=4, ensure_ascii=False)

        total_annotations.extend(annotations)

    total_annotations_file = "./total_response.json"
    with open(total_annotations_file, "w") as fout:
        json.dump(total_annotations, fout, indent=4, ensure_ascii=False)

    runtime.shutdown()

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--max_tokens", type=int, default=1024)
    parser.add_argument("--backend", type=str, default="srt")
    parser.add_argument("--model_name", type=str, default="llava_next_34b")
    parser.add_argument("--host", type=str, default="http://127.0.0.1")
    parser.add_argument("--port", type=int, default=30000)
    parser.add_argument("--result_file", type=str, default="./testmini_model_response.json")
    args = parser.parse_args()
    main(args)
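Rough usage note: with the file_list and image_path_root_list placeholders filled in (each JSON entry is expected to have "image", "question_id", and "question" fields, as read above), the script can be saved as e.g. batch_llava_sglang.py (name is arbitrary) and run with python batch_llava_sglang.py --max_tokens 1024; tp_size=8 assumes 8 GPUs are available for tensor parallelism.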

@pseudotensor

pseudotensor commented May 23, 2024

Thanks! So you do recommend that if I want a server-client setup that doesn't use gradio, I should write a FastAPI wrapper around the kind of sglang-based code you shared above. So sglang + FastAPI would be cleanest? I should no longer try to use the non-sglang server-worker stuff?

In the past it was recommended to use the server-worker model, but also to use sglang with it.

I ask because there's a lot of code in the gradio stuff that handles various things related to each model, which is not really needed when using sglang?

Normally I launch like:

source ~/miniconda3/etc/profile.d/conda.sh
conda activate llava
echo "First conda env: $CONDA_DEFAULT_ENV"

export server_port=10000

if [ 1 -eq 1 ]
   then
python -m llava.serve.controller --host 0.0.0.0 --port $server_port &> 1.log &
fi

if [ 1 -eq 1 ]
   then
export CUDA_VISIBLE_DEVICES=1
export worker_port=40000
python -m llava.serve.model_worker --host 0.0.0.0 --controller http://$IP:$server_port --port $worker_port --worker http://$IP:$worker_port --model-path liuhaotian/llava-v1.6-vicuna-13b --limit-model-concurrency 5 &> 2.log &
fi

if [ 1 -eq 1 ]
   then
export CUDA_VISIBLE_DEVICES=3
export worker_port=40002
python -m llava.serve.model_worker --host 0.0.0.0 --controller http://$IP:$server_port --port $worker_port --worker http://$IP:$worker_port --model-path liuhaotian/llava-v1.6-34b --limit-model-concurrency 5 &>> 34b.log &
fi

sleep 60
if [ 1 -eq 1 ]
   then
  GRADIO_SERVER_PORT=7860 python -m llava.serve.gradio_web_server --controller http://$IP:$server_port --model-list-mode once &>> 3b2.log &
fi

Can this server-worker-gradio stuff no longer be used with newer llava models?

@Luodian

Luodian commented May 23, 2024

Yes. If you only want to bring up a model that serves API requests, e.g. you are providing an API service or have a completely decoupled frontend that sends requests to a backend server, I think you should refer to the http usage of sglang.

That's the cleanest way. I personally tried it using cloudflared and could successfully host a backend service that lets the other side evaluate llava-next-72b via the API.

If your scenario is similar, then you should definitely use FastAPI + SGLang.
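As a rough illustration (not an official demo), a FastAPI wrapper around the same sgl.Runtime used in the batch script above might look like the sketch below. The endpoint path, request fields, the assumption that images are local paths readable by the server, and the state["answer"] access pattern are all illustrative choices to adapt:

import multiprocessing as mp

import sglang as sgl
from fastapi import FastAPI
from pydantic import BaseModel
from sglang.lang.chat_template import get_chat_template

mp.set_start_method("spawn", force=True)  # same as in the batch script above

app = FastAPI()


class Query(BaseModel):
    image_path: str  # a path readable by the server process
    prompt: str


@sgl.function
def describe_image(s, image_path, prompt):
    s += sgl.user(sgl.image(image_path) + prompt)
    s += sgl.assistant(sgl.gen("answer", max_tokens=1024, temperature=0.7, top_p=1.0, stop=["<|im_end|>"]))


@app.on_event("startup")
def load_model():
    # Bring up the sglang runtime once and register it as the default backend.
    runtime = sgl.Runtime(
        model_path="lmms-lab/llava-next-72b",
        tokenizer_path="lmms-lab/llavanext-qwen-tokenizer",
        tp_size=8,
    )
    runtime.endpoint.chat_template = get_chat_template("chatml-llava")
    sgl.set_default_backend(runtime)
    app.state.runtime = runtime


@app.on_event("shutdown")
def unload_model():
    app.state.runtime.shutdown()


@app.post("/describe")
def describe(query: Query):
    # Run the sglang program and return the generated answer.
    state = describe_image.run(image_path=query.image_path, prompt=query.prompt)
    return {"answer": state["answer"]}

It could then be served with something like uvicorn server:app --host 0.0.0.0 --port 8000 (assuming the file is saved as server.py).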

@Luodian

Luodian commented May 23, 2024

For llava-next-72b/110b in lmms-eval, use conv_template=qwen_1_5.
In sglang it's slightly different; you can check the examples/usage folder in sglang's repo.

I've been trying to find this "examples/usage" folder but couldn't! Can you give more details on where to find it, please?

Sorry for missing this message; the code (inference only, not eval) is here: https://github.com/sgl-project/sglang/tree/main/examples/usage/llava

@pseudotensor

@Luodian Thanks.

What about sglang's own OpenAI-compatible API? Does that work for vision models like llava too?

https://github.com/sgl-project/sglang?tab=readme-ov-file#openai-compatible-api

They don't seem to document how that would work for vision models.

@Luodian

Luodian commented May 23, 2024

It seems that's not ready yet.

@pseudotensor

pseudotensor commented May 23, 2024

Thanks. So by http I think you mean this right?

https://github.com/sgl-project/sglang/blob/main/test/srt/test_httpserver_llava.py

It seems to support images.

But I'm unsure whether the image can only be a URL, or whether it can be a byte stream, or needs to be a markdown-style URL, or what. It's not clear.

That example also assumes the file is on the same host as both the server and the client, which is probably not the usual case.

@Luodian

Luodian commented May 23, 2024

Thanks. So by http I think you mean this right?

https://github.com/sgl-project/sglang/blob/main/test/srt/test_httpserver_llava.py

It seems to support images.

But I'm unsure whether the image can only be a URL, or whether it can be a byte stream, or needs to be a markdown-style URL, or what. It's not clear.

That example also assumes the file is on the same host as both the server and the client, which is probably not the usual case.

Actually it's here: https://github.com/sgl-project/sglang/tree/main/examples/usage/llava.

The latter is my implementation, written with reference to the former code. They are basically the same, but the latter is for llava-next-72b and applies the qwen template.

@pseudotensor

pseudotensor commented May 23, 2024

Got it, cool.

But that also uses a URL. Does the server support direct binary bytes, like one would send inside markdown?

e.g.

![Hello World]data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAEYAAAAUCAAAAAAVAxSkAAABrUlEQVQ4y+3TPUvDQBgH8OdDOGa+oUMgk2MpdHIIgpSUiqC0OKirgxYX8QVFRQRpBRF8KShqLbgIYkUEteCgFVuqUEVxEIkvJFhae3m8S2KbSkcFBw9yHP88+eXucgH8kQZ/jSm4VDaIy9RKCpKac9NKgU4uEJNwhHhK3qvPBVO8rxRWmFXPF+NSM1KVMbwriAMwhDgVcrxeMZm85GR0PhvGJAAmyozJsbsxgNEir4iEjIK0SYqGd8sOR3rJAGN2BCEkOxhxMhpd8Mk0CXtZacxi1hr20mI/rzgnxayoidevcGuHXTC/q6QuYSMt1jC+gBIiMg12v2vb5NlklChiWnhmFZpwvxDGzuUzV8kOg+N8UUvNBp64vy9q3UN7gDXhwWLY2nMC3zRDibfsY7wjEkY79CdMZhrxSqqzxf4ZRPXwzWJirMicDa5KwiPeARygHXKNMQHEy3rMopDR20XNZGbJzUtrwDC/KshlLDWyqdmhxZzCsdYmf2fWZPoxCEDyfIvdtNQH0PRkH6Q51g8rFO3Qzxh2LbItcDCOpmuOsV7ntNaERe3v/lP/zO8yn4N+yNPrekmPAAAAAElFTkSuQmCC

@Luodian

Luodian commented May 23, 2024

You can try checking whether the sglang interface supports base64 as "image_data" (I guess it does). Otherwise, you could use another function to take the base64 image, save it to the local instance as a temp file, and then send that to the sglang endpoint.
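For the temp-file route, a minimal sketch reusing the image_description function and the already-initialized runtime from the batch script above (the base64 string is assumed to be the raw image payload without any data: prefix, and args.max_tokens is assumed to be defined as in that script):

import base64
import tempfile


def answer_from_base64(b64_image: str, prompt: str) -> str:
    # Decode the base64 image and write it to a temporary local file,
    # since sgl.image() in the script above is given a local path.
    with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmp:
        tmp.write(base64.b64decode(b64_image))
        tmp_path = tmp.name
    state = image_description.run(image_url=tmp_path, prompt=prompt, stop_words=["<|im_end|>"])
    return state["answer"]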

@pseudotensor

Ok will do.

If it had to go through a file, then the server and client couldn't be on different disks/systems. It would be OK to use a FastAPI wrapper or something to manage that, but one would hope sglang supports it directly.

@Luodian

Luodian commented May 23, 2024

I think I found it, yeah, it's supported:

sgl-project/sglang#212
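If that issue is right about base64 support, a client-side request against the srt HTTP server might look roughly like the sketch below. The request shape follows the test_httpserver_llava.py example linked above, but the field names, the chatml/qwen prompt formatting, and the response key are assumptions to verify against your sglang version:

import base64

import requests

# Placeholder image path on the client side.
with open("example.jpg", "rb") as f:
    b64_image = base64.b64encode(f.read()).decode("utf-8")

prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n<image>\nDescribe this image in detail.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

response = requests.post(
    "http://127.0.0.1:30000/generate",
    json={
        "text": prompt,
        "image_data": b64_image,  # base64 payload instead of a local path or URL
        "sampling_params": {"temperature": 0, "max_new_tokens": 256},
    },
)
print(response.json()["text"])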

@pseudotensor

@Luodian Thanks for all your amazing help! Will try this all out tomorrow (PST here, it's late).

@nomadlx
Author

nomadlx commented May 24, 2024

They use the same code structure but with different content; you can regard llava-next-72b/110b as relying on an upgraded llava repo.

Since it's a team effort, we released it to a team repo.

For llava-next-72b inference, use the provided code, thanks! https://github.com/LLaVA-VL/LLaVA-NeXT/blob/inference/docs/LLaVA-NeXT.md

QAs:

https://github.com/EvolvingLMMs-Lab/lmms-eval : Q: Is llava-next-72b supported? A: Yes, supported.

https://github.com/LLaVA-VL/LLaVA-NeXT/blob/inference/docs/LLaVA-NeXT.md : Q: This seems to give a running example; does it support llava-next-72b/llava-next-110b? Are there any code version requirements? A: Use the repo's provided environment for llava-next-72b/110b. It is also expected to be backward compatible with llava-next-34b and llava-1.5.

A slightly off-topic question: can I use the LLaVA v1.5 code to finetune the LLaVA-NeXT-72b/110b models? Can I use it directly or with only simple modifications, or do I need to wait for a subsequent upgrade?
If it depends on a subsequent upgrade, when can we expect to see it?

@pseudotensor

@Luodian Everything worked perfectly for sglang and llama3 llava 8b model.

Having trouble with qwen based models, related to the discussion here:

sgl-project/sglang#467

ValueError: Unsupported architectures: LlavaQwenForCausalLM

I guess it's not supported yet.
