
LLaVa-Next: Update docs with batched inference #30857

Merged · 3 commits into huggingface:main on May 20, 2024

Conversation

zucchini-nlp
Member

What does this PR do?

This PR adds an example of batched inference for LLaVa-Next to the docs. Also, as discussed with @NielsRogge, I verified that we can support a multi-turn conversation format by feeding multiple images in one prompt. Both use cases are added to the example code snippet.
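
For reference, here is a minimal sketch of the kind of snippet the PR adds. The checkpoint name, image URLs, and prompt wording below are illustrative placeholders; the prompt must follow the chat template of the checkpoint you load.

```python
import requests
import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
processor = LlavaNextProcessor.from_pretrained(model_id)
# Left-padding keeps generated tokens aligned in a batch
processor.tokenizer.padding_side = "left"
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Placeholder URLs -- substitute any images you like
image_stop = Image.open(requests.get("https://example.com/stop.jpg", stream=True).raw)
image_cats = Image.open(requests.get("https://example.com/cats.jpg", stream=True).raw)
image_snowman = Image.open(requests.get("https://example.com/snowman.jpg", stream=True).raw)

# Prompt 1 is a multi-turn conversation with two images; prompt 2 uses one image.
prompts = [
    "[INST] <image>\nWhat is shown in this image? [/INST] A stop sign. [INST] <image>\nAnd in this one? [/INST]",
    "[INST] <image>\nDescribe the image. [/INST]",
]

# Images are consumed in prompt order: the first two fill the two <image>
# tokens of prompt 1, the third fills the single token of prompt 2.
inputs = processor(
    text=prompts,
    images=[image_stop, image_cats, image_snowman],
    padding=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=50)
print(processor.batch_decode(output, skip_special_tokens=True))
```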

@zucchini-nlp
Member Author

I will check how fine-tuning works and will make a short Colab on that 😄

@NielsRogge (Contributor) left a comment

Thanks for adding! Relevant for #29850

docs/source/en/model_doc/llava_next.md (review thread, outdated, resolved)
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@amyeroberts (Collaborator) left a comment

Thanks for updating!

One question about the expected inputs to the processor.

Comment on lines +129 to +131
# We can simply feed the images in the order they are used in the text prompt
# Each "<image>" token consumes one image, leaving the rest for the subsequent "<image>" tokens
inputs = processor(text=prompt, images=[image_stop, image_cats, image_snowman], padding=True, return_tensors="pt").to(model.device)
@amyeroberts (Collaborator):

This is actually quite surprising behaviour and doesn't match other models that take multiple images per prompt, e.g. Idefics2. I should have caught this in the #29850 PR. I would have expected the images to be in the structure [[image_stop, image_cats], [image_snowman]]. Can the processor accept both?

@zucchini-nlp (Member, Author):

Hmm, I did not know Idefics is different. No, the LLaVa processors work the same way as all the other image processors, so they do not accept nested lists of images.

Also, I did the same thing for Video-LLaVa, which simply assigns images on a rolling basis, replacing each special token in order.

@zucchini-nlp (Member, Author):

Does that mean we should make it Idefics-style? I guess it can be done by flattening the list inside the image processor if we get a nested list. In other words, change make_list_of_images to accept a nested image list.

@amyeroberts (Collaborator):

> No, the LLaVa processors work the same way as all the other image processors, so they do not accept nested lists of images.

The best analogy is with the video image processors. The input to these can be:

  • A single image (frame)
  • A list of images (series of frames, with batch size 1): [image_0_0, image_0_1, image_0_2]
  • A nested list of images (series of frames with batch size b): [[image_0_0, image_0_1, image_0_2], [image_1_0, image_1_1, image_1_2]]

This is in effect a video-like input, where the number of frames can vary per sample.

It's also more analogous to the question-answering input format for tokenizers, where the structure for pairs of sentences to be tokenized is: [[text_a_0, text_a_1], [text_b_0, text_b_1]].

I'd rather the structure were more like this, as it makes the format explicit and removes the ambiguity between this and other image processors' input formats.

Since it already accepts the flat format, what I would suggest is accepting either the flat format or the nested format. This way, users will be better able to switch seamlessly between different models when using the Auto classes.
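
To illustrate the suggestion, here is a sketch of the two call signatures under discussion, reusing the processor, prompts, and images from the earlier snippet; with the proposed change, both calls would produce identical tensors:

```python
# Flat format: images listed in the order their <image> tokens appear
flat = processor(
    text=prompts,
    images=[image_stop, image_cats, image_snowman],
    padding=True,
    return_tensors="pt",
)

# Nested format: images grouped per prompt, Idefics2-style
nested = processor(
    text=prompts,
    images=[[image_stop, image_cats], [image_snowman]],
    padding=True,
    return_tensors="pt",
)

# Same batch of pixel values either way
assert flat["pixel_values"].shape == nested["pixel_values"].shape
```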

@zucchini-nlp (Member, Author):

Done. Similar to the video processors, I added a make_batch function that flattens the list if it's nested.

From the user's perspective nothing changes, and the processor returns the same shapes if a nested list is passed. I also added a test for that.
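
An illustrative sketch of what such a make_batch helper could look like; the actual implementation lives in the transformers image-processing utilities and also validates the element types:

```python
def make_batch(images):
    """Flatten a possibly nested list of images into a flat list (sketch only)."""
    # Nested case: [[img, img], [img]] -> [img, img, img]
    if isinstance(images, (list, tuple)) and images and isinstance(images[0], (list, tuple)):
        return [image for sample in images for image in sample]
    # Already-flat list of images -> keep as-is
    if isinstance(images, (list, tuple)):
        return list(images)
    # Single image -> batch of one
    return [images]
```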

@amyeroberts (Collaborator) left a comment

Thanks for adding and for making the inputs more consistent across processors!

@zucchini-nlp merged commit 5d0bf59 into huggingface:main on May 20, 2024
21 checks passed
itazap pushed a commit that referenced this pull request May 24, 2024
* update docs with batch ex

* Update docs/source/en/model_doc/llava_next.md

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* accept nested list of img

---------

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>