
LLaVa-Next: Update docs with batched inference #30857

Merged · 3 commits into huggingface:main on May 20, 2024

Conversation

zucchini-nlp
Member

What does this PR do?

This PR adds an example of batched inference for LLaVa-Next to the docs. Also, as discussed with @NielsRogge, I verified that we can support a multi-turn conversation format by feeding multiple images in one prompt. Both use cases are added to the example code snippet.
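
For reference, here is a minimal sketch of the kind of snippet the PR adds. The checkpoint name, image URLs, and prompt wording below are illustrative placeholders; the prompt must follow the chat template of the checkpoint you load.

```python
import requests
import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
processor = LlavaNextProcessor.from_pretrained(model_id)
# Left-padding keeps generated tokens aligned in a batch
processor.tokenizer.padding_side = "left"
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Placeholder URLs -- substitute any images you like
image_stop = Image.open(requests.get("https://example.com/stop.jpg", stream=True).raw)
image_cats = Image.open(requests.get("https://example.com/cats.jpg", stream=True).raw)
image_snowman = Image.open(requests.get("https://example.com/snowman.jpg", stream=True).raw)

# Prompt 1 is a multi-turn conversation with two images; prompt 2 uses one image.
prompts = [
    "[INST] <image>\nWhat is shown in this image? [/INST] A stop sign. [INST] <image>\nAnd in this one? [/INST]",
    "[INST] <image>\nDescribe the image. [/INST]",
]

# Images are consumed in prompt order: the first two fill the two <image>
# tokens of prompt 1, the third fills the single token of prompt 2.
inputs = processor(
    text=prompts,
    images=[image_stop, image_cats, image_snowman],
    padding=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=50)
print(processor.batch_decode(output, skip_special_tokens=True))
```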

@zucchini-nlp
Member Author

I will check how fine-tuning works and will make a short Colab on that 😄

@NielsRogge (Contributor) left a comment

Thanks for adding! Relevant for #29850

docs/source/en/model_doc/llava_next.md (review thread, outdated, resolved)
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@amyeroberts (Collaborator) left a comment

Thanks for updating!

One question about the expected inputs to the processor.

Comment on lines +129 to +131
# We can simply feed the images in the order they are used in the text prompt
# Each "<image>" token consumes one image, leaving the rest for the subsequent "<image>" tokens
inputs = processor(text=prompt, images=[image_stop, image_cats, image_snowman], padding=True, return_tensors="pt").to(model.device)
@amyeroberts (Collaborator):

This is actually quite surprising behaviour and doesn't match other models that take multiple images per prompt, e.g. Idefics2. I should have caught this in the #29850 PR. I would have expected the images to be in the structure [[image_stop, image_cats], [image_snowman]]. Can the processor accept both?

@zucchini-nlp (Member, Author):

Hmm, I did not know Idefics is different. No, the LLaVa processors work the same way as all the other image processors, so they do not accept nested lists of images.

Also, I did the same thing for Video-LLaVa, which simply assigns images on a rolling basis, replacing each special token in order.

@zucchini-nlp (Member, Author):

Does that mean we should make it Idefics-style? I guess it can be done by flattening the list inside the image processor if we get a nested list. In other words, change make_list_of_images to accept a nested image list.

@amyeroberts (Collaborator):

> No, the LLaVa processors work the same way as all the other image processors, so they do not accept nested lists of images.

The best analogy is with the video image processors. The input to these can be:

  • A single image (frame)
  • A list of images (series of frames, with batch size 1): [image_0_0, image_0_1, image_0_2]
  • A nested list of images (series of frames with batch size b): [[image_0_0, image_0_1, image_0_2], [image_1_0, image_1_1, image_1_2]]

This is in effect a video-like input, where the number of frames can vary per sample.

It's also more analogous to the question-answering input format for tokenizers, where the structure for pairs of sentences to be tokenized is: [[text_a_0, text_a_1], [text_b_0, text_b_1]].

I'd rather the structure were more like this, as it makes the format explicit and removes the ambiguity between this and other image processors' input formats.

Since it already accepts the flat format, what I would suggest is accepting either the flat format or the nested format. This way, users will be better able to switch seamlessly between different models when using the Auto classes.
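
To illustrate the suggestion, here is a sketch of the two call signatures under discussion, reusing the processor, prompts, and images from the earlier snippet; with the proposed change, both calls would produce identical tensors:

```python
# Flat format: images listed in the order their <image> tokens appear
flat = processor(
    text=prompts,
    images=[image_stop, image_cats, image_snowman],
    padding=True,
    return_tensors="pt",
)

# Nested format: images grouped per prompt, Idefics2-style
nested = processor(
    text=prompts,
    images=[[image_stop, image_cats], [image_snowman]],
    padding=True,
    return_tensors="pt",
)

# Same batch of pixel values either way
assert flat["pixel_values"].shape == nested["pixel_values"].shape
```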

@zucchini-nlp (Member, Author):

Done. Similar to the video processors, I added a make_batch function that flattens the list if it's nested.

From the user's perspective nothing changes, and the processor returns the same shapes if a nested list is passed. I also added a test for that.
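
An illustrative sketch of what such a make_batch helper could look like; the actual implementation lives in the transformers image-processing utilities and also validates the element types:

```python
def make_batch(images):
    """Flatten a possibly nested list of images into a flat list (sketch only)."""
    # Nested case: [[img, img], [img]] -> [img, img, img]
    if isinstance(images, (list, tuple)) and images and isinstance(images[0], (list, tuple)):
        return [image for sample in images for image in sample]
    # Already-flat list of images -> keep as-is
    if isinstance(images, (list, tuple)):
        return list(images)
    # Single image -> batch of one
    return [images]
```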

@amyeroberts (Collaborator) left a comment

Thanks for adding and for making the inputs more consistent across processors!

@zucchini-nlp merged commit 5d0bf59 into huggingface:main on May 20, 2024
21 checks passed
itazap pushed a commit that referenced this pull request May 24, 2024
* update docs with batch ex

* Update docs/source/en/model_doc/llava_next.md

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* accept nested list of img

---------

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>