
Adds support for the Falcon2-11B VLM #30854

Closed

Conversation

YasserdahouML

This PR adds support for the Falcon2-11B VLM, a vision-language model that additionally handles image inputs and answers queries about the images. To achieve this, we integrate the pretrained CLIP ViT-L/14 vision encoder with our Falcon2-11B chat-finetuned model and train with image-text data.
To enhance the VLM's perception of fine-grained details in small objects, we employ a dynamic high-resolution encoding mechanism for image inputs, similar to LLaVA-Next.
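
For illustration, usage would look roughly like the LLaVA-Next flow this PR builds on. A minimal sketch, assuming a LLaVA-Next-compatible layout; the checkpoint name and the prompt format below are illustrative assumptions, not part of this PR:

import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaNextForConditionalGeneration

# Checkpoint name and prompt format are assumptions for illustration; the PR's
# FalconVlm classes would be expected to expose the same processor/generate API.
model_id = "tiiuae/falcon-11B-vlm"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

image = Image.open(requests.get("https://example.com/cat.png", stream=True).raw)
prompt = "User:<image>\nDescribe the image.\nFalcon:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))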


The abstract from the paper is the following:

The Falcon2-11B VLM is a vision-language model (VLM) for additionally handling image inputs and answering the queries corresponding to the images. To achieve this, we integrate the pretrained CLIP ViT-L/14 vision encoder with our Falcon2-11B chat-finetuned model and train with image-text data. For enhancing the VLM's perception of fine-grained details w.r.t small objects in images, we employ a dynamic encoding mechanism at high-resolution for image inputs, similar to LLaVA-Next.
Collaborator

Suggested change
The Falcon2-11B VLM is a vision-language model (VLM) for additionally handling image inputs and answering the queries corresponding to the images. To achieve this, we integrate the pretrained CLIP ViT-L/14 vision encoder with our Falcon2-11B chat-finetuned model and train with image-text data. For enhancing the VLM's perception of fine-grained details w.r.t small objects in images, we employ a dynamic encoding mechanism at high-resolution for image inputs, similar to LLaVA-Next.
*The Falcon2-11B VLM is a vision-language model (VLM) for additionally handling image inputs and answering the queries corresponding to the images. To achieve this, we integrate the pretrained CLIP ViT-L/14 vision encoder with our Falcon2-11B chat-finetuned model and train with image-text data. For enhancing the VLM's perception of fine-grained details w.r.t small objects in images, we employ a dynamic encoding mechanism at high-resolution for image inputs, similar to LLaVA-Next.*

Collaborator

This should not be added


logger = logging.get_logger(__name__)

FALCON_VLM_PRETRAINED_CONFIG_ARCHIVE_MAP = {
Collaborator

This should be removed

Collaborator

Again here, the "Ignore copy" on the class is not going to do anything. Either remove it, remove the class entirely if everything is copied from, or make sure "Copied from" is used where needed.
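
For reference, the convention being asked for looks roughly like the sketch below; the class name and body are simplified illustrations, not the PR's actual code. The marker lets utils/check_copies.py keep the body in sync with the referenced LLaVA-Next source while applying the rename:

import torch.nn as nn

# Illustrative sketch of the "Copied from" convention; the projector body is simplified.
# Copied from transformers.models.llava_next.modeling_llava_next.LlavaNextMultiModalProjector with LlavaNext->FalconVlm
class FalconVlmMultiModalProjector(nn.Module):
    def __init__(self, config):
        super().__init__()
        # Project CLIP vision features into the language model's hidden size.
        self.linear_1 = nn.Linear(config.vision_config.hidden_size, config.text_config.hidden_size, bias=True)
        self.act = nn.GELU()
        self.linear_2 = nn.Linear(config.text_config.hidden_size, config.text_config.hidden_size, bias=True)

    def forward(self, image_features):
        hidden_states = self.linear_1(image_features)
        hidden_states = self.act(hidden_states)
        return self.linear_2(hidden_states)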

_CONFIG_FOR_DOC = "FalconVlmConfig"


def get_anyres_image_grid_shape(image_size, grid_pinpoints, patch_size):
Collaborator

Again, the "Copied from" comment is missing here, right?
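
For context, the LLaVA-Next helper this mirrors picks a target resolution from grid_pinpoints and returns the patch grid for it. A heavily simplified sketch (the real helper selects the candidate by effective vs. wasted area rather than this distance heuristic):

def get_anyres_image_grid_shape_sketch(image_size, grid_pinpoints, patch_size):
    # Simplified sketch: pick the candidate resolution closest to the input image,
    # then report how many patches of side patch_size tile that resolution.
    height, width = image_size
    best_h, best_w = min(grid_pinpoints, key=lambda hw: abs(hw[0] - height) + abs(hw[1] - width))
    return best_h // patch_size, best_w // patch_size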

self.vocab_size = model_embeds.num_embeddings
return model_embeds

def _merge_input_ids_with_image_features(
Collaborator

Last time I reviewed, this was copied from Llava or LlavaNext. What happened here?
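
For context, the Llava/LlavaNext method referenced here splices the projected image features into the text embedding sequence at the image-token positions. A heavily simplified sketch of the core idea (the real method also expands the sequence length and rebuilds the attention mask and labels):

import torch

def merge_image_features_sketch(inputs_embeds, input_ids, image_features, image_token_id):
    # inputs_embeds: (batch, seq_len, hidden) text embeddings
    # image_features: projected vision features, one vector per <image> placeholder token
    mask = input_ids == image_token_id
    merged = inputs_embeds.clone()
    merged[mask] = image_features.to(merged.dtype).reshape(-1, merged.shape[-1])
    return merged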


# Copied from transformers.models.clip.processing_clip.CLIPProcessor.decode with CLIP->Falcon
def decode(self, *args, **kwargs):
# Ignore copy
Collaborator

Surprised that this works.

the docstring of this method for more information.
"""
return self.tokenizer.decode(*args, **kwargs)

Collaborator

You told me offline that this is the way you recommend using it, but I don't think this is a good idea: it's not our common API, so I would not really recommend it.
You probably need to define a chat template.
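
For context, defining a chat template could look roughly like this; the checkpoint name, roles, and template string are illustrative assumptions, not the template this PR would ship:

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("tiiuae/falcon-11B-vlm")  # illustrative checkpoint name

# Minimal Jinja template: "User:" turns carry the <image> placeholder, "Falcon:" marks the reply.
processor.tokenizer.chat_template = (
    "{% for message in messages %}"
    "{% if message['role'] == 'user' %}User:<image>\n{{ message['content'] }}\n"
    "{% else %}Falcon:{{ message['content'] }}\n{% endif %}"
    "{% endfor %}"
    "{% if add_generation_prompt %}Falcon:{% endif %}"
)

messages = [{"role": "user", "content": "Describe the image."}]
prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)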

from PIL import Image


class FalconVlmVisionText2TextModelTester:
Collaborator

Again, this should use "Copied from".

params_tied_2 = list(model_tied.parameters())
self.assertEqual(len(params_tied_2), len(params_tied))

@unittest.skip(reason="We don't support this")
Collaborator

For the three following tests, what is the reason for skipping them?

@YasserdahouML YasserdahouML deleted the falcon-11b-vl branch May 16, 2024 15:46