Adds support for the Falcon2-11B VLM #30854
Conversation
The abstract from the paper is the following:

The Falcon2-11B VLM is a vision-language model (VLM) for additionally handling image inputs and answering the queries corresponding to the images. To achieve this, we integrate the pretrained CLIP ViT-L/14 vision encoder with our Falcon2-11B chat-finetuned model and train with image-text data. For enhancing the VLM's perception of fine-grained details w.r.t small objects in images, we employ a dynamic encoding mechanism at high-resolution for image inputs, similar to LLaVA-Next.
The Falcon2-11B VLM is a vision-language model (VLM) for additionally handling image inputs and answering the queries corresponding to the images. To achieve this, we integrate the pretrained CLIP ViT-L/14 vision encoder with our Falcon2-11B chat-finetuned model and train with image-text data. For enhancing the VLM's perception of fine-grained details w.r.t small objects in images, we employ a dynamic encoding mechanism at high-resolution for image inputs, similar to LLaVA-Next.
*The Falcon2-11B VLM is a vision-language model (VLM) for additionally handling image inputs and answering the queries corresponding to the images. To achieve this, we integrate the pretrained CLIP ViT-L/14 vision encoder with our Falcon2-11B chat-finetuned model and train with image-text data. For enhancing the VLM's perception of fine-grained details w.r.t small objects in images, we employ a dynamic encoding mechanism at high-resolution for image inputs, similar to LLaVA-Next.*
This should not be added
logger = logging.get_logger(__name__)


FALCON_VLM_PRETRAINED_CONFIG_ARCHIVE_MAP = {
This should be removed
Again here, the `# Ignore copy` on the class is not gonna do anything. Either remove it, or remove the class entirely if everything is copied from, or make sure `# Copied from` is used where needed.
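For anyone skimming this thread, here is a hedged illustration of what the `# Copied from` marker looks like in practice. The class, tokenizer, and method below are invented for demonstration, not the PR's code; the point is only the marker's placement: it tells the repo's consistency check (`utils/check_copies.py`) that the body must stay identical to the referenced source after the stated renames, so CI catches drift.

```python
class FalconProcessorSketch:
    """Minimal stand-in processor; this class and its tokenizer are made up."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    # Copied from transformers.models.clip.processing_clip.CLIPProcessor.batch_decode with CLIP->Falcon
    def batch_decode(self, *args, **kwargs):
        # Forward everything to the tokenizer, exactly like the referenced source.
        return self.tokenizer.batch_decode(*args, **kwargs)
```

As the comment above notes, an `# Ignore copy` on a whole class has no effect; the check only honors the markers where it actually looks for them.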
_CONFIG_FOR_DOC = "FalconVlmConfig"


def get_anyres_image_grid_shape(image_size, grid_pinpoints, patch_size): |
Again, `# Copied from` is missing here, right?
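For context, the LLaVA-NeXT version of this helper picks the best candidate resolution from `grid_pinpoints` and divides by the patch size. A simplified pure-Python sketch follows; the names match the signature above, but the selection heuristic is my approximation of the upstream logic, not the PR's code:

```python
def select_best_resolution(original_size, possible_resolutions):
    """Pick the candidate (height, width) that preserves the most of the
    original image after an aspect-ratio-preserving resize, breaking ties
    by minimizing wasted (padded) area."""
    orig_h, orig_w = original_size
    best, best_fit, best_waste = None, -1, float("inf")
    for height, width in possible_resolutions:
        scale = min(width / orig_w, height / orig_h)
        # Pixels of the original that survive the resize (capped at the original count).
        effective = min(int(orig_w * scale) * int(orig_h * scale), orig_w * orig_h)
        waste = height * width - effective
        if effective > best_fit or (effective == best_fit and waste < best_waste):
            best, best_fit, best_waste = (height, width), effective, waste
    return best


def get_anyres_image_grid_shape(image_size, grid_pinpoints, patch_size):
    """Number of vision-encoder patches per side once the image is fit
    to the best pinpoint resolution."""
    height, width = select_best_resolution(image_size, grid_pinpoints)
    return height // patch_size, width // patch_size
```

For example, a 336x672 image against pinpoints [(336, 672), (672, 336), (672, 672)] with patch size 336 yields a (1, 2) grid.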
        self.vocab_size = model_embeds.num_embeddings
        return model_embeds

    def _merge_input_ids_with_image_features(
Last time I reviewed, this was copied from Llava or LlavaNext; what happened here?
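For reference, the LLaVA-style merge (as I recall it) expands each image-token position in the text sequence into the corresponding run of image patch features. A simplified one-dimensional sketch with plain Python lists; the real method works on batched embedding tensors, attention masks, and labels, and the `image_token_id` value here is arbitrary:

```python
def merge_input_ids_with_image_features(input_ids, text_embeds, image_features, image_token_id):
    """Build the final embedding sequence: every occurrence of image_token_id
    is replaced by the full block of image patch embeddings; every other
    position keeps its text embedding."""
    merged = []
    for token_id, embed in zip(input_ids, text_embeds):
        if token_id == image_token_id:
            merged.extend(image_features)  # splice in all patch embeddings
        else:
            merged.append(embed)
    return merged
```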
    # Copied from transformers.models.clip.processing_clip.CLIPProcessor.decode with CLIP->Falcon
    def decode(self, *args, **kwargs):
        # Ignore copy
Surprised that this works.
        the docstring of this method for more information.
        """
        return self.tokenizer.decode(*args, **kwargs)
You told me offline that this is the way you recommend using it, but I don't think this is a good idea: it is not our common API, so I would not really recommend it.
You probably need to define a chat template.
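To make the chat-template suggestion concrete, here is an illustrative sketch. The template string, the `User:`/`Falcon:` role markers, and the `<image>` placeholder are all assumptions for demonstration, not the model's documented prompt format; the helper only mimics, in plain Python, what `tokenizer.apply_chat_template(..., add_generation_prompt=True)` would render for a single user turn under such a template:

```python
# Hypothetical Jinja chat template -- role markers and <image> placeholder
# are invented for illustration, not taken from the PR.
CHAT_TEMPLATE = (
    "{% for message in messages %}"
    "{% if message['role'] == 'user' %}User:<image>\n{{ message['content'] }} {% endif %}"
    "{% if message['role'] == 'assistant' %}Falcon: {{ message['content'] }}{% endif %}"
    "{% endfor %}"
    "{% if add_generation_prompt %}Falcon:{% endif %}"
)


def render_single_user_turn(question):
    """Mimic what apply_chat_template with add_generation_prompt=True would
    produce for one user message under the hypothetical template above."""
    return f"User:<image>\n{question} Falcon:"
```

The benefit is that users then build prompts through the standard `apply_chat_template` API instead of hand-assembling model-specific strings.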
from PIL import Image


class FalconVlmVisionText2TextModelTester:
AGAIN, this should use `# Copied from`.
        params_tied_2 = list(model_tied.parameters())
        self.assertEqual(len(params_tied_2), len(params_tied))

    @unittest.skip(reason="We don't support this")
For the three following tests, what is the reason for skipping them?
This PR adds support for the Falcon2-11B Vision Language Model. The Falcon2-11B VLM is a vision-language model (VLM) for additionally handling image inputs and answering the queries corresponding to the images. To achieve this, we integrate the pretrained CLIP ViT-L/14 vision encoder with our Falcon2-11B chat-finetuned model and train with image-text data.
For enhancing the VLM's perception of fine-grained details w.r.t small objects in images, we employ a dynamic encoding mechanism at high-resolution for image inputs, similar to LLaVA-Next.
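To make the dynamic high-resolution step concrete, here is a rough sketch (my own simplification, not the PR's code) of how an any-resolution scheme tiles a resized image into fixed-size crops, each of which is then passed through the vision encoder alongside a downscaled overview of the whole image:

```python
def divide_to_patches(image_height, image_width, patch_size):
    """Return the (top, left) corner of each fixed-size crop covering the image.
    The image is assumed to have already been resized so that both sides are
    multiples of patch_size (the best-fit pinpoint resolution)."""
    boxes = []
    for top in range(0, image_height, patch_size):
        for left in range(0, image_width, patch_size):
            boxes.append((top, left))
    return boxes
```

For a 672x336 image with 336-pixel crops, this produces two tiles stacked vertically, so small objects in either half are seen at full resolution by the encoder.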