Adds support for the Falcon2-11B VLM #30854
Conversation
The abstract from the paper is the following:

The Falcon2-11B VLM is a vision-language model (VLM) for additionally handling image inputs and answering the queries corresponding to the images. To achieve this, we integrate the pretrained CLIP ViT-L/14 vision encoder with our Falcon2-11B chat-finetuned model and train with image-text data. For enhancing the VLM's perception of fine-grained details w.r.t small objects in images, we employ a dynamic encoding mechanism at high-resolution for image inputs, similar to LLaVA-Next.
The Falcon2-11B VLM is a vision-language model (VLM) for additionally handling image inputs and answering the queries corresponding to the images. To achieve this, we integrate the pretrained CLIP ViT-L/14 vision encoder with our Falcon2-11B chat-finetuned model and train with image-text data. For enhancing the VLM's perception of fine-grained details w.r.t small objects in images, we employ a dynamic encoding mechanism at high-resolution for image inputs, similar to LLaVA-Next.
*The Falcon2-11B VLM is a vision-language model (VLM) for additionally handling image inputs and answering the queries corresponding to the images. To achieve this, we integrate the pretrained CLIP ViT-L/14 vision encoder with our Falcon2-11B chat-finetuned model and train with image-text data. For enhancing the VLM's perception of fine-grained details w.r.t small objects in images, we employ a dynamic encoding mechanism at high-resolution for image inputs, similar to LLaVA-Next.*
This should not be added
logger = logging.get_logger(__name__)


FALCON_VLM_PRETRAINED_CONFIG_ARCHIVE_MAP = {
This should be removed
Again here, the `# Ignore copy` on the class is not gonna do anything. Either remove it, or remove the class entirely if everything is copied from, or make sure `# Copied from` is used where needed.
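For anyone skimming this thread, here is a hedged illustration of what the `# Copied from` marker looks like in practice. The class, tokenizer, and method below are invented for demonstration, not the PR's code; the point is only the marker's placement: it tells the repo's consistency check (`utils/check_copies.py`) that the body must stay identical to the referenced source after the stated renames, so CI catches drift.

```python
class FalconProcessorSketch:
    """Minimal stand-in processor; this class and its tokenizer are made up."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    # Copied from transformers.models.clip.processing_clip.CLIPProcessor.batch_decode with CLIP->Falcon
    def batch_decode(self, *args, **kwargs):
        # Forward everything to the tokenizer, exactly like the referenced source.
        return self.tokenizer.batch_decode(*args, **kwargs)
```

As the comment above notes, an `# Ignore copy` on a whole class has no effect; the check only honors the markers where it actually looks for them.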
_CONFIG_FOR_DOC = "FalconVlmConfig"


def get_anyres_image_grid_shape(image_size, grid_pinpoints, patch_size): |
Again, `# Copied from` is missing here, right?
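For context, the LLaVA-NeXT version of this helper picks the best candidate resolution from `grid_pinpoints` and divides by the patch size. A simplified pure-Python sketch follows; the names match the signature above, but the selection heuristic is my approximation of the upstream logic, not the PR's code:

```python
def select_best_resolution(original_size, possible_resolutions):
    """Pick the candidate (height, width) that preserves the most of the
    original image after an aspect-ratio-preserving resize, breaking ties
    by minimizing wasted (padded) area."""
    orig_h, orig_w = original_size
    best, best_fit, best_waste = None, -1, float("inf")
    for height, width in possible_resolutions:
        scale = min(width / orig_w, height / orig_h)
        # Pixels of the original that survive the resize (capped at the original count).
        effective = min(int(orig_w * scale) * int(orig_h * scale), orig_w * orig_h)
        waste = height * width - effective
        if effective > best_fit or (effective == best_fit and waste < best_waste):
            best, best_fit, best_waste = (height, width), effective, waste
    return best


def get_anyres_image_grid_shape(image_size, grid_pinpoints, patch_size):
    """Number of vision-encoder patches per side once the image is fit
    to the best pinpoint resolution."""
    height, width = select_best_resolution(image_size, grid_pinpoints)
    return height // patch_size, width // patch_size
```

For example, a 336x672 image against pinpoints [(336, 672), (672, 336), (672, 672)] with patch size 336 yields a (1, 2) grid.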
        self.vocab_size = model_embeds.num_embeddings
        return model_embeds

    def _merge_input_ids_with_image_features(
Last time I reviewed, this was copied from Llava or LlavaNext; what happened here?
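For reference, the LLaVA-style merge (as I recall it) expands each image-token position in the text sequence into the corresponding run of image patch features. A simplified one-dimensional sketch with plain Python lists; the real method works on batched embedding tensors, attention masks, and labels, and the `image_token_id` value here is arbitrary:

```python
def merge_input_ids_with_image_features(input_ids, text_embeds, image_features, image_token_id):
    """Build the final embedding sequence: every occurrence of image_token_id
    is replaced by the full block of image patch embeddings; every other
    position keeps its text embedding."""
    merged = []
    for token_id, embed in zip(input_ids, text_embeds):
        if token_id == image_token_id:
            merged.extend(image_features)  # splice in all patch embeddings
        else:
            merged.append(embed)
    return merged
```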
    # Copied from transformers.models.clip.processing_clip.CLIPProcessor.decode with CLIP->Falcon
    def decode(self, *args, **kwargs):
        # Ignore copy
Surprised that this works.
        the docstring of this method for more information.
        """
        return self.tokenizer.decode(*args, **kwargs)
You told me offline that this is the way you recommend using it, but I don't think this is a good idea: it is not our common API, so I would not really recommend it.
You probably need to define a chat template.
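To make the chat-template suggestion concrete, here is an illustrative sketch. The template string, the `User:`/`Falcon:` role markers, and the `<image>` placeholder are all assumptions for demonstration, not the model's documented prompt format; the helper only mimics, in plain Python, what `tokenizer.apply_chat_template(..., add_generation_prompt=True)` would render for a single user turn under such a template:

```python
# Hypothetical Jinja chat template -- role markers and <image> placeholder
# are invented for illustration, not taken from the PR.
CHAT_TEMPLATE = (
    "{% for message in messages %}"
    "{% if message['role'] == 'user' %}User:<image>\n{{ message['content'] }} {% endif %}"
    "{% if message['role'] == 'assistant' %}Falcon: {{ message['content'] }}{% endif %}"
    "{% endfor %}"
    "{% if add_generation_prompt %}Falcon:{% endif %}"
)


def render_single_user_turn(question):
    """Mimic what apply_chat_template with add_generation_prompt=True would
    produce for one user message under the hypothetical template above."""
    return f"User:<image>\n{question} Falcon:"
```

The benefit is that users then build prompts through the standard `apply_chat_template` API instead of hand-assembling model-specific strings.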
from PIL import Image


class FalconVlmVisionText2TextModelTester:
AGAIN, this should use `# Copied from`.
        params_tied_2 = list(model_tied.parameters())
        self.assertEqual(len(params_tied_2), len(params_tied))

    @unittest.skip(reason="We don't support this")
For the three following tests, what is the reason for skipping them?
This PR adds support for the Falcon2-11B Vision Language Model. The Falcon2-11B VLM is a vision-language model (VLM) for additionally handling image inputs and answering the queries corresponding to the images. To achieve this, we integrate the pretrained CLIP ViT-L/14 vision encoder with our Falcon2-11B chat-finetuned model and train with image-text data.
For enhancing the VLM's perception of fine-grained details w.r.t small objects in images, we employ a dynamic encoding mechanism at high-resolution for image inputs, similar to LLaVA-Next.
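To make the dynamic high-resolution step concrete, here is a rough sketch (my own simplification, not the PR's code) of how an any-resolution scheme tiles a resized image into fixed-size crops, each of which is then passed through the vision encoder alongside a downscaled overview of the whole image:

```python
def divide_to_patches(image_height, image_width, patch_size):
    """Return the (top, left) corner of each fixed-size crop covering the image.
    The image is assumed to have already been resized so that both sides are
    multiples of patch_size (the best-fit pinpoint resolution)."""
    boxes = []
    for top in range(0, image_height, patch_size):
        for left in range(0, image_width, patch_size):
            boxes.append((top, left))
    return boxes
```

For a 672x336 image with 336-pixel crops, this produces two tiles stacked vertically, so small objects in either half are seen at full resolution by the encoder.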