CogVLM2

👋 Join our Wechat · 💡Try it Online

📍Experience the larger-scale CogVLM model on the ZhipuAI Open Platform.

Recent updates

🔥🔥 News: 2024/6/8:We release CogVLM2 TGI Weight, which is a model can be inferred in TGI. See Inference Code in here
🔥 News: 2024/6/5:We release GLM-4V-9B, which use the same data and training recipes as CogVLM2 but with GLM-9B as the language backbone. We removed visual experts to reduce the model size to 13B. More details at GLM-4 repo.
🔥 News: 2024/5/24: We have released the Int4 version model, which requires only 16GB of video memory for inference. You can also run on-the-fly int4 version by passing --quant 4.
🔥 News: 2024/5/20: We released the next generation model CogVLM2, which is based on llama3-8b and is equivalent (or better) to GPT-4V in most cases ! Welcome to download!

Model introduction

We launch a new generation of CogVLM2 series of models and open source two models based on Meta-Llama-3-8B-Instruct. Compared with the previous generation of CogVLM open source models, the CogVLM2 series of open source models have the following improvements:

Significant improvements in many benchmarks such as TextVQA, DocVQA.
Support 8K content length.
Support image resolution up to 1344 * 1344.
Provide an open source model version that supports both Chinese and English.

You can see the details of the CogVLM2 family of open source models in the table below:

Model name	cogvlm2-llama3-chat-19B	cogvlm2-llama3-chinese-chat-19B
Base Model	Meta-Llama-3-8B-Instruct	Meta-Llama-3-8B-Instruct
Language	English	Chinese, English
Model size	19B	19B
Task	Image understanding, dialogue model	Image understanding, dialogue model
Model link	🤗 Huggingface 🤖 ModelScope 💫 Wise Model	🤗 Huggingface 🤖 ModelScope 💫 Wise Model
Demo Page	📙 Official Demo	📙 Official Demo 🤖 ModelScope
Int4 model	🤗 Huggingface 🤖 ModelScope	🤗 Huggingface 🤖 ModelScope
Text length	8K	8K
Image resolution	1344 * 1344	1344 * 1344

Benchmark

Our open source models have achieved good results in many lists compared to the previous generation of CogVLM open source models. Its excellent performance can compete with some non-open source models, as shown in the table below:

Model	Open Source	LLM Size	TextVQA	DocVQA	ChartQA	OCRbench	MMMU	MMVet	MMBench
CogVLM1.1	✅	7B	69.7	-	68.3	590	37.3	52.0	65.8
LLaVA-1.5	✅	13B	61.3	-	-	337	37.0	35.4	67.7
Mini-Gemini	✅	34B	74.1	-	-	-	48.0	59.3	80.6
LLaVA-NeXT-LLaMA3	✅	8B	-	78.2	69.5	-	41.7	-	72.1
LLaVA-NeXT-110B	✅	110B	-	85.7	79.7	-	49.1	-	80.5
InternVL-1.5	✅	20B	80.6	90.9	83.8	720	46.8	55.4	82.3
QwenVL-Plus	❌	-	78.9	91.4	78.1	726	51.4	55.7	67.0
Claude3-Opus	❌	-	-	89.3	80.8	694	59.4	51.7	63.3
Gemini Pro 1.5	❌	-	73.5	86.5	81.3	-	58.5	-	-
GPT-4V	❌	-	78.0	88.4	78.5	656	56.8	67.7	75.0
CogVLM2-LLaMA3	✅	8B	84.2	92.3	81.0	756	44.3	60.4	80.5
CogVLM2-LLaMA3-Chinese	✅	8B	85.0	88.4	74.7	780	42.8	60.5	78.9

All reviews were obtained without using any external OCR tools ("pixel only").

Project structure

This open source repos will help developers to quickly get started with the basic calling methods of the CogVLM2 open source model, fine-tuning examples, OpenAI API format calling examples, etc. The specific project structure is as follows, you can click to enter the corresponding tutorial link:

basic_demo folder includes:
- CLI demo.
- CLI demo with multiple GPUs .
- Web demo by chainlit.
- API server with OpenAI format.
- Int4 is enabled easily with --quant 4 with 16GB memory usage.
finetune_demo folder includes.
- peft framework examples for efficient finetuning.
- [TODO] sat framework examples for reliable finetuning.
- [TODO] transformation scripts to convert checkpoints from sat to huggingface format.

Useful Links

In addition to the official inference code, you can also refer to the following community-provided inference solutions:

xinference

License

This model is released under the CogVLM2 CogVLM2 LICENSE. For models built with Meta Llama 3, please also adhere to the LLAMA3_LICENSE.

Citation

If you find our work helpful, please consider citing the following papers

@misc{wang2023cogvlm,
      title={CogVLM: Visual Expert for Pretrained Language Models}, 
      author={Weihan Wang and Qingsong Lv and Wenmeng Yu and Wenyi Hong and Ji Qi and Yan Wang and Junhui Ji and Zhuoyi Yang and Lei Zhao and Xixuan Song and Jiazheng Xu and Bin Xu and Juanzi Li and Yuxiao Dong and Ming Ding and Jie Tang},
      year={2023},
      eprint={2311.03079},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Name		Name	Last commit message	Last commit date
Latest commit History 56 Commits
.github		.github
basic_demo		basic_demo
finetune_demo		finetune_demo
resources		resources
.gitignore		.gitignore
LICENSE		LICENSE
MODEL_LICENSE		MODEL_LICENSE
README.md		README.md
README_zh.md		README_zh.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.github

.github

basic_demo

basic_demo

finetune_demo

finetune_demo

resources

resources

.gitignore

.gitignore

LICENSE

LICENSE

MODEL_LICENSE

MODEL_LICENSE

README.md

README.md

README_zh.md

README_zh.md

Repository files navigation

CogVLM2

Recent updates

Model introduction

Benchmark

Project structure

Useful Links

License

Citation

About

Releases

Packages

Contributors 6

Languages

License

THUDM/CogVLM2

Folders and files

Latest commit

History

Repository files navigation

CogVLM2

Recent updates

Model introduction

Benchmark

Project structure

Useful Links

License

Citation

About

Topics

Resources

License

Stars

Watchers

Forks

Languages