Awesome-Multimodal-Chatbot

Awesome Multimodal Assistant is a curated list of multimodal chatbots and conversational assistants that combine multiple modes of interaction, such as text, speech, images, and video, to provide a seamless and versatile user experience. These systems are designed to help users with tasks ranging from simple information retrieval to complex multimedia reasoning.

Multimodal Instruction Tuning

  • MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning

    arXiv 2022/12 [paper]

  • GPT-4

    arXiv 2023/03 [paper] [blog]

  • Visual Instruction Tuning

    arXiv 2023/04 [paper] [code] [project page] [demo]

  • MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    arXiv 2023/04 [paper] [code] [project page] [demo]

  • mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

    arXiv 2023/04 [paper] [code] [demo]

  • LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

    arXiv 2023/04 [paper] [code] [demo]

  • Video-LLaMA: An Instruction-Finetuned Visual Language Model for Video Understanding

    [code]

  • LMEye: An Interactive Perception Network for Large Language Models

    arXiv 2023/05 [paper] [code]

  • MultiModal-GPT: A Vision and Language Model for Dialogue with Humans

    arXiv 2023/05 [paper] [code] [demo]

  • X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages

    arXiv 2023/05 [paper] [code] [project page]

  • Otter: A Multi-Modal Model with In-Context Instruction Tuning

    arXiv 2023/05 [paper] [code] [demo]

  • InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

    arXiv 2023/05 [paper] [code]

  • InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language

    arXiv 2023/05 [paper] [code] [demo]

  • VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

    arXiv 2023/05 [paper] [code]

  • Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models

    arXiv 2023/05 [paper] [code] [project page]

  • EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought

    arXiv 2023/05 [paper] [code] [project page]

  • DetGPT: Detect What You Need via Reasoning

    arXiv 2023/05 [paper] [code] [project page]

  • PathAsst: Redefining Pathology through Generative Foundation AI Assistant for Pathology

    arXiv 2023/05 [paper] [code]

  • ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst

    arXiv 2023/05 [paper] [code] [project page]

  • Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

    arXiv 2023/06 [paper] [code]

  • LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark

    arXiv 2023/06 [paper]

  • Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation

    arXiv 2023/06 [paper] [project page]

  • Valley: Video Assistant with Large Language Model Enhanced Ability

    arXiv 2023/06 [paper] [code]

LLM-Based Modularized Frameworks

  • Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

    arXiv 2023/03 [paper] [code] [demo]

  • ViperGPT: Visual Inference via Python Execution for Reasoning

    arXiv 2023/03 [paper] [code] [project page]

  • TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs

    arXiv 2023/03 [paper] [code]

  • ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions

    arXiv 2023/03 [paper] [code]

  • MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

    arXiv 2023/03 [paper] [code] [project page] [demo]

  • HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

    arXiv 2023/03 [paper] [code] [demo]

  • VLog: Video as a Long Document

    [code] [demo]

  • Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions

    arXiv 2023/04 [paper] [code]

  • ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System

    arXiv 2023/04 [paper] [project page]

  • VideoChat: Chat-Centric Video Understanding

    arXiv 2023/05 [paper] [code] [demo]
