Mini-Gemini:

    Mining the Potential of Multi-modality Vision Language Models

    The Chinese University of Hong Kong

    Updates: Mini-Gemini is coming! We release the paper, code, data, models, and demo for Mini-Gemini.

    Abstract

    In this work, we introduce Mini-Gemini, a simple and effective framework that enhances multi-modality Vision Language Models (VLMs). Although recent VLMs support basic visual dialog and reasoning, a performance gap remains compared to advanced models such as GPT-4 and Gemini. We narrow this gap by mining the potential of VLMs for better performance and an any-to-any workflow from three aspects, i.e., high-resolution visual tokens, high-quality data, and VLM-guided generation. To enhance visual tokens, we propose utilizing an additional visual encoder for high-resolution refinement without increasing the visual token count. We further construct a high-quality dataset that promotes precise image comprehension and reasoning-based generation, expanding the operational scope of current VLMs. Overall, Mini-Gemini further mines the potential of VLMs and empowers the current framework with image understanding, reasoning, and generation simultaneously. Mini-Gemini supports a series of dense and MoE Large Language Models (LLMs) from 2B to 34B. It achieves leading performance on several zero-shot benchmarks and even surpasses developed private models.



    Model

    The framework of Mini-Gemini is conceptually simple: dual vision encoders provide low-resolution visual embeddings and high-resolution candidates; patch info mining conducts patch-level mining between high-resolution regions and low-resolution visual queries; and the LLM marries text with images for comprehension and generation at the same time.
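    The patch-level mining step can be sketched as a cross-attention in which each low-resolution visual token queries the high-resolution patches of its own image region, so that fine detail is absorbed without increasing the token count. The sketch below is a simplified illustration of this idea, not the released implementation; the function name, grouping of high-resolution patches per query, and the residual connection are our assumptions.

    ```python
    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def patch_info_mining(low_res_q, high_res_kv):
        """Refine low-resolution visual tokens with high-resolution detail.

        low_res_q:   (N, d)    low-resolution visual queries (one per token)
        high_res_kv: (N, P, d) high-resolution patch candidates, grouped so
                               that row n holds the P patches covering the
                               region of query n (grouping is an assumption)
        Returns (N, d): the token count N is unchanged.
        """
        N, d = low_res_q.shape
        # Scaled dot-product scores between each query and its own patches.
        scores = np.einsum("nd,npd->np", low_res_q, high_res_kv) / np.sqrt(d)
        attn = softmax(scores, axis=-1)
        # Attention-weighted sum of high-resolution patches per query.
        mined = np.einsum("np,npd->nd", attn, high_res_kv)
        # Residual connection keeps the original coarse embedding (assumption).
        return low_res_q + mined

    rng = np.random.default_rng(0)
    q = rng.standard_normal((16, 8))        # 16 low-res tokens, dim 8
    kv = rng.standard_normal((16, 4, 8))    # 4 high-res patches per token
    refined = patch_info_mining(q, kv)
    print(refined.shape)  # (16, 8) — same number of visual tokens as before
    ```

    The key property the paper emphasizes is visible in the shapes: the high-resolution information is folded into the existing tokens rather than appended, so the LLM's sequence length does not grow.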

    BibTeX

    
    @article{li2024minigemini,
      title={Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models},
      author={Li, Yanwei and Zhang, Yuechen and Wang, Chengyao and Zhong, Zhisheng and Chen, Yixin and Chu, Ruihang and Liu, Shaoteng and Jia, Jiaya},
      journal={arXiv preprint arXiv:2403.18814},
      year={2024}
    }
      

    Acknowledgement

    This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

    Examples
