Configs · June 2, 2026 · 1 min read

CogVLM — Open Visual Language Model with Deep Visual Understanding

A multimodal pretrained model that excels at image captioning, visual question answering, and visual grounding with strong benchmark performance.

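Agent ready

Agents can install this directly

This asset is installable. The agent first selects the current runtime, checks the install plan, and then runs the matching command.

Native · 98/100 · Policy: allowed
Agent entry point: any MCP/CLI agent
Type: Skill
Install: Single
Trust level: Established
Entry: CogVLM Overview
Direct install command:
npx -y tokrepo@latest install 5c1dd4f8-5e1a-11f1-9bc6-00163e2b0d79 --target codex

Do a dry run first to confirm the install plan, then run this command.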

Introduction

CogVLM is an open-source visual language model from Tsinghua University and Zhipu AI that achieves strong performance on visual understanding benchmarks. It uses a visual expert module to bridge vision and language representations without degrading the underlying language model capabilities.

What CogVLM Does

  • Answers questions about image content with detailed natural language responses
  • Performs visual grounding by identifying and locating objects referenced in text
  • Generates image captions and descriptions with contextual understanding
  • Supports multi-turn dialogue with persistent visual context
  • Handles OCR-like tasks including reading text from images and screenshots

Architecture Overview

CogVLM adds a trainable visual expert module to each transformer layer. The module has its own attention (QKV) and FFN weights, which are applied to the visual tokens, while text tokens continue to use the original language model weights. Attention is then computed jointly over both kinds of token. Because the original weights are left unchanged, the model gains visual understanding without losing language quality. The vision encoder extracts features from the input image and projects them into the visual tokens that the expert pathway processes.
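As a rough illustration of the visual expert idea, the sketch below routes each token through one of two weight matrices depending on its modality. It assumes image tokens use dedicated weights while text tokens use the original ones. The names `route_projection`, `w_text`, and `w_visual` are hypothetical, and the tiny pure-Python matrices stand in for real QKV or FFN weights; this is not the model's actual implementation.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Sequence[float]) -> Vector:
    """Multiply matrix w (rows x cols) by vector x (length cols)."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def route_projection(
    tokens: Sequence[Sequence[float]],
    is_visual: Sequence[bool],
    w_text: Matrix,
    w_visual: Matrix,
) -> List[Vector]:
    """Project each token with the weights that match its modality.

    Text tokens use the original language-model weights (assumption:
    left unchanged); image tokens use the visual-expert weights.
    """
    if len(tokens) != len(is_visual):
        raise ValueError("tokens and is_visual must have the same length")
    return [
        matvec(w_visual if visual else w_text, tok)
        for tok, visual in zip(tokens, is_visual)
    ]


# Toy example: 2-dimensional hidden states, one image token then one text token.
w_text = [[1.0, 0.0], [0.0, 1.0]]    # identity stands in for the LM weights
w_visual = [[2.0, 0.0], [0.0, 2.0]]  # scaling stands in for the expert weights
hidden = [[1.0, 3.0], [1.0, 3.0]]
projected = route_projection(hidden, [True, False], w_text, w_visual)
```

In this toy run the two identical hidden states come out different because only the first is flagged as visual. In a real layer the routed projections would feed a single attention operation over the whole sequence.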

Self-Hosting & Configuration

  • Requires a GPU with at least 40 GB VRAM for the full 19B model
  • INT4 quantized versions run on GPUs with 16 GB VRAM
  • Load models from Hugging Face with the trust_remote_code flag enabled
  • Configure generation parameters including temperature, top_p, and max output length
  • Deploy with Gradio for a web-based demo interface
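The loading and generation settings listed above might look roughly like the following sketch. The checkpoint name `THUDM/cogvlm2-llama3-chat-19B` is an assumption, as are the default values and the helper names `generation_kwargs` and `load_model`. How the image and prompt are turned into model inputs is defined by the checkpoint's own remote code and is deliberately left out here.

```python
from typing import Any, Dict

# Assumed checkpoint name; substitute the repository you actually use.
MODEL_ID = "THUDM/cogvlm2-llama3-chat-19B"


def generation_kwargs(
    temperature: float = 0.8, top_p: float = 0.9, max_new_tokens: int = 512
) -> Dict[str, Any]:
    """Collect sampling settings to pass to model.generate()."""
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1]")
    if temperature < 0.0:
        raise ValueError("temperature must be non-negative")
    # A temperature of 0 is treated as greedy decoding, so sampling is off.
    do_sample = temperature > 0.0
    kwargs: Dict[str, Any] = {"max_new_tokens": max_new_tokens, "do_sample": do_sample}
    if do_sample:
        kwargs.update(temperature=temperature, top_p=top_p)
    return kwargs


def load_model(model_id: str = MODEL_ID):
    """Load tokenizer and model; needs torch, transformers and a large GPU."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,  # the model class ships with the checkpoint
    )
    return tokenizer, model.to("cuda").eval()
```

Because `trust_remote_code=True` executes Python shipped with the checkpoint, it is worth reviewing that code, or pinning a specific revision, before enabling the flag on a shared machine.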

Key Features

  • Visual expert architecture preserves language model quality while adding vision capabilities
  • Reported state-of-the-art or near state-of-the-art results at release on 10+ multimodal benchmarks, including VQAv2 and POPE
  • Supports both image and video understanding in the CogVLM2-Video variant
  • Grounding mode outputs bounding boxes for referenced objects in images
  • Multiple model sizes from 8B to 19B parameters for different hardware profiles
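To use grounding output downstream, the boxes have to be pulled out of the generated text. The sketch below assumes each box is written as `[[x0,y0,x1,y1]]` with integer coordinates normalized to a 0 to 999 range; treat that format, and the helper name `parse_boxes`, as assumptions and check them against the output of the checkpoint you run. Replies that pack several boxes into one bracket pair are not handled.

```python
import re
from typing import List, Tuple

Box = Tuple[int, int, int, int]

# Matches one box written as [[x0,y0,x1,y1]] with integer coordinates.
_BOX_PATTERN = re.compile(r"\[\[(\d{1,3}),(\d{1,3}),(\d{1,3}),(\d{1,3})\]\]")


def parse_boxes(text: str, width: int, height: int, scale: int = 1000) -> List[Box]:
    """Extract boxes from model output and convert them to pixel coordinates."""
    boxes: List[Box] = []
    for match in _BOX_PATTERN.finditer(text):
        x0, y0, x1, y1 = (int(v) for v in match.groups())
        boxes.append((
            x0 * width // scale,
            y0 * height // scale,
            x1 * width // scale,
            y1 * height // scale,
        ))
    return boxes


# Hypothetical reply text, written by hand for illustration.
reply = "A dog [[100,200,500,800]] sits next to a ball [[600,700,700,800]]."
pixel_boxes = parse_boxes(reply, width=640, height=480)
```

For a 640 by 480 image, the first box maps to pixel corners (64, 96) and (320, 384).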

Comparison with Similar Tools

  • InternVL — Similar benchmark performance with more model size options; CogVLM has a unique expert architecture
  • LLaVA — Pioneering VLM with simpler architecture; CogVLM offers deeper visual reasoning
  • Qwen-VL — Strong bilingual support; CogVLM excels at visual grounding tasks
  • GPT-4V — Proprietary with broader capabilities; CogVLM is fully open-source and self-hostable

FAQ

Q: What image formats does CogVLM accept? A: Standard formats including JPEG, PNG, and BMP are supported through PIL image loading.

Q: Can CogVLM run on consumer GPUs? A: The INT4 quantized version runs on GPUs with 16 GB VRAM such as the RTX 4080.
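A back-of-the-envelope check shows where these VRAM figures come from. The sketch counts only the memory for the weights, assuming 19 billion parameters stored at 16 bits (BF16) or 4 bits (INT4). Activations, the KV cache, and quantization metadata come on top, so the real requirement is higher than these numbers.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Return the memory needed for the weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


bf16_gb = weight_memory_gb(19e9, 16)  # 16-bit weights: about 38 GB
int4_gb = weight_memory_gb(19e9, 4)   # 4-bit weights: about 9.5 GB
```

Roughly 38 GB of BF16 weights is consistent with the 40 GB guideline, and about 9.5 GB of INT4 weights leaves some headroom on a 16 GB card.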

Q: Does CogVLM support batch inference? A: Yes, multiple images can be processed in batches for higher throughput.

Q: Is CogVLM suitable for document understanding? A: It handles basic document and screenshot reading, though dedicated document models may perform better on complex layouts.
