Configs · Jun 2, 2026 · 3 min read

CogVLM — Open Visual Language Model with Deep Visual Understanding

A multimodal pretrained model that excels at image captioning, visual question answering, and visual grounding with strong benchmark performance.

Agent-ready

Agent installation ready

This asset can be installed once you have chosen a runtime, reviewed the plan, and run the matching command.

Native · 98/100 · Policy: allow
Agent surface: Any MCP/CLI agent
Type: Skill
Installation: Single
Trust: Established
Entry point: CogVLM Overview
Direct install command:
npx -y tokrepo@latest install 5c1dd4f8-5e1a-11f1-9bc6-00163e2b0d79 --target codex

Run this only after confirming the plan in a dry run.

Introduction

CogVLM is an open-source visual language model from Tsinghua University and Zhipu AI that achieves strong performance on visual understanding benchmarks. It uses a visual expert module to bridge vision and language representations without degrading the underlying language model capabilities.

What CogVLM Does

  • Answers questions about image content with detailed natural language responses
  • Performs visual grounding by identifying and locating objects referenced in text
  • Generates image captions and descriptions with contextual understanding
  • Supports multi-turn dialogue with persistent visual context
  • Handles OCR-like tasks including reading text from images and screenshots

Architecture Overview

CogVLM introduces a trainable visual expert module in each transformer layer that processes visual tokens through dedicated attention and FFN weights. This design preserves the original language model weights while adding visual understanding capacity through the expert pathway. The vision encoder extracts features from input images and projects them as visual tokens that flow through both standard and expert pathways in parallel.
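
The routing idea described above can be illustrated with a toy sketch. It is not CogVLM's implementation: the names `route_tokens`, `W_TEXT`, and `W_EXPERT` are hypothetical, the weights are tiny hand-written matrices, and only a single projection is shown, whereas the real model applies this kind of split to the attention and FFN weights in every layer.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix w (rows x cols) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def route_tokens(tokens: List[Vector], is_visual: List[bool],
                 w_text: Matrix, w_expert: Matrix) -> List[Vector]:
    """Project visual tokens with the expert weights and text tokens
    with the original language-model weights."""
    if len(tokens) != len(is_visual):
        raise ValueError("tokens and is_visual must have the same length")
    return [matvec(w_expert if vis else w_text, tok)
            for tok, vis in zip(tokens, is_visual)]


# Toy data (assumed for illustration): hidden size 2, two visual tokens, one text token.
W_TEXT = [[1.0, 0.0], [0.0, 1.0]]    # stands in for the unchanged language-model projection
W_EXPERT = [[2.0, 0.0], [0.0, 2.0]]  # stands in for the trainable visual-expert projection
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
mask = [True, True, False]

routed = route_tokens(tokens, mask, W_TEXT, W_EXPERT)
print(routed)
```

Because text tokens never touch the expert weights, text-only inputs behave exactly as they would in the underlying language model, which is the property the design is meant to preserve.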

Self-Hosting & Configuration

  • Requires a GPU with at least 40 GB of VRAM to run the 19B model without quantization
  • INT4-quantized versions run on GPUs with 16 GB of VRAM
  • Load models from Hugging Face with trust_remote_code=True, because the modeling code ships in the model repository
  • Configure generation parameters including temperature, top_p, and max output length
  • Deploy with Gradio for a web-based demo interface
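
As a rough sketch of how the settings above might fit together, the helpers below pick a precision from the VRAM figures in this list, validate sampling parameters, and load a checkpoint through the Hugging Face `transformers` library. The function names and default values are assumptions made for illustration, and the repository ID is deliberately left as a parameter: take it from the model card rather than from this example.

```python
from typing import Any, Dict


def pick_precision(vram_gb: float) -> str:
    """Map available VRAM to a precision using the thresholds listed above."""
    if vram_gb >= 40:
        return "full"
    if vram_gb >= 16:
        return "int4"
    raise ValueError(f"{vram_gb} GB is below the 16 GB listed for INT4")


def generation_kwargs(temperature: float = 0.8, top_p: float = 0.9,
                      max_new_tokens: int = 512) -> Dict[str, Any]:
    """Validate sampling settings and return keyword arguments for model.generate.

    The defaults are placeholders, not recommended values.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in (0, 1]")
    if max_new_tokens < 1:
        raise ValueError("max_new_tokens must be at least 1")
    return {
        "do_sample": True,  # temperature and top_p only apply when sampling
        "temperature": temperature,
        "top_p": top_p,
        "max_new_tokens": max_new_tokens,
    }


def load_model(model_id: str):
    """Load a checkpoint whose modeling code lives in its Hugging Face repo."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        model_id,
        trust_remote_code=True,  # runs Python code from the model repository
        torch_dtype="auto",      # use the dtype stored in the checkpoint config
    )


print(pick_precision(24), generation_kwargs(temperature=0.7))
```

Since `trust_remote_code=True` executes code downloaded from the repository, it is worth reviewing that code or pinning a revision before loading.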

Key Features

  • Visual expert architecture preserves language model quality while adding vision capabilities
  • Reported state-of-the-art results at release on 10+ multimodal benchmarks, including VQAv2 and POPE
  • Extends from image to video understanding in the CogVLM2-Video variant
  • Grounding mode outputs bounding boxes for referenced objects in images
  • Multiple model sizes from 8B to 19B parameters for different hardware profiles
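
To show how grounding output might be consumed downstream, the sketch below extracts bounding boxes from a model reply and converts them to pixel coordinates. It assumes boxes appear inline as [[x0,y0,x1,y1]] with coordinates normalized to a 0 to 999 range; check that assumption against the model card for the checkpoint in use. The `parse_boxes` helper and the sample reply are made up for illustration.

```python
import re
from typing import List, Tuple

Box = Tuple[int, int, int, int]

# Assumed format: [[x0,y0,x1,y1]] with integer coordinates.
BOX_PATTERN = re.compile(r"\[\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]\]")


def parse_boxes(text: str, width: int, height: int, scale: int = 1000) -> List[Box]:
    """Find normalized boxes in text and rescale them to pixel coordinates."""
    boxes = []
    for match in BOX_PATTERN.finditer(text):
        x0, y0, x1, y1 = (int(v) for v in match.groups())
        boxes.append((
            round(x0 * width / scale),
            round(y0 * height / scale),
            round(x1 * width / scale),
            round(y1 * height / scale),
        ))
    return boxes


# Hypothetical reply for a 640x480 image.
reply = "A dog [[100,200,500,800]] sits beside a ball [[600,700,750,850]]."
boxes = parse_boxes(reply, width=640, height=480)
print(boxes)
```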

Comparison with Similar Tools

  • InternVL — Similar benchmark performance with more model size options; CogVLM has a unique expert architecture
  • LLaVA — Pioneering VLM with simpler architecture; CogVLM offers deeper visual reasoning
  • Qwen-VL — Strong bilingual support; CogVLM excels at visual grounding tasks
  • GPT-4V — Proprietary with broader capabilities; CogVLM is fully open-source and self-hostable

FAQ

Q: What image formats does CogVLM accept? A: Standard formats including JPEG, PNG, and BMP are supported through PIL image loading.

Q: Can CogVLM run on consumer GPUs? A: The INT4 quantized version runs on GPUs with 16 GB VRAM such as the RTX 4080.

Q: Does CogVLM support batch inference? A: Yes, multiple images can be processed in batches for higher throughput.
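
One way to prepare inputs for batched inference is to split the image list into fixed-size chunks first. This is a minimal sketch in plain Python: `make_batches` is a hypothetical helper, the file names are placeholders, and the model call itself is omitted.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def make_batches(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Placeholder image paths; the last batch may be smaller than the others.
paths = [f"img_{i}.jpg" for i in range(5)]
batches = list(make_batches(paths, 2))
print(batches)
```

The batch size that fits depends on image resolution and available VRAM, so it usually has to be tuned per deployment.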

Q: Is CogVLM suitable for document understanding? A: It handles basic document and screenshot reading, though dedicated document models may perform better on complex layouts.
