Introduction
LLaVA (Large Language-and-Vision Assistant) is a multimodal AI model that combines a pre-trained CLIP vision encoder with a large language model through a simple projection layer. It enables conversational interactions about images, documents, and visual content.
What LLaVA Does
- Answers open-ended questions about images in natural language
- Describes, reasons about, and analyzes visual content in detail
- Supports multi-turn conversations with image context retention
- Provides a web-based demo, CLI interface, and API server
- Offers multiple model sizes from 7B to 34B parameters
Architecture Overview
LLaVA connects a frozen CLIP ViT-L/14 vision encoder to a LLaMA or Vicuna language model via a trainable linear projection layer. Image features are projected into the language model's token embedding space and concatenated with text tokens. Training proceeds in two stages: first aligning vision-language features on image-caption pairs, then instruction tuning on multimodal conversations.
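The projection-and-concatenation step can be sketched in a few lines of NumPy. This is an illustrative toy, not the real implementation: the dimensions below are plausible assumptions (CLIP ViT-L/14 patch width, a LLaMA-scale hidden size), and a single linear map is shown even though LLaVA-1.5 replaces it with a small two-layer MLP.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not read from any checkpoint):
D_VISION = 1024   # CLIP ViT-L/14 patch-feature width
D_LM = 4096       # language model hidden size
N_PATCHES = 576   # e.g. a 24x24 patch grid
N_TEXT = 8        # text prompt length in tokens

# Frozen CLIP patch features for one image, and the LM's text-token embeddings.
image_features = rng.standard_normal((N_PATCHES, D_VISION))
text_embeddings = rng.standard_normal((N_TEXT, D_LM))

# The trainable piece: a linear projection into the LM's embedding space.
W = rng.standard_normal((D_VISION, D_LM)) * 0.02
b = np.zeros(D_LM)
image_tokens = image_features @ W + b          # shape (N_PATCHES, D_LM)

# Visual tokens are concatenated with the text tokens and fed to the LM.
lm_input = np.concatenate([image_tokens, text_embeddings], axis=0)
print(lm_input.shape)  # (584, 4096)
```

During stage one only `W` and `b` would be trained; in stage two the language model is unfrozen as well, while the vision encoder stays frozen throughout.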
Self-Hosting & Configuration
- Requires Python 3.10+ and PyTorch 2.0+
- Model weights are available on Hugging Face in multiple sizes
- The 7B model needs roughly 16 GB of VRAM in half precision; the 13B model needs roughly 28 GB
- Gradio web UI available via `python -m llava.serve.gradio_web_server`
- Supports 4-bit and 8-bit quantization for reduced memory usage
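As a rough sanity check on the VRAM figures above, weight memory scales linearly with parameter count and bit width. The helper below is a back-of-the-envelope estimate, not a measurement; the 20% overhead factor is an assumption standing in for activations and KV cache, and real usage varies with batch size and context length.

```python
def vram_estimate_gb(n_params_billion: float, bits: int, overhead: float = 1.2) -> float:
    """Rough VRAM estimate: parameters * bytes-per-parameter * overhead.

    `overhead` (assumed 1.2 here) approximates activation and KV-cache
    memory on top of the raw weights.
    """
    weight_bytes = n_params_billion * 1e9 * bits / 8
    return weight_bytes * overhead / 2**30  # GiB

for bits in (16, 8, 4):
    print(f"7B @ {bits}-bit: ~{vram_estimate_gb(7, bits):.1f} GB")
```

At 16-bit this lands near the ~16 GB figure quoted for the 7B model, and it shows why 4-bit quantization brings the same model within reach of consumer GPUs.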
Key Features
- Simple two-stage training: alignment then instruction tuning
- Achieves competitive scores on MMBench, SEED-Bench, and other multimodal benchmarks
- Multiple model variants: LLaVA-1.5, LLaVA-1.6 (LLaVA-NeXT) with dynamic resolution
- Efficient training requiring only 1 day on 8x A100 GPUs for the full pipeline
- Supports both Gradio web interface and OpenAI-compatible API serving
Comparison with Similar Tools
- GPT-4V — proprietary multimodal model with broader capabilities; LLaVA is fully open source and self-hostable
- InternVL — strong open-source alternative with different vision encoder choices
- Qwen-VL — Alibaba's multimodal model; competitive performance with different training data
- MiniGPT-4 — earlier open multimodal approach; LLaVA offers simpler architecture and better performance
FAQ
Q: Can LLaVA process video? A: The base model handles single images. LLaVA-NeXT-Video extends the architecture to video frames.
Q: What languages does LLaVA support? A: Primarily English, though the underlying LLM may handle other languages with reduced quality.
Q: Can I fine-tune LLaVA on custom data? A: Yes. The repository includes scripts for both stages of training on custom image-text datasets.
Q: How does LLaVA compare to commercial APIs? A: LLaVA-1.5-13B achieves results competitive with early GPT-4V on several benchmarks while running locally.