# OpenCLIP — Open-Source Contrastive Language-Image Pre-training

> Community-driven reproduction and extension of OpenAI CLIP, providing open training code, datasets, and pretrained models for contrastive vision-language learning at scale.

## Install

Save the content below to `.claude/skills/` or append to your `CLAUDE.md`:

## Quick Use
```bash
pip install open_clip_torch
python -c "
import open_clip, torch
from PIL import Image
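model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='laion2b_s34b_b79k')
model.eval()  # inference mode: turns off training-only behavior such as dropout
tokenizer = open_clip.get_tokenizer('ViT-B-32')
image = preprocess(Image.open('photo.jpg')).unsqueeze(0)  # add a batch dimension
text = tokenizer(['a dog', 'a cat'])
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # L2-normalize so the dot product below is a cosine similarity
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    # Scale similarities by 100 (the usual CLIP logit scale) before the softmax
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)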
print(probs)
"
```

## Introduction
OpenCLIP is an open-source implementation of CLIP (Contrastive Language-Image Pre-training) that provides reproducible training pipelines and pretrained models trained on publicly available datasets such as LAION-2B. It supports zero-shot image classification and image-text retrieval, and its encoders serve as a foundation for multimodal AI applications.

## What OpenCLIP Does
- Trains vision-language models using contrastive learning on image-text pairs
- Provides pretrained models across multiple architectures (ViT-B, ViT-L, ViT-H, ViT-G)
- Enables zero-shot image classification without task-specific fine-tuning
- Generates aligned image and text embeddings for retrieval and similarity tasks
- Supports distributed training across multiple GPUs and nodes
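
To make the zero-shot and retrieval bullets concrete, here is a minimal, dependency-free sketch of how aligned embeddings are turned into label probabilities. The 3-dimensional vectors are made up for illustration and are not model outputs, and the fixed `logit_scale` of 100 is an assumption standing in for the temperature a trained model carries.

```python
import math

def normalize(vec):
    """Scale a vector to unit L2 length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and several texts."""
    img = normalize(image_emb)
    logits = [
        logit_scale * sum(a * b for a, b in zip(img, normalize(t)))
        for t in text_embs
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy embeddings (hypothetical values chosen so the image sits close to "a dog").
labels = ["a dog", "a cat"]
image_emb = [0.9, 0.1, 0.2]
text_embs = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.1]]

probs = zero_shot_probs(image_emb, text_embs)
best = labels[probs.index(max(probs))]
print(best, probs)
```

Retrieval follows the same pattern: rank candidates by cosine similarity instead of applying a softmax over a fixed label set.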

## Architecture Overview
OpenCLIP pairs a vision transformer (or CNN) image encoder with a text transformer encoder. Both encoders project their outputs into a shared embedding space via learned linear projections. Contrastive loss maximizes cosine similarity between matching image-text pairs while minimizing it for non-matching pairs within each batch. Training uses large-batch distributed optimization with gradient checkpointing and mixed precision.
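
The contrastive objective described above can be sketched in plain Python as a symmetric cross-entropy over the batch similarity matrix. This is an illustrative sketch, not OpenCLIP's implementation: the real code runs in PyTorch across GPUs and learns its logit scale, whereas `clip_loss` and its fixed `logit_scale` are hypothetical names and values used here for clarity.

```python
import math

def l2_normalize(vec):
    """Return the vector scaled to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]

def clip_loss(image_embs, text_embs, logit_scale=100.0):
    """Average of image-to-text and text-to-image cross-entropy.

    Pair k in image_embs matches pair k in text_embs, so the correct
    class for row k (and column k) of the similarity matrix is k.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    logits = [
        [logit_scale * sum(a * b for a, b in zip(i, t)) for t in txts]
        for i in imgs
    ]
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_i2t = sum(-log_softmax(logits[k])[k] for k in range(n)) / n
    loss_t2i = sum(-log_softmax(columns[k])[k] for k in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Matching pairs on the diagonal should give a lower loss than mismatched pairs.
aligned = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned, shuffled)
```

Because every other item in the batch acts as a negative, larger batches give each pair more non-matching examples to contrast against, which is one reason training favors large-batch distributed setups.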

## Self-Hosting & Configuration
- Install via pip: `pip install open_clip_torch`
- Download pretrained models automatically via model name and pretrained tag
- Training at scale typically uses a multi-GPU setup; data is supplied as WebDataset tar shards or CSV files of image paths and captions
- Configure architecture, dataset, batch size, and learning rate via CLI arguments
- Supports FSDP and DeepSpeed for scaling to billions of training samples
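
As a rough illustration of configuring a run through CLI arguments, the sketch below assembles a multi-GPU launch command as a string. The helper `build_train_command`, the shard path, and all numeric values are hypothetical. The entry-point module name varies between OpenCLIP releases (`open_clip_train.main` is assumed here), so check the flags against `--help` for the version you have installed.

```python
import shlex

def build_train_command(
    train_data,
    num_samples,
    model="ViT-B-32",
    batch_size=256,
    lr=1e-3,
    epochs=32,
    gpus=4,
    module="open_clip_train.main",  # assumption: module name depends on the release
):
    """Assemble a torchrun command line for a WebDataset training run."""
    args = [
        "torchrun", "--nproc_per_node", str(gpus),
        "-m", module,
        "--train-data", train_data,
        "--dataset-type", "webdataset",
        "--train-num-samples", str(num_samples),
        "--model", model,
        "--batch-size", str(batch_size),
        "--lr", str(lr),
        "--epochs", str(epochs),
        "--precision", "amp",        # mixed precision
        "--grad-checkpointing",      # trade compute for memory
    ]
    # shlex.join quotes the brace pattern so the shell passes it through unexpanded.
    return shlex.join(args)

# Hypothetical shard pattern and sample count, for illustration only.
cmd = build_train_command("/data/shards/{00000..00099}.tar", num_samples=1_000_000)
print(cmd)
```

Building the command in code keeps experiment settings in one place; the printed string can be pasted into a shell or handed to a job scheduler.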

## Key Features
- Fully open training code reproducing CLIP results on public data
- Model zoo with checkpoints trained on LAION-400M, LAION-2B, and DataComp
- Zero-shot transfer to downstream tasks without fine-tuning
- CoCa (Contrastive Captioner) models that combine contrastive and captioning objectives
- Integration with Hugging Face model hub for easy model sharing

## Comparison with Similar Tools
- **OpenAI CLIP** — original model with closed training data; OpenCLIP uses public datasets
- **SigLIP** — Google's sigmoid-loss variant; available through OpenCLIP's codebase
- **BLIP-2** — adds generative capabilities on top of frozen image encoders
- **EVA-CLIP** — enhanced training recipes for CLIP models at larger scale
- **MetaCLIP** — Meta's data curation approach for CLIP training

## FAQ
**Q: How does OpenCLIP differ from the original CLIP?**
A: OpenCLIP provides open training code and models trained on publicly available datasets, while OpenAI CLIP was trained on proprietary data. Some OpenCLIP models match or exceed original CLIP performance.

**Q: What is the largest available model?**
A: ViT-bigG-14 trained on LAION-2B, achieving strong zero-shot performance across benchmarks.

**Q: Can I fine-tune OpenCLIP on my own data?**
A: Yes. The training scripts support both from-scratch training and fine-tuning from pretrained checkpoints.

**Q: What formats are supported for training data?**
A: WebDataset tar files with image-text pairs, or CSV files pointing to image paths and captions.
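
For a sense of what those two formats look like on disk, here is a small sketch that writes one tar shard and one CSV file with the standard library. The image bytes are placeholders rather than real JPEG data, and the file names are hypothetical. The tab separator and the `filepath` / `title` column names are assumed to match the training script's defaults; if your version differs, the `--csv-separator`, `--csv-img-key`, and `--csv-caption-key` flags can override them.

```python
import csv
import io
import tarfile
import tempfile
from pathlib import Path

def write_shard(path, samples):
    """Write (key, image_bytes, caption) triples into one WebDataset-style shard.

    Files that share a basename (000000.jpg, 000000.txt) form one sample.
    """
    with tarfile.open(path, "w") as tar:
        for key, image_bytes, caption in samples:
            for suffix, payload in ((".jpg", image_bytes), (".txt", caption.encode("utf-8"))):
                info = tarfile.TarInfo(name=key + suffix)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))

def write_csv(path, rows):
    """Write (image_path, caption) rows as a tab-separated file with a header."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["filepath", "title"])
        writer.writerows(rows)

workdir = Path(tempfile.mkdtemp())

shard_path = workdir / "shard-00000.tar"
write_shard(shard_path, [
    ("000000", b"<placeholder image bytes>", "a dog on a beach"),
    ("000001", b"<placeholder image bytes>", "a cat on a sofa"),
])

csv_path = workdir / "train.csv"
write_csv(csv_path, [
    ("images/000000.jpg", "a dog on a beach"),
    ("images/000001.jpg", "a cat on a sofa"),
])

with tarfile.open(shard_path) as tar:
    members = tar.getnames()
print(members)
```

Tar shards suit large datasets because they stream sequentially from disk or object storage, while a CSV is usually simpler for small fine-tuning sets.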

## Sources
- https://github.com/mlfoundations/open_clip
- https://laion.ai/blog/large-openclip/

---
Source: https://tokrepo.com/en/workflows/openclip-open-source-contrastive-language-image-pre-training-cc727315
Author: AI Open Source