Scripts · May 21, 2026 · 3 min read

DINOv2 — Self-Supervised Visual Features by Meta

DINOv2 produces universal visual features via self-supervised learning on curated data, providing strong image representations for classification, segmentation, and depth estimation without fine-tuning.

Agent ready

Native · 98/100 · Policy: allow
Agent surface: Any MCP/CLI agent
Kind: Skill
Install: Single
Trust: Established
Entrypoint: DINOv2 Overview
Universal CLI install command
npx tokrepo install b1a4009a-54cb-11f1-9bc6-00163e2b0d79

Introduction

DINOv2 is a family of self-supervised vision transformer models from Meta AI that learn powerful visual features without any labeled data. Trained on a curated dataset of 142 million images, DINOv2 models serve as versatile visual backbones for tasks ranging from image classification to monocular depth estimation.

What DINOv2 Does

  • Produces general-purpose visual features that transfer to many downstream tasks
  • Supports image classification, semantic segmentation, depth estimation, and retrieval
  • Provides models in four sizes: ViT-S/14, ViT-B/14, ViT-L/14, and ViT-g/14
  • Offers dense patch-level features useful for pixel-level tasks
  • Works out of the box as a frozen feature extractor with a simple linear head
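The "frozen extractor plus linear head" pattern can be sketched without any deep-learning library. The example below is a minimal illustration, not the DINOv2 evaluation code: the two-dimensional vectors in FEATURES are toy stand-ins for embeddings that would normally come from the frozen backbone, and train_linear_head and predict are hypothetical helper names.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def scores(W, b, x):
    """Linear head: one logit per class."""
    return [sum(w_j * x_j for w_j, x_j in zip(w, x)) + b_k for w, b_k in zip(W, b)]

def train_linear_head(features, labels, num_classes, lr=0.5, epochs=200, seed=0):
    """Fit W and b by SGD on cross-entropy; the features themselves never change."""
    rng = random.Random(seed)
    dim = len(features[0])
    W = [[rng.uniform(-0.01, 0.01) for _ in range(dim)] for _ in range(num_classes)]
    b = [0.0] * num_classes
    for _ in range(epochs):
        for x, y in zip(features, labels):
            probs = softmax(scores(W, b, x))
            for k in range(num_classes):
                grad = probs[k] - (1.0 if k == y else 0.0)  # d(loss)/d(logit_k)
                b[k] -= lr * grad
                for j in range(dim):
                    W[k][j] -= lr * grad * x[j]
    return W, b

def predict(W, b, x):
    """Index of the highest-scoring class."""
    logits = scores(W, b, x)
    return max(range(len(logits)), key=logits.__getitem__)

# Toy stand-ins for frozen embeddings (real ViT-B/14 embeddings have 768 dimensions).
FEATURES = [[1.0, 0.1], [0.9, 0.2], [0.8, 0.0], [0.1, 1.0], [0.2, 0.9], [0.0, 0.8]]
LABELS = [0, 0, 0, 1, 1, 1]

W, b = train_linear_head(FEATURES, LABELS, num_classes=2)
```

Only W and b are trained here; with a real backbone the embeddings would be computed once, cached, and reused across epochs.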

Architecture Overview

DINOv2 uses Vision Transformer (ViT) architectures with a 14x14 patch size. Training combines a self-distillation loss (student-teacher framework from DINO) with a masked image modeling loss (inspired by iBOT). The teacher network is updated via exponential moving average of the student weights. A key contribution is the automated data curation pipeline that builds a high-quality 142M image dataset from uncurated web data using self-supervised retrieval and deduplication.
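Two pieces of the training recipe described above are easy to illustrate in isolation: the exponential-moving-average teacher update and the self-distillation cross-entropy between teacher and student outputs. The sketch below is a simplified illustration under stated assumptions, not the released training code. Parameters are plain floats rather than tensors, the momentum and temperature values are illustrative choices, and teacher-output centering and the masked-patch loss are omitted.

```python
import math

def ema_update(teacher, student, momentum):
    """teacher <- momentum * teacher + (1 - momentum) * student, per parameter."""
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }

def softmax(logits, temperature):
    """Temperature-scaled, numerically stable softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits,
                      student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy H(teacher, student); the teacher output acts as a fixed target.

    The temperatures are illustrative assumptions. A lower teacher temperature
    sharpens the target distribution.
    """
    p_teacher = softmax(teacher_logits, teacher_temp)
    p_student = softmax(student_logits, student_temp)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))

# One EMA step on a single scalar "weight".
teacher = ema_update({"w": 1.0}, {"w": 0.0}, momentum=0.9)

# The loss is small when the student agrees with the teacher, large otherwise.
loss_agree = distillation_loss([2.0, 0.0], [2.0, 0.0])
loss_disagree = distillation_loss([0.0, 2.0], [2.0, 0.0])
```

Because the teacher is never updated by gradients, only the student receives a training signal from this loss; the teacher follows it slowly through ema_update.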

Self-Hosting & Configuration

  • Install PyTorch 2.0+ and load models via torch.hub or Hugging Face
  • ViT-B/14 (86M parameters) runs on consumer GPUs with 4 GB VRAM for inference
  • ViT-g/14 (1.1B parameters) needs 8 GB or more of VRAM for inference
  • Checkpoints were trained at 518x518 resolution (a 37x37 grid of 14x14 patches); other input sizes work when height and width are multiples of 14
  • Registers variant (dinov2_vitb14_reg) adds register tokens for smoother feature maps
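A minimal loading sketch via torch.hub follows, together with a back-of-the-envelope estimate of weight memory from the parameter counts quoted above. The helper names (weight_memory_gb, load_backbone, embed) are hypothetical, and the estimate covers weights only; activations and framework overhead come on top. PyTorch is imported lazily so the arithmetic runs even where it is not installed.

```python
# Parameter counts as quoted in the list above.
PARAMS = {"dinov2_vitb14": 86e6, "dinov2_vitg14": 1.1e9}

def weight_memory_gb(num_params, bytes_per_param=4):
    """Memory for the weights alone, in GB (10**9 bytes). 4 bytes = fp32, 2 = fp16."""
    return num_params * bytes_per_param / 1e9

def load_backbone(name="dinov2_vitb14"):
    """Load a checkpoint from torch.hub (needs PyTorch and, on first use, network access)."""
    import torch  # lazy import: only required when a model is actually loaded
    model = torch.hub.load("facebookresearch/dinov2", name)
    return model.eval()

def embed(model, batch):
    """Return one global embedding per image.

    `batch` is assumed to be a float tensor of shape (B, 3, H, W), normalized
    with ImageNet statistics, with H and W multiples of 14.
    """
    import torch
    with torch.no_grad():
        return model(batch)

vitb_fp32 = weight_memory_gb(PARAMS["dinov2_vitb14"])
vitg_fp32 = weight_memory_gb(PARAMS["dinov2_vitg14"])
vitg_fp16 = weight_memory_gb(PARAMS["dinov2_vitg14"], bytes_per_param=2)
```

Under these assumptions the ViT-B/14 weights take roughly 0.34 GB in fp32 and the ViT-g/14 weights roughly 4.4 GB in fp32 or 2.2 GB in fp16, which leaves headroom for activations within the VRAM figures listed above.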

Key Features

  • State-of-the-art self-supervised visual features across multiple benchmarks
  • Frozen features are competitive with fine-tuned, task-specific models on many tasks, though not all
  • Self-supervised pretraining needs no labels; the curation pipeline selects images by visual similarity rather than by annotations
  • Dense patch features enable pixel-level downstream applications
  • Register tokens reduce artifacts in attention maps for dense prediction
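For dense prediction, the flat token sequence from the backbone has to be split and the patch tokens reshaped into a spatial grid. The sketch below assumes the sequence is laid out as one class token, then the register tokens, then the patch tokens in row-major order, and that the register checkpoints use four register tokens; both are assumptions to verify against the model you load. split_tokens is a hypothetical helper, and plain Python lists stand in for tensors.

```python
def split_tokens(tokens, grid_h, grid_w, num_registers=0):
    """Split [CLS, registers..., patches...] and reshape patches to grid_h x grid_w."""
    expected = 1 + num_registers + grid_h * grid_w
    if len(tokens) != expected:
        raise ValueError(f"expected {expected} tokens, got {len(tokens)}")
    cls_token = tokens[0]
    registers = tokens[1:1 + num_registers]
    patches = tokens[1 + num_registers:]
    # Row-major reshape: row r holds patches r*grid_w .. (r+1)*grid_w - 1.
    grid = [patches[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]
    return cls_token, registers, grid

# Integers stand in for token vectors: 1 CLS + 4 registers + a 2x3 patch grid.
tokens = list(range(1 + 4 + 2 * 3))
cls_token, registers, grid = split_tokens(tokens, grid_h=2, grid_w=3, num_registers=4)
```

Register tokens are discarded for downstream use; only the class token and the patch grid carry the features consumed by a classification or dense head.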

Comparison with Similar Tools

  • CLIP — contrastive vision-language model with text alignment but less spatial detail
  • MAE — masked autoencoder learns good features but requires fine-tuning for best results
  • SAM — segment anything model focuses on segmentation masks rather than general features
  • EVA-02 — similar ViT backbone with CLIP distillation and masked modeling
  • SigLIP — sigmoid-based contrastive learning with strong zero-shot but weaker dense features

FAQ

Q: Can DINOv2 be used for zero-shot classification? A: Not on its own, since it has no text alignment. Either train a linear classifier on the extracted features or pair DINOv2 with a text encoder that has been aligned to its features.

Q: What resolution should input images be? A: The default resolution is 518x518 pixels, which gives a 37x37 grid of 14x14 patches (1,369 patches in total). Other resolutions work but may affect feature quality.
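The resolution arithmetic is simple enough to check in code: 518 / 14 = 37 patches per side, so 37 x 37 = 1,369 patches per image. The sketch below uses hypothetical helper names and assumes you want to round an arbitrary side length down to the nearest multiple of the patch size before resizing.

```python
PATCH = 14  # patch side length in pixels

def patch_grid(height, width, patch=PATCH):
    """Return (rows, cols) of the patch grid; both sides must be multiples of `patch`."""
    if height % patch or width % patch:
        raise ValueError(f"{height}x{width} is not a multiple of {patch}")
    return height // patch, width // patch

def snap_to_patch_multiple(size, patch=PATCH):
    """Round a side length down to a multiple of `patch`, keeping at least one patch."""
    return max(patch, (size // patch) * patch)

rows, cols = patch_grid(518, 518)
num_patches = rows * cols
```

For example, snap_to_patch_multiple(500) gives 490, which corresponds to a 35x35 grid.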

Q: How does DINOv2 compare to CLIP for retrieval? A: DINOv2 excels at visual similarity retrieval based on appearance. CLIP is better when you need semantic text-image matching.
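Appearance-based retrieval usually comes down to ranking gallery embeddings by cosine similarity to a query embedding. The sketch below is a minimal illustration: the three-dimensional vectors are toy stand-ins for real embeddings, and rank_gallery is a hypothetical helper name.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_gallery(query, gallery):
    """Return gallery keys ordered from most to least similar to the query."""
    sims = {key: cosine_similarity(query, vec) for key, vec in gallery.items()}
    return sorted(sims, key=sims.get, reverse=True)

# Toy embeddings; in practice each vector would be a DINOv2 image embedding.
gallery = {
    "a": [0.9, 0.1, 0.0],
    "b": [0.0, 1.0, 0.0],
    "c": [0.5, 0.5, 0.0],
}
ranking = rank_gallery([1.0, 0.0, 0.0], gallery)
```

For large galleries the same ranking is typically computed with a matrix product over L2-normalized embeddings or with an approximate nearest-neighbor index.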

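Q: Are the models available on Hugging Face? A: Yes, the DINOv2 model variants are published on the Hugging Face Hub under the facebook organization (for example, facebook/dinov2-base). The source code lives in the facebookresearch/dinov2 repository on GitHub.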
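A hedged sketch of loading a checkpoint through the transformers library is shown below. The model IDs are written as they are believed to appear on the Hub (facebook/dinov2-base and its siblings), so confirm them there before relying on them. load_from_hub and extract are hypothetical helper names, and the heavy imports are deferred so that the ID table can be used without transformers installed.

```python
# Assumed mapping from model size to Hugging Face Hub ID; verify on the Hub.
HF_MODEL_IDS = {
    "ViT-S/14": "facebook/dinov2-small",
    "ViT-B/14": "facebook/dinov2-base",
    "ViT-L/14": "facebook/dinov2-large",
    "ViT-g/14": "facebook/dinov2-giant",
}

def load_from_hub(size="ViT-B/14"):
    """Download the image processor and model (needs transformers, torch, and network)."""
    from transformers import AutoImageProcessor, AutoModel  # lazy import
    model_id = HF_MODEL_IDS[size]
    processor = AutoImageProcessor.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id).eval()
    return processor, model

def extract(processor, model, image):
    """Return (class_embedding, patch_embeddings) for one PIL image."""
    import torch
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    hidden = outputs.last_hidden_state  # shape (1, 1 + num_patches, dim)
    return hidden[:, 0], hidden[:, 1:]
```

Note that the bundled image processor applies its own resize and crop settings, so the number of patch tokens depends on its configuration rather than on the original image size.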
