Scripts · May 21, 2026 · 3 min read

DINOv2 — Self-Supervised Visual Features by Meta

DINOv2 produces universal visual features via self-supervised learning on curated data, providing strong image representations for classification, segmentation, and depth estimation without fine-tuning.

Agent-ready

This asset can be read and installed directly by agents

TokRepo exposes a universal CLI command, an install contract, JSON metadata, an adapter-specific plan, and the raw content to help agents assess fit, risk, and next steps.

Native · 98/100 · Policy: allow
Agent surface: Any MCP/CLI agent
Type: Skill
Installation: Single
Trust: Established
Entry point: DINOv2 Overview
Universal CLI command: npx tokrepo install b1a4009a-54cb-11f1-9bc6-00163e2b0d79

Introduction

DINOv2 is a family of self-supervised vision transformer models from Meta AI that learn powerful visual features without any labeled data. Trained on a curated dataset of 142 million images, DINOv2 models serve as versatile visual backbones for tasks ranging from image classification to monocular depth estimation.

What DINOv2 Does

  • Produces general-purpose visual features that transfer to many downstream tasks
  • Supports image classification, semantic segmentation, depth estimation, and retrieval
  • Provides models in four sizes: ViT-S/14, ViT-B/14, ViT-L/14, and ViT-g/14
  • Offers dense patch-level features useful for pixel-level tasks
  • Works out of the box as a frozen feature extractor with a simple linear head
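
The frozen-backbone-plus-linear-head setup from the last bullet could look roughly like the sketch below. It assumes the torch.hub entrypoints from the facebookresearch/dinov2 repository and uses only the CLS embedding as the feature; the helper names `build_linear_probe` and `probe_logits` are hypothetical, and published linear-probe results may use a richer feature (for example CLS concatenated with pooled patch tokens).

```python
# Embedding width (CLS token dimension) of each backbone size.
EMBED_DIMS = {
    "dinov2_vits14": 384,
    "dinov2_vitb14": 768,
    "dinov2_vitl14": 1024,
    "dinov2_vitg14": 1536,
}


def build_linear_probe(entrypoint: str, num_classes: int):
    """Return (frozen backbone, trainable linear head).

    Downloads weights on first use. Hypothetical helper, not part of DINOv2.
    """
    import torch  # imported lazily so the table above works without PyTorch

    backbone = torch.hub.load("facebookresearch/dinov2", entrypoint)
    backbone.eval()
    for param in backbone.parameters():
        param.requires_grad_(False)  # the backbone stays frozen
    head = torch.nn.Linear(EMBED_DIMS[entrypoint], num_classes)
    return backbone, head


def probe_logits(backbone, head, images):
    """images: float tensor (B, 3, H, W), normalised, H and W multiples of 14."""
    import torch

    with torch.no_grad():
        feats = backbone(images)  # (B, embed_dim) CLS embedding
    return head(feats)  # only the head receives gradients
```

Only `head.parameters()` would be passed to the optimizer, so training cost is dominated by a single forward pass through the backbone per batch.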

Architecture Overview

DINOv2 uses Vision Transformer (ViT) architectures with a 14x14 patch size. Training combines a self-distillation loss (student-teacher framework from DINO) with a masked image modeling loss (inspired by iBOT). The teacher network is updated via exponential moving average of the student weights. A key contribution is the automated data curation pipeline that builds a high-quality 142M image dataset from uncurated web data using self-supervised retrieval and deduplication.
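
The exponential-moving-average teacher update described above can be illustrated with a minimal sketch on plain Python lists. The function name `ema_update` and the momentum values are illustrative assumptions, not the training configuration used by the authors.

```python
def ema_update(teacher, student, momentum=0.996):
    """Return new teacher weights: m * teacher + (1 - m) * student, elementwise."""
    if len(teacher) != len(student):
        raise ValueError("teacher and student must have the same number of weights")
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


# Toy run with a deliberately low momentum so the drift is visible.
teacher = [1.0, -2.0]
student = [0.0, 2.0]
for _ in range(3):
    teacher = ema_update(teacher, student, momentum=0.5)
print(teacher)  # [0.125, 1.5]
```

With a momentum close to 1, the teacher changes slowly and acts as a smoothed average of recent student states, which is what makes it a stable target for self-distillation.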

Self-Hosting & Configuration

  • Install PyTorch 2.0+ and load models via torch.hub or Hugging Face
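  • ViT-B/14 (86M parameters) runs inference on consumer GPUs with about 4 GB of VRAM
  • ViT-g/14 (1.1B parameters) needs roughly 8 GB or more of VRAM for inference
  • The pretrained position embeddings correspond to 518x518 inputs (a 37x37 patch grid); the torch.hub models also take other sizes whose height and width are multiples of 14
  • The register variants (for example dinov2_vitb14_reg) add register tokens for smoother feature maps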
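
As a rough sketch of the torch.hub route from the list above: the helper names `hub_entrypoint` and `extract_features` are hypothetical, and the random tensor stands in for a real normalised image. The entrypoint names and the `forward_features` output keys follow the facebookresearch/dinov2 repository as far as I know, so treat them as assumptions to check against the version you install.

```python
SIZES = ("s", "b", "l", "g")


def hub_entrypoint(size: str, registers: bool = False) -> str:
    """Build a torch.hub entrypoint name such as 'dinov2_vitb14_reg'."""
    if size not in SIZES:
        raise ValueError(f"size must be one of {SIZES}, got {size!r}")
    return f"dinov2_vit{size}14" + ("_reg" if registers else "")


def extract_features(size: str = "b", registers: bool = False, side: int = 518):
    """Run one random image through the backbone (downloads weights on first use)."""
    import torch  # lazy import: the name helper above has no dependencies

    model = torch.hub.load("facebookresearch/dinov2", hub_entrypoint(size, registers))
    model.eval()
    image = torch.randn(1, 3, side, side)  # stand-in for a normalised RGB image
    with torch.no_grad():
        out = model.forward_features(image)
    # CLS embedding (1, D) and patch tokens (1, N, D) with N = (side // 14) ** 2
    return out["x_norm_clstoken"], out["x_norm_patchtokens"]


print(hub_entrypoint("b", registers=True))  # dinov2_vitb14_reg
```

Calling `extract_features("s", side=224)` would be a cheap first check on a CPU-only machine before moving to the larger checkpoints.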

Key Features

  • State-of-the-art self-supervised visual features across multiple benchmarks
  • Frozen features match or exceed fine-tuned task-specific models on many tasks
  • The curated data pipeline removes the need for labeled datasets during pretraining
  • Dense patch features enable pixel-level downstream applications
  • Register tokens reduce artifacts in attention maps for dense prediction

Comparison with Similar Tools

  • CLIP — contrastive vision-language model with text alignment but less spatial detail
  • MAE — masked autoencoder learns good features but requires fine-tuning for best results
  • SAM — segment anything model focuses on segmentation masks rather than general features
  • EVA-02 — similar ViT backbone with CLIP distillation and masked modeling
  • SigLIP — sigmoid-based contrastive learning with strong zero-shot but weaker dense features

FAQ

Q: Can DINOv2 be used for zero-shot classification? A: DINOv2 alone does not do zero-shot classification since it lacks text alignment. You need to train a linear classifier on the extracted features or combine with a text encoder.

Q: What resolution should input images be? A: The default resolution is 518x518 pixels, which gives a 37x37 grid of 14x14-pixel patches (1,369 patches in total). Other resolutions work but may affect feature quality.
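
The patch arithmetic behind that answer can be checked with a small sketch. The helpers `patch_grid` and `snap_down` are hypothetical names; the only assumption taken from the text is the 14-pixel patch size.

```python
PATCH = 14  # patch side in pixels, per the ViT-*/14 model names


def patch_grid(height: int, width: int, patch: int = PATCH):
    """Return (rows, cols) of the patch grid; sides must be multiples of the patch."""
    if height % patch or width % patch:
        raise ValueError(f"{height}x{width} is not a multiple of {patch}")
    return height // patch, width // patch


def snap_down(size: int, patch: int = PATCH) -> int:
    """Largest multiple of the patch size that does not exceed `size`."""
    return (size // patch) * patch


rows, cols = patch_grid(518, 518)
print(rows, cols, rows * cols)  # 37 37 1369
print(snap_down(500))           # 490
```

Cropping or resizing an arbitrary image to `snap_down(h)` by `snap_down(w)` is one simple way to get a valid input size without padding.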

Q: How does DINOv2 compare to CLIP for retrieval? A: DINOv2 excels at visual similarity retrieval based on appearance. CLIP is better when you need semantic text-image matching.
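
Appearance-based retrieval of this kind usually reduces to nearest-neighbour search over embeddings. The sketch below uses cosine similarity on toy 3-dimensional vectors; the gallery names and values are made up for illustration, and real DINOv2 embeddings would have 384 to 1536 dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, gallery, k=3):
    """Return the names of the k gallery items most similar to the query."""
    scored = sorted(
        ((cosine(query, vec), name) for name, vec in gallery.items()),
        reverse=True,
    )
    return [name for _, name in scored[:k]]


# Toy embeddings standing in for image features.
gallery = {
    "cat_1": [0.9, 0.1, 0.0],
    "cat_2": [0.8, 0.2, 0.1],
    "car_1": [0.0, 0.1, 0.9],
}
print(top_k([1.0, 0.0, 0.0], gallery, k=2))  # ['cat_1', 'cat_2']
```

For galleries beyond a few thousand images, the same ranking would normally be done with a matrix product or an approximate nearest-neighbour index rather than a Python loop.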

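Q: Are the models available on Hugging Face? A: Yes, the DINOv2 model variants are published on the Hugging Face Hub under the facebook organization (for example facebook/dinov2-base). The source code lives in the facebookresearch/dinov2 repository on GitHub.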
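
A minimal sketch of the Hugging Face route, assuming the checkpoints are published as `facebook/dinov2-small`, `-base`, `-large`, and `-giant` and that the `transformers`, `torch`, and `Pillow` packages are installed. The helper `embed_image` is a hypothetical name, and taking token 0 as the CLS embedding is a common convention rather than the only option.

```python
# Assumed Hub ids for the four backbone sizes.
HF_IDS = {
    "small": "facebook/dinov2-small",
    "base": "facebook/dinov2-base",
    "large": "facebook/dinov2-large",
    "giant": "facebook/dinov2-giant",
}


def embed_image(path: str, size: str = "base"):
    """Return the CLS embedding of one image file (downloads weights on first use)."""
    import torch  # lazy imports keep the id table usable without these packages
    from PIL import Image
    from transformers import AutoImageProcessor, AutoModel

    model_id = HF_IDS[size]
    processor = AutoImageProcessor.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id).eval()

    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")  # resize, crop, normalise
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0]  # (1, hidden_size) CLS token
```

The processor applies its own default resize and crop, so the number of patch tokens in `last_hidden_state` depends on the processor configuration rather than on the original image size.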
