# DINOv2: Self-Supervised Visual Features by Meta

> DINOv2 produces universal visual features via self-supervised learning on curated data, providing strong image representations for classification, segmentation, and depth estimation without fine-tuning.

## Install

Save the Quick Use snippet below as a script file and run it.

## Quick Use

```bash
pip install torch torchvision
python -c "
import torch

# Download the ViT-B/14 backbone from torch.hub and switch to inference mode
dinov2 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14')
dinov2.eval()

# Extract features from an image tensor (random data stands in for a real image)
img = torch.randn(1, 3, 518, 518)
with torch.no_grad():
    features = dinov2(img)
print(features.shape)
"
```

## Introduction

DINOv2 is a family of self-supervised vision transformer models from Meta AI that learn visual features without any labeled data. Trained on a curated dataset of 142 million images, DINOv2 models serve as versatile visual backbones for tasks ranging from image classification to monocular depth estimation.

## What DINOv2 Does

- Produces general-purpose visual features that transfer to many downstream tasks
- Supports image classification, semantic segmentation, depth estimation, and retrieval
- Provides models in four sizes: ViT-S/14, ViT-B/14, ViT-L/14, and ViT-g/14
- Offers dense patch-level features useful for pixel-level tasks
- Works out of the box as a frozen feature extractor with a simple linear head

## Architecture Overview

DINOv2 uses Vision Transformer (ViT) architectures with a 14x14-pixel patch size. Training combines a self-distillation loss (the student-teacher framework from DINO) with a masked image modeling loss (inspired by iBOT). The teacher network is not trained by gradient descent; its weights are updated as an exponential moving average of the student weights. A key contribution is the automated data curation pipeline, which builds a high-quality 142M-image dataset from uncurated web data using self-supervised retrieval and deduplication.
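The exponential-moving-average teacher update described in the Architecture Overview can be sketched as follows. This is a minimal illustration on plain Python floats, not the training code from the DINOv2 repository: the function name `ema_update`, the parameter names `w` and `b`, and the momentum value of 0.99 are all assumptions chosen for the example.

```python
def ema_update(teacher, student, momentum=0.99):
    """Return new teacher weights: momentum * teacher + (1 - momentum) * student."""
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }


# Toy "weights": one scalar per parameter name (hypothetical names and values).
teacher = {"w": 0.0, "b": 1.0}
student = {"w": 1.0, "b": 1.0}

# The teacher drifts slowly toward the student; the student is held fixed here,
# whereas in real training it would change at every optimizer step.
for step in range(3):
    teacher = ema_update(teacher, student)

print(teacher)
```

A momentum close to 1 makes the teacher a slowly moving average of past student states, which is what lets it act as a stable target for the student.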
## Self-Hosting & Configuration

- Install PyTorch 2.0+ and load models via `torch.hub` or Hugging Face
- ViT-B/14 (86M parameters) runs inference on consumer GPUs with 4 GB of VRAM
- ViT-g/14 (1.1B parameters) requires 8 GB or more of VRAM
- Models accept images at 518x518 resolution (a 37x37 grid of patches) by default
- The registers variant (`dinov2_vitb14_reg`) adds register tokens for smoother feature maps

## Key Features

- State-of-the-art self-supervised visual features across multiple benchmarks
- Frozen features match or exceed fine-tuned task-specific models on many tasks
- Curated training data pipeline eliminates the need for labeled datasets
- Dense patch features enable pixel-level downstream applications
- Register tokens reduce artifacts in attention maps for dense prediction

## Comparison with Similar Tools

- **CLIP**: contrastive vision-language model with text alignment but less spatial detail
- **MAE**: masked autoencoder that learns good features but requires fine-tuning for best results
- **SAM**: Segment Anything Model, focused on segmentation masks rather than general features
- **EVA-02**: similar ViT backbone trained with CLIP distillation and masked modeling
- **SigLIP**: sigmoid-based contrastive learning with strong zero-shot performance but weaker dense features

## FAQ

**Q: Can DINOv2 be used for zero-shot classification?**
A: Not on its own, because it has no text alignment. You need to train a linear classifier on the extracted features or combine DINOv2 with a text encoder.

**Q: What resolution should input images be?**
A: The default resolution is 518x518 pixels, which gives a 37x37 grid (1,369 patches) of 14x14-pixel patches. Other resolutions work, ideally with height and width that are multiples of the 14-pixel patch size, but may affect feature quality.

**Q: How does DINOv2 compare to CLIP for retrieval?**
A: DINOv2 excels at visual similarity retrieval based on appearance. CLIP is better when you need semantic text-image matching.
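Appearance-based retrieval of the kind mentioned in the answer above usually comes down to ranking gallery images by the cosine similarity of their feature vectors to a query vector. The sketch below shows that ranking step only. The three-dimensional vectors and file names are made up for illustration and stand in for real DINOv2 embeddings (which would be much longer), and `rank_by_similarity` is a hypothetical helper, not part of any DINOv2 API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, gallery):
    """Return gallery names sorted from most to least similar to the query."""
    scores = {name: cosine_similarity(query, vec) for name, vec in gallery.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Made-up 3-d "features"; real ones would come from a frozen DINOv2 backbone.
query = [0.9, 0.1, 0.0]
gallery = {
    "cat_1.jpg": [1.0, 0.0, 0.0],
    "dog_1.jpg": [0.0, 1.0, 0.0],
    "car_1.jpg": [0.0, 0.0, 1.0],
}

ranking = rank_by_similarity(query, gallery)
print(ranking)
```

For large galleries, the same idea is typically implemented with normalized feature matrices and a single matrix product, or with an approximate nearest-neighbor index.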
**Q: Are the models available on Hugging Face?**
A: Yes, the DINOv2 model variants are published on the Hugging Face Hub under the `facebook` organization (for example, `facebook/dinov2-base`). The `facebookresearch` name refers to the GitHub organization that hosts the code.

## Sources

- https://github.com/facebookresearch/dinov2
- https://dinov2.metademolab.com/

---

Source: https://tokrepo.com/en/workflows/asset-b1a4009a

Author: Script Depot