# Depth Anything V2: Monocular Depth Estimation at Scale

> Depth Anything V2 is a family of monocular depth estimation models that predict accurate relative depth maps from single RGB images, trained on a massive dataset of 62 million pseudo-labeled real images.

## Install

Save the Quick Use snippet below as a script file and run it.

## Quick Use

```bash
# The depth_anything_v2 package lives in the project repository, so clone it first
git clone https://github.com/DepthAnything/Depth-Anything-V2
cd Depth-Anything-V2
pip install torch torchvision huggingface_hub opencv-python

# Assumes depth_anything_v2_vitb.pth has already been downloaded into this directory
python -c "
import cv2
import torch
from depth_anything_v2.dpt import DepthAnythingV2

# ViT-B configuration
model = DepthAnythingV2(encoder='vitb', features=128, out_channels=[96, 192, 384, 768])
model.load_state_dict(torch.load('depth_anything_v2_vitb.pth', map_location='cpu'))

# infer_image moves the input to the available accelerator, so keep the model on the same device
device = 'cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu'
model = model.to(device).eval()

# Predict depth from an image (read as a BGR array by OpenCV)
img = cv2.imread('photo.jpg')
depth = model.infer_image(img)
print(depth.shape)
"
```

## Introduction

Depth Anything V2 is a monocular depth estimation foundation model developed by researchers at the University of Hong Kong and TikTok. It predicts per-pixel relative depth from a single RGB image with high detail and edge preservation, trained through a pipeline that leverages both large-scale unlabeled data and precise synthetic depth labels.

## What Depth Anything V2 Does

- Predicts dense relative depth maps from single RGB images
- Provides models in three sizes: ViT-S (24.8M), ViT-B (97.5M), and ViT-L (335.3M)
- Handles diverse scenes including indoor, outdoor, close-up, and wide-angle views
- Generates metric depth estimates when fine-tuned with metric depth labels
- Supports video depth estimation with temporal consistency processing

## Architecture Overview

Depth Anything V2 uses a DPT (Dense Prediction Transformer) architecture with a DINOv2 backbone as the encoder. The key training innovation is a two-stage pipeline: first, a teacher model is trained on precise synthetic depth data from virtual environments; then, the teacher generates pseudo-labels for 62 million real-world unlabeled images.
The student models are then trained on these pseudo-labeled real images, combining the precision of synthetic labels with the diversity of real-world data.

## Self-Hosting & Configuration

- Clone the repository and download pre-trained model weights from Hugging Face
- ViT-S model runs on GPUs with as little as 2 GB VRAM for single-image inference
- ViT-L model requires 4 GB+ VRAM and delivers the highest accuracy
- Inference runs at 30+ FPS with the small model on consumer GPUs
- Metric depth fine-tuned models are available for indoor and outdoor use cases

## Key Features

- Precise edge-aware depth predictions with sharp boundaries between objects
- Synthetic-to-real training pipeline scales without manual depth annotations
- Three model sizes for flexible accuracy-speed tradeoffs
- Fine-grained detail preservation for small and thin structures
- Compatible with downstream 3D tasks like novel view synthesis and point cloud generation

## Comparison with Similar Tools

- **MiDaS**: Intel's monocular depth model; pioneered the field but is less accurate than V2
- **ZoeDepth**: combines relative and metric depth estimation, with a less scalable training setup
- **Marigold**: diffusion-based depth with high detail but significantly slower inference
- **Metric3D**: metric depth estimation focused on scale-aware predictions
- **UniDepth**: universal depth model aiming for metric depth across camera types

## FAQ

**Q: What is the difference between relative and metric depth?**
A: Relative depth predicts which pixels are closer or farther without an absolute scale. Metric depth estimates actual distances in meters and requires calibration or fine-tuning.

**Q: Can Depth Anything V2 process video?**
A: Yes. For basic use, process frames individually. The project also provides a video-specific pipeline with temporal smoothing for consistent depth across frames.
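
The frame-by-frame approach from the answer above can be sketched as follows. This is a minimal illustration and not the project's video pipeline: the exponential moving average, the `alpha` parameter, and the helper names `normalize_depth`, `smooth_depth_sequence`, and `fake_infer` are assumptions introduced here. `fake_infer` only stands in for `model.infer_image` so the snippet runs without model weights.

```python
import numpy as np


def normalize_depth(depth, eps=1e-6):
    """Rescale a relative depth map to [0, 1] so consecutive frames are comparable."""
    d_min, d_max = float(depth.min()), float(depth.max())
    return (depth - d_min) / max(d_max - d_min, eps)


def smooth_depth_sequence(frames, infer_fn, alpha=0.8):
    """Run infer_fn on each frame and blend the result with the previous
    smoothed map using an exponential moving average (hypothetical smoothing,
    chosen here only for illustration)."""
    smoothed = None
    for frame in frames:
        depth = normalize_depth(np.asarray(infer_fn(frame), dtype=np.float32))
        if smoothed is None:
            smoothed = depth  # first frame has nothing to blend with
        else:
            smoothed = alpha * smoothed + (1.0 - alpha) * depth
        yield smoothed


def fake_infer(frame):
    """Placeholder for model.infer_image: returns an HxW map from an HxWx3 frame."""
    return frame.mean(axis=2)


# Three synthetic 4x6 BGR frames with a horizontal intensity ramp
frames = [
    np.full((4, 6, 3), i, dtype=np.uint8)
    + np.arange(6, dtype=np.uint8).reshape(1, 6, 1)
    for i in range(3)
]
results = list(smooth_depth_sequence(frames, fake_infer, alpha=0.5))
print(len(results), results[0].shape)
```

With the real model, one would pass `model.infer_image` as `infer_fn` and feed frames read from a video file. A larger `alpha` gives steadier depth at the cost of lagging behind fast motion.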

**Q: How does V2 improve over V1?**
A: V2 still uses a DINOv2 encoder, but it replaces the labeled real images used to train V1 with precise synthetic images for the teacher, then trains students on pseudo-labeled real images. This synthetic-to-real pipeline yields sharper depth edges, fewer artifacts, and improved accuracy on benchmarks.

**Q: Does it work on images with transparent or reflective surfaces?**
A: Performance degrades on highly reflective, transparent, or featureless surfaces, as these are inherently ambiguous for monocular depth estimation.

## Sources

- https://github.com/DepthAnything/Depth-Anything-V2
- https://depth-anything-v2.github.io/

---

Source: https://tokrepo.com/en/workflows/asset-0219e18b
Author: Script Depot