Scripts · May 21, 2026 · 3 min read

Depth Anything V2 — Monocular Depth Estimation at Scale

Depth Anything V2 is a family of monocular depth estimation models that predict detailed relative depth maps from single RGB images. Its student models are trained on roughly 62 million real images that were labeled automatically by a teacher model rather than by hand.

Universal CLI install command
npx tokrepo install 0219e18b-54cc-11f1-9bc6-00163e2b0d79

Introduction

Depth Anything V2 is a monocular depth estimation foundation model developed by researchers at the University of Hong Kong and TikTok. It predicts per-pixel relative depth from a single RGB image with high detail and edge preservation, trained through a pipeline that leverages both large-scale unlabeled data and precise synthetic depth labels.

What Depth Anything V2 Does

  • Predicts dense relative depth maps from single RGB images
  • Provides models in three sizes: ViT-S (24.8M parameters), ViT-B (97.5M), and ViT-L (335.3M)
  • Handles diverse scenes including indoor, outdoor, close-up, and wide-angle views
  • Generates metric depth estimates when fine-tuned with metric depth labels
  • Supports video depth estimation with temporal consistency processing
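A relative depth map has no fixed units, so it is usually rescaled before being saved or displayed. The following is a minimal sketch of min-max scaling to 8-bit values, written in plain Python on nested lists to stay dependency-free; the function name `to_uint8` is hypothetical and not part of the project.

```python
def to_uint8(depth):
    """Min-max scale a 2-D relative depth map (list of rows) to integers in 0..255."""
    flat = [value for row in depth for value in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A constant map carries no ordering information; render it as all zeros.
        return [[0 for _ in row] for row in depth]
    span = hi - lo
    # int() truncates, mirroring a cast to an unsigned 8-bit integer.
    return [[int((value - lo) / span * 255) for value in row] for row in depth]


if __name__ == "__main__":
    print(to_uint8([[0.0, 5.0], [10.0, 2.5]]))
```

In practice the same arithmetic would normally be done on an array library rather than on lists; the logic is unchanged.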

Architecture Overview

Depth Anything V2 uses a DPT (Dense Prediction Transformer) decoder on top of a DINOv2 encoder. The key training innovation is a staged synthetic-to-real pipeline: first, a large teacher model is trained on precise synthetic depth data rendered from virtual environments; next, the teacher generates pseudo-labels for 62 million unlabeled real-world images; finally, student models are trained on those pseudo-labeled real images. The students thereby inherit the precision of synthetic labels while covering the diversity of real-world data.
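The teacher-to-student flow can be illustrated with a deliberately tiny stand-in. The sketch below replaces the transformer with a one-dimensional least-squares line and the images with single numbers, so it shows only the data flow (fit a teacher on labeled synthetic data, pseudo-label unlabeled real inputs, fit a student on the pseudo-labels). All names and values are made up for illustration, and having the student see only pseudo-labels is a simplification chosen for this sketch.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var = sum((x - mean_x) ** 2 for x in xs)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov / var
    return a, mean_y - a * mean_x


def predict(model, xs):
    a, b = model
    return [a * x + b for x in xs]


# Step 1: train the teacher on synthetic inputs with exact labels.
synthetic_x = [0.0, 1.0, 2.0, 3.0]
synthetic_depth = [1.0, 3.0, 5.0, 7.0]
teacher = fit_line(synthetic_x, synthetic_depth)

# Step 2: the teacher pseudo-labels unlabeled "real" inputs.
real_x = [0.5, 1.5, 2.5, 4.0, 6.0]
pseudo_depth = predict(teacher, real_x)

# Step 3: the student is trained on the pseudo-labeled real inputs.
student = fit_line(real_x, pseudo_depth)

if __name__ == "__main__":
    print("teacher:", teacher)
    print("student:", student)
```

Because the toy teacher is exactly linear, the student recovers the same line; with real networks the student is smaller than the teacher and only approximates it.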

Self-Hosting & Configuration

  • Clone the repository and download pre-trained model weights from Hugging Face
  • The ViT-S model runs on GPUs with as little as 2 GB of VRAM for single-image inference
  • The ViT-L model needs 4 GB or more of VRAM and delivers the highest accuracy of the three sizes
  • The small model reaches 30+ FPS on consumer GPUs; actual throughput depends on input resolution and hardware
  • Metric depth fine-tuned models are available for indoor and outdoor use cases
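As an alternative to cloning the repository, the weights can also be loaded through the Hugging Face `transformers` depth-estimation pipeline. The sketch below assumes the `transformers`, `torch`, and `Pillow` packages are installed and that the checkpoint ids listed in `CHECKPOINTS` are the ones published on the Hub; the helper names `checkpoint_for` and `estimate_depth` are hypothetical.

```python
# Assumed Hugging Face Hub ids for the three model sizes.
CHECKPOINTS = {
    "small": "depth-anything/Depth-Anything-V2-Small-hf",
    "base": "depth-anything/Depth-Anything-V2-Base-hf",
    "large": "depth-anything/Depth-Anything-V2-Large-hf",
}


def checkpoint_for(size):
    """Map a size name ('small', 'base', 'large') to a checkpoint id."""
    try:
        return CHECKPOINTS[size.lower()]
    except KeyError:
        raise ValueError(
            f"unknown size {size!r}; expected one of {sorted(CHECKPOINTS)}"
        ) from None


def estimate_depth(image_path, size="small"):
    """Run single-image inference and return the depth map as a PIL image."""
    # Imported lazily so the lookup helper above works without heavy dependencies.
    from PIL import Image
    from transformers import pipeline

    pipe = pipeline(task="depth-estimation", model=checkpoint_for(size))
    result = pipe(Image.open(image_path))
    return result["depth"]


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        estimate_depth(sys.argv[1]).save("depth.png")
```

The first call downloads the weights, so it needs network access; later calls reuse the local cache.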

Key Features

  • Precise edge-aware depth predictions with sharp boundaries between objects
  • Synthetic-to-real training pipeline scales without manual depth annotations
  • Three model sizes for flexible accuracy-speed tradeoffs
  • Fine-grained detail preservation for small and thin structures
  • Compatible with downstream 3D tasks like novel view synthesis and point cloud generation
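Point cloud generation from a depth map is typically a pinhole-camera back-projection. The sketch below assumes a metric depth map (distance along the optical axis, in meters) and known intrinsics `fx`, `fy`, `cx`, `cy`; relative depth would first need to be converted to metric scale. The function name `backproject` is hypothetical.

```python
def backproject(depth, fx, fy, cx, cy):
    """Convert a 2-D metric depth map into a list of (x, y, z) camera-space points.

    depth[v][u] is the z-distance at pixel column u, row v.
    Pixels with non-positive depth are treated as invalid and skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


if __name__ == "__main__":
    cloud = backproject([[1.0, 2.0], [0.0, 4.0]], fx=2.0, fy=2.0, cx=0.0, cy=0.0)
    print(cloud)
```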

Comparison with Similar Tools

  • MiDaS — Intel monocular depth model, pioneered the field but lower accuracy than V2
  • ZoeDepth — combines relative and metric depth estimation, less scalable training
  • Marigold — diffusion-based depth with high detail but significantly slower inference
  • Metric3D — metric depth estimation focused on scale-aware predictions
  • UniDepth — universal depth model aiming for metric depth across camera types

FAQ

Q: What is the difference between relative and metric depth? A: Relative depth predicts which pixels are closer or farther without absolute scale. Metric depth estimates actual distances in meters and requires calibration or fine-tuning.
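One common way to calibrate a relative prediction is to fit a single scale and shift against a few reference measurements by least squares. The sketch below assumes the prediction and the reference are related by an affine map in whatever space they are compared in (for models that output inverse depth, that would be inverse depth). The function names are hypothetical, and this is a generic technique rather than the project's own fine-tuning procedure.

```python
def align_scale_shift(pred, target):
    """Least-squares scale s and shift t such that s * pred + t approximates target."""
    n = len(pred)
    mean_p, mean_t = sum(pred) / n, sum(target) / n
    var = sum((p - mean_p) ** 2 for p in pred)
    if var == 0:
        raise ValueError("prediction is constant; scale cannot be determined")
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    scale = cov / var
    return scale, mean_t - scale * mean_p


def apply_scale_shift(pred, scale, shift):
    return [scale * p + shift for p in pred]


if __name__ == "__main__":
    s, t = align_scale_shift([1.0, 2.0, 3.0], [3.0, 5.0, 7.0])
    print(s, t, apply_scale_shift([4.0], s, t))
```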

Q: Can Depth Anything V2 process video? A: Yes, process frames individually for basic use. The project also provides a video-specific pipeline with temporal smoothing for consistent depth across frames.
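To give a sense of what temporal smoothing means, the sketch below applies an exponential moving average across per-frame depth maps. It is a generic illustration, not the project's video pipeline, and it assumes every frame has the same size and has already been normalized to a common range; the function name and the `alpha` parameter are hypothetical.

```python
def smooth_depth_sequence(frames, alpha=0.5):
    """Exponentially smooth a sequence of 2-D depth maps (lists of rows).

    alpha weights the current frame; (1 - alpha) weights the running average.
    """
    smoothed = []
    previous = None
    for frame in frames:
        if previous is None:
            current = [list(row) for row in frame]
        else:
            current = [
                [alpha * value + (1 - alpha) * prev for value, prev in zip(row, prev_row)]
                for row, prev_row in zip(frame, previous)
            ]
        smoothed.append(current)
        previous = current
    return smoothed


if __name__ == "__main__":
    print(smooth_depth_sequence([[[0.0]], [[1.0]], [[1.0]]], alpha=0.5))
```

A lower `alpha` reduces flicker more strongly but makes depth lag behind fast motion.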

Q: How does V2 improve over V1? A: V2 keeps the DINOv2 encoder but replaces V1's labeled real images with precise synthetic images and uses a larger teacher model to pseudo-label unlabeled real images, yielding sharper depth edges, fewer artifacts, and improved accuracy on benchmarks.

Q: Does it work on images with transparent or reflective surfaces? A: Performance degrades on highly reflective, transparent, or featureless surfaces, as these are inherently ambiguous for monocular depth estimation.
