Scripts · May 21, 2026 · 1 min read

Depth Anything V2 — Monocular Depth Estimation at Scale

Depth Anything V2 is a family of monocular depth estimation models that predict detailed relative depth maps from single RGB images. The models are trained with precise synthetic depth labels and about 62 million pseudo-labeled real images.

Universal CLI install command
npx tokrepo install 0219e18b-54cc-11f1-9bc6-00163e2b0d79

Introduction

Depth Anything V2 is a monocular depth estimation foundation model developed by researchers at the University of Hong Kong and TikTok. It predicts per-pixel relative depth from a single RGB image with high detail and edge preservation, trained through a pipeline that leverages both large-scale unlabeled data and precise synthetic depth labels.

What Depth Anything V2 Does

  • Predicts dense relative depth maps from single RGB images
  • Provides models in three sizes: ViT-S (24.8M parameters), ViT-B (97.5M), and ViT-L (335.3M)
  • Handles diverse scenes including indoor, outdoor, close-up, and wide-angle views
  • Generates metric depth estimates when fine-tuned with metric depth labels
  • Supports video input by running inference frame by frame; temporally consistent video depth is the focus of the follow-up project Video Depth Anything

Architecture Overview

Depth Anything V2 uses a DPT (Dense Prediction Transformer) decoder with a DINOv2 backbone as the encoder. The key training innovation is a three-step pipeline. First, a large teacher model is trained only on precise synthetic depth data from virtual environments. Second, the teacher generates pseudo-labels for about 62 million unlabeled real-world images. Third, the student models are trained on those pseudo-labeled real images. The students thereby inherit the precision of the synthetic labels while seeing the diversity of real-world data.
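The teacher-to-student data flow described above can be illustrated with a deliberately tiny toy. This is a sketch of the general pseudo-labeling idea, not the project's training code: the "models" are one-dimensional least-squares lines, and `fit_line`, `predict`, and all data are hypothetical stand-ins. In this sketch the student is fit on the pseudo-labeled samples, which is an assumption of the example.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y ~ a * x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov_xy / var_x
    return a, mean_y - a * mean_x


def predict(model, xs):
    """Apply a fitted (a, b) line to a list of inputs."""
    a, b = model
    return [a * x + b for x in xs]


rng = random.Random(0)

# Step 1: train the teacher on a small set with exact ("synthetic") labels.
synthetic_x = [i / 10 for i in range(11)]
synthetic_y = [3.0 * x + 1.0 for x in synthetic_x]
teacher = fit_line(synthetic_x, synthetic_y)

# Step 2: the teacher labels a much larger, more diverse unlabeled ("real") set.
real_x = [rng.uniform(-5.0, 5.0) for _ in range(1000)]
pseudo_y = predict(teacher, real_x)

# Step 3: the student is trained on the pseudo-labeled real samples.
student = fit_line(real_x, pseudo_y)
print("teacher:", teacher, "student:", student)
```

The toy only shows where labels come from at each step; in the real pipeline the teacher and students are neural networks of different sizes.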

Self-Hosting & Configuration

  • Clone the repository and download the pre-trained model weights from Hugging Face
  • The ViT-S model runs on GPUs with as little as 2 GB of VRAM for single-image inference
  • The ViT-L model requires 4 GB or more of VRAM and delivers the highest accuracy
  • With the small model, inference can exceed 30 FPS on consumer GPUs, depending on input resolution and hardware
  • Metric depth fine-tuned models are available for indoor and outdoor use cases
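After cloning the repository and downloading a checkpoint, single-image inference can be sketched as below. The `DepthAnythingV2` class, its `infer_image` method, the per-encoder settings, and the `checkpoints/depth_anything_v2_<encoder>.pth` naming follow the project's README as of this writing and should be treated as assumptions that may change. The helper names `checkpoint_path`, `load_model`, and `estimate_depth` are hypothetical. The heavy imports are kept inside the functions so the configuration can be inspected without PyTorch installed.

```python
# Per-encoder settings as listed in the project README (assumption: may change).
MODEL_CONFIGS = {
    "vits": {"encoder": "vits", "features": 64, "out_channels": [48, 96, 192, 384]},
    "vitb": {"encoder": "vitb", "features": 128, "out_channels": [96, 192, 384, 768]},
    "vitl": {"encoder": "vitl", "features": 256, "out_channels": [256, 512, 1024, 1024]},
}


def checkpoint_path(encoder, root="checkpoints"):
    """Build the expected local path of a downloaded checkpoint."""
    if encoder not in MODEL_CONFIGS:
        raise ValueError(f"unknown encoder: {encoder}")
    return f"{root}/depth_anything_v2_{encoder}.pth"


def load_model(encoder="vits", device="cpu"):
    """Load a model; must run from inside the cloned repository."""
    import torch
    from depth_anything_v2.dpt import DepthAnythingV2

    model = DepthAnythingV2(**MODEL_CONFIGS[encoder])
    state = torch.load(checkpoint_path(encoder), map_location="cpu")
    model.load_state_dict(state)
    return model.to(device).eval()


def estimate_depth(model, image_path):
    """Return an HxW relative depth map (NumPy array) for one image file."""
    import cv2

    image = cv2.imread(image_path)  # BGR image, or None if the file is unreadable
    if image is None:
        raise FileNotFoundError(image_path)
    return model.infer_image(image)
```

Typical use would be `depth = estimate_depth(load_model("vits", "cuda"), "photo.jpg")`, where `photo.jpg` is a placeholder path.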

Key Features

  • Precise edge-aware depth predictions with sharp boundaries between objects
  • Synthetic-to-real training pipeline scales without manual depth annotations
  • Three model sizes for flexible accuracy-speed tradeoffs
  • Fine-grained detail preservation for small and thin structures
  • Compatible with downstream 3D tasks like novel view synthesis and point cloud generation
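As an illustration of the point cloud use case, a depth map can be back-projected into 3D with a pinhole camera model. This is a generic sketch, not part of the project: it assumes metric depth values (for example from a metric fine-tuned model) and known camera intrinsics `fx`, `fy`, `cx`, `cy`, and the function name `depth_to_points` is hypothetical.

```python
def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth map (list of rows) into (x, y, z) camera-space points.

    Pixels with non-positive depth are treated as invalid and skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue
            x = (u - cx) * z / fx  # horizontal offset from the principal point
            y = (v - cy) * z / fy  # vertical offset from the principal point
            points.append((x, y, z))
    return points


# Tiny 2x2 example with one invalid pixel.
demo_depth = [[1.0, 2.0],
              [0.0, 4.0]]
demo_points = depth_to_points(demo_depth, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
print(demo_points)
```

With relative (unscaled) depth the same projection still produces a plausible shape, but distances in the resulting cloud are not in real-world units.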

Comparison with Similar Tools

  • MiDaS — Intel monocular depth model, pioneered the field but lower accuracy than V2
  • ZoeDepth — combines relative and metric depth estimation, less scalable training
  • Marigold — diffusion-based depth with high detail but significantly slower inference
  • Metric3D — metric depth estimation focused on scale-aware predictions
  • UniDepth — universal depth model aiming for metric depth across camera types

FAQ

Q: What is the difference between relative and metric depth? A: Relative depth predicts which pixels are closer or farther without absolute scale. Metric depth estimates actual distances in meters and requires calibration or fine-tuning.
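To make the distinction concrete, the sketch below aligns relative predictions to a few reference measurements with a least-squares scale and shift. It assumes the relative values differ from the reference by an unknown scale and offset, a common convention for this kind of output; whether that relation holds in depth or inverse-depth space depends on the model, so treat this as an illustration only. The function name `fit_scale_shift` is hypothetical.

```python
def fit_scale_shift(relative, reference):
    """Closed-form least squares for reference ~ scale * relative + shift."""
    n = len(relative)
    if n < 2 or n != len(reference):
        raise ValueError("need at least two paired samples")
    sum_d = sum(relative)
    sum_g = sum(reference)
    sum_dd = sum(d * d for d in relative)
    sum_dg = sum(d * g for d, g in zip(relative, reference))
    denom = n * sum_dd - sum_d ** 2
    if denom == 0:
        raise ValueError("relative values must not all be equal")
    scale = (n * sum_dg - sum_d * sum_g) / denom
    shift = (sum_g - scale * sum_d) / n
    return scale, shift


# Relative predictions at three pixels and matching reference values.
relative_values = [0.0, 0.5, 1.0]
reference_values = [1.0, 2.0, 3.0]
scale, shift = fit_scale_shift(relative_values, reference_values)
aligned = [scale * d + shift for d in relative_values]
print(scale, shift, aligned)
```

Without such reference measurements (or a metric fine-tuned model), the scale and shift remain unknown, which is why relative depth alone cannot give distances in meters.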

Q: Can Depth Anything V2 process video? A: Yes. For basic use, process frames individually; because each frame is predicted independently, the output can flicker between frames. For temporally consistent depth, see the follow-up project Video Depth Anything.
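When frames are processed one at a time, a simple and generic way to reduce frame-to-frame flicker is an exponential moving average over the per-frame depth maps. The sketch below is not the project's video pipeline; it assumes all frames share the same resolution and a comparable value range, and the name `smooth_depth_sequence` and the `alpha` parameter are hypothetical.

```python
def smooth_depth_sequence(frames, alpha=0.5):
    """Exponentially smooth a sequence of depth maps (each a list of rows).

    alpha is the weight of the current frame; lower values smooth more
    but respond more slowly to real scene changes.
    """
    smoothed = []
    previous = None
    for frame in frames:
        if previous is None:
            current = [row[:] for row in frame]  # first frame passes through
        else:
            current = [
                [alpha * d + (1.0 - alpha) * p for d, p in zip(row, prev_row)]
                for row, prev_row in zip(frame, previous)
            ]
        smoothed.append(current)
        previous = current
    return smoothed


# Three 1x2 "frames": the first pixel jumps from 0 to 1, the second is stable.
demo_frames = [[[0.0, 2.0]], [[1.0, 2.0]], [[1.0, 2.0]]]
demo_smoothed = smooth_depth_sequence(demo_frames, alpha=0.5)
print(demo_smoothed)
```

This kind of filter blurs fast motion and does not account for camera movement, so it is only a rough baseline.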

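Q: How does V2 improve over V1? A: V2 replaces the labeled real images used in V1 with precise synthetic images, scales up the teacher model, and trains the student models on large-scale pseudo-labeled real images. The result is sharper depth edges, fewer artifacts, and improved accuracy on benchmarks.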

Q: Does it work on images with transparent or reflective surfaces? A: Performance degrades on highly reflective, transparent, or featureless surfaces, as these are inherently ambiguous for monocular depth estimation.
