Configs · May 2, 2026 · 3 min read

SAM 2 — Segment Anything in Images and Videos

SAM 2 is Meta's next-generation Segment Anything Model, extending promptable segmentation from images to videos. It tracks and segments objects across video frames in real time with a unified architecture.

Introduction

SAM 2 (Segment Anything Model 2) extends Meta's original SAM from static images to streaming video. It introduces a memory mechanism that allows the model to track and segment objects across frames, handling occlusions, reappearances, and object deformation.

What SAM 2 Does

  • Segments objects in both images and videos with point, box, or mask prompts
  • Tracks segmented objects across video frames with temporal consistency
  • Handles occlusion and object reappearance using a memory bank
  • Supports interactive refinement of masks on any frame during processing
  • Provides the SA-V dataset with 642K masklets across 51K videos

Architecture Overview

SAM 2 uses a Hiera image encoder for per-frame feature extraction, a memory attention module that conditions the current frame's prediction on past frames and prompted frames stored in a memory bank, and a lightweight mask decoder that follows SAM's design. A memory encoder writes each frame's prediction back to the bank so later frames can attend to it. Because the architecture is streaming, video is processed frame by frame without requiring the full video in memory.
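The streaming loop described above can be sketched in plain Python. This is a toy illustration of the data flow only, not SAM 2's implementation: `MemoryBank`, `segment_video`, the `predict` callback, and the window of 6 recent frames are hypothetical names and values chosen for the example, and a real model would store feature tensors rather than arbitrary objects.

```python
from collections import deque
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


class MemoryBank:
    """Toy memory bank: a bounded FIFO of recent frames plus every prompted frame."""

    def __init__(self, max_recent: int = 6) -> None:
        # deque(maxlen=...) drops the oldest entry automatically once it is full.
        self.recent: deque = deque(maxlen=max_recent)
        self.prompted: Dict[int, object] = {}

    def write(self, frame_idx: int, memory: object, prompted: bool = False) -> None:
        """Store one frame's memory; prompted frames are kept outside the FIFO."""
        if prompted:
            self.prompted[frame_idx] = memory
        else:
            self.recent.append((frame_idx, memory))

    def read(self) -> List[Tuple[int, object]]:
        """Return prompted frames first, then recent frames from oldest to newest."""
        return sorted(self.prompted.items()) + list(self.recent)


def segment_video(
    frames: Iterable[object],
    prompts: Dict[int, object],
    predict: Callable[[object, object, list], object],
    max_recent: int = 6,
) -> Iterator[Tuple[int, object]]:
    """Process frames one at a time, conditioning each prediction on the bank.

    `predict(frame, prompt, memories)` stands in for the encoder, memory
    attention, and mask decoder; it is an assumption of this sketch.
    """
    bank = MemoryBank(max_recent)
    for idx, frame in enumerate(frames):
        prompt = prompts.get(idx)
        mask = predict(frame, prompt, bank.read())
        bank.write(idx, mask, prompted=prompt is not None)
        yield idx, mask
```

Because `segment_video` is a generator and the FIFO is bounded, the number of memories passed to `predict` stops growing once the window fills, no matter how many frames are consumed.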

Self-Hosting & Configuration

  • Requires Python 3.10+ and PyTorch 2.3.1+
  • Multiple checkpoint sizes: Hiera-T (39M), Hiera-S, Hiera-B+, Hiera-L (224M)
  • A GPU with 8 GB of VRAM is sufficient for the base model
  • Jupyter notebook demos included for both image and video workflows
  • ONNX export for edge deployment is possible, though it typically relies on community tooling rather than an official exporter
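A quick preflight check can catch version mismatches before downloading checkpoints. The sketch below encodes only the minimum versions listed above; the helper names (`parse_version`, `meets_minimum`, `check_environment`) are made up for this example, and the upstream requirements may change, so treat the constants as assumptions to verify against the repository.

```python
import re
import sys

# Minimums taken from the requirements listed above; verify against upstream.
MIN_PYTHON = (3, 10)
MIN_TORCH = (2, 3, 1)


def parse_version(text: str) -> tuple:
    """Extract the leading numeric release from a string such as '2.3.1+cu121'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    # A missing patch component (e.g. '2.4') is treated as 0.
    return tuple(int(part) if part else 0 for part in match.groups())


def meets_minimum(version: str, minimum: tuple) -> bool:
    """Compare versions numerically, so '2.10.0' correctly ranks above '2.3.1'."""
    return parse_version(version) >= minimum


def check_environment() -> list:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append(f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ is required")
    try:
        import torch
    except ImportError:
        problems.append("PyTorch is not installed")
    else:
        if not meets_minimum(torch.__version__, MIN_TORCH):
            problems.append(f"PyTorch {torch.__version__} is older than 2.3.1")
    return problems
```

Comparing parsed tuples instead of raw strings matters here: as strings, "2.10.0" sorts before "2.3.1", which would wrongly reject a newer PyTorch.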

Key Features

  • Unified architecture handles both image and video segmentation
  • 6x faster than SAM on images due to the more efficient Hiera backbone
  • Memory mechanism enables real-time video object tracking
  • SA-V dataset contains 53x more masks than any prior video segmentation dataset
  • Interactive prompting allows corrections at any video frame

Comparison with Similar Tools

  • SAM (v1): image-only segmentation; SAM 2 adds video tracking and a faster backbone
  • XMem: strong video object segmentation baseline; SAM 2 adds promptable interaction and better generalization
  • Cutie: semi-supervised video object segmentation that propagates a given first-frame mask; SAM 2 also accepts point and box prompts on any frame
  • Track Anything Model (TAM): combines SAM with a separate tracking model; SAM 2 integrates tracking natively

FAQ

Q: Can SAM 2 run on live camera feeds? A: The streaming architecture processes frames sequentially and can work with live feeds given sufficient GPU throughput.

Q: Is SAM 2 backward compatible with SAM? A: SAM 2 handles images as single-frame videos and outperforms SAM v1 on image segmentation benchmarks.

Q: What video formats are supported? A: The model processes extracted frames (JPEG/PNG). Video decoding is handled separately before inference.
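Since decoding happens before inference, a small preprocessing step usually sits in front of the model. The sketch below assumes the `ffmpeg` command-line tool is installed and that frames are named by their zero-padded index; `build_ffmpeg_command` and `sorted_frames` are hypothetical helper names, not part of SAM 2.

```python
from pathlib import Path
from typing import Iterable, List

FRAME_SUFFIXES = {".jpg", ".jpeg", ".png"}


def build_ffmpeg_command(video: str, out_dir: str) -> List[str]:
    """Build an ffmpeg argument list that writes frames as 00000.jpg, 00001.jpg, ..."""
    return [
        "ffmpeg", "-i", video,
        "-q:v", "2",            # high JPEG quality
        "-start_number", "0",   # number the first frame 0
        str(Path(out_dir) / "%05d.jpg"),
    ]


def sorted_frames(names: Iterable[str]) -> List[str]:
    """Keep image files whose stem is a frame index and sort them numerically.

    Numeric sorting avoids the lexicographic trap where '10.jpg' precedes '2.jpg'.
    """
    frames = [
        name for name in names
        if Path(name).suffix.lower() in FRAME_SUFFIXES and Path(name).stem.isdecimal()
    ]
    return sorted(frames, key=lambda name: int(Path(name).stem))
```

The command list can be passed to `subprocess.run(cmd, check=True)` once the output directory exists, and `sorted_frames(os.listdir(out_dir))` then gives the frames in temporal order.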

Q: How long can processed videos be? A: There is no hard limit. The memory bank uses a fixed window, so arbitrarily long videos can be processed in streaming fashion.
