May 2, 2026 · 3 min read

SAM 2 — Segment Anything in Images and Videos

Meta's next-generation Segment Anything Model that extends promptable segmentation from images to videos. SAM 2 tracks and segments objects across video frames in real time with a unified architecture.

Introduction

SAM 2 (Segment Anything Model 2) extends Meta's original SAM from static images to streaming video. It introduces a memory mechanism that allows the model to track and segment objects across frames, handling occlusions, reappearances, and object deformation.

What SAM 2 Does

  • Segments objects in both images and videos with point, box, or mask prompts (see the sketch after this list)
  • Tracks segmented objects across video frames with temporal consistency
  • Handles occlusion and object reappearance using a memory bank
  • Supports interactive refinement of masks on any frame during processing
  • Provides the SA-V dataset with 642K masklets across 51K videos
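
To make the prompting workflow concrete, here is a minimal image-segmentation sketch using the predictor class from the official repository. The Hub model id is one of several published variants, and the click coordinates are illustrative:

```python
import numpy as np
from PIL import Image
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Loads weights from the Hugging Face Hub; the model id may differ by release.
predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")

image = np.array(Image.open("photo.jpg").convert("RGB"))
predictor.set_image(image)

# One positive click (label 1) on the object of interest.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
best_mask = masks[np.argmax(scores)]  # (H, W) boolean array
```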

Architecture Overview

SAM 2 uses a Hiera image encoder for per-frame feature extraction, a memory attention module that conditions current-frame predictions on past frames and prompted frames stored in a memory bank, and the same lightweight mask decoder from SAM. A memory encoder writes per-frame predictions back to the bank for future reference. This streaming architecture processes video frame by frame without requiring the full video in memory.
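
The repository exposes this streaming design through a video predictor. A minimal sketch, assuming 2.1-style config and checkpoint filenames (they vary by release) and a directory of pre-extracted JPEG frames:

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Config/checkpoint paths are assumptions; match them to your download.
predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_l.yaml", "checkpoints/sam2.1_hiera_large.pt"
)

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    # init_state indexes a directory of JPEG frames (one file per frame)
    state = predictor.init_state(video_path="video_frames/")

    # Prompt object 1 with a single positive click on frame 0
    predictor.add_new_points_or_box(
        state, frame_idx=0, obj_id=1,
        points=np.array([[210, 350]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )

    # Memory attention carries the object forward frame by frame
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks = (mask_logits > 0.0).cpu().numpy()  # one mask per tracked object
```

Only the prompted frames and a bounded window of recent frames are kept in the memory bank, which is what keeps the per-frame cost constant regardless of video length.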

Self-Hosting & Configuration

  • Requires Python 3.10+ and PyTorch 2.3.1+
  • Multiple checkpoint sizes: Hiera-T (39M), Hiera-S (46M), Hiera-B+ (81M), Hiera-L (224M); a loading sketch follows this list
  • GPU with 8 GB VRAM sufficient for the base model
  • Jupyter notebook demos included for both image and video workflows
  • ONNX export for edge deployment is possible through community exporters
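
A minimal self-hosting sketch for picking a checkpoint size. The config and checkpoint filenames below are assumptions based on a 2.1-style release; adjust them to whatever you downloaded:

```python
import torch
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Filenames are assumptions; match them to the checkpoints you downloaded.
CHECKPOINTS = {
    "tiny":  ("configs/sam2.1/sam2.1_hiera_t.yaml", "checkpoints/sam2.1_hiera_tiny.pt"),
    "large": ("configs/sam2.1/sam2.1_hiera_l.yaml", "checkpoints/sam2.1_hiera_large.pt"),
}

cfg, ckpt = CHECKPOINTS["tiny"]  # smallest footprint; fits comfortably in 8 GB
device = "cuda" if torch.cuda.is_available() else "cpu"
predictor = SAM2ImagePredictor(build_sam2(cfg, ckpt, device=device))
```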

Key Features

  • Unified architecture handles both image and video segmentation
  • 6x faster than SAM on images due to the more efficient Hiera backbone
  • Memory mechanism enables real-time video object tracking
  • SA-V dataset is 53x larger than prior video segmentation datasets
  • Interactive prompting allows corrections at any video frame; a refinement sketch follows this list
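
Continuing the hypothetical video session from the architecture section (`predictor` and `state` carry over), a correction is just another prompt on whichever frame has drifted; re-propagating then pushes the fix through the memory bank. The frame index and click coordinates here are illustrative:

```python
with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    # A negative click (label 0) on a later frame carves the error out of the mask
    predictor.add_new_points_or_box(
        state, frame_idx=120, obj_id=1,
        points=np.array([[305, 240]], dtype=np.float32),
        labels=np.array([0], dtype=np.int32),
    )
    # Propagate again so the correction flows forward through the memory bank
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks = (mask_logits > 0.0).cpu().numpy()
```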

Comparison with Similar Tools

  • SAM (v1) — image-only segmentation; SAM 2 adds video tracking and a faster backbone
  • XMem — strong video object segmentation baseline; SAM 2 adds promptable interaction and better generalization
  • Cutie — semi-supervised video segmentation; SAM 2 supports zero-shot prompting without per-video training
  • Track Anything Model (TAM) — combines SAM with tracking heuristics; SAM 2 integrates tracking natively

FAQ

Q: Can SAM 2 run on live camera feeds? A: The streaming architecture processes frames sequentially and can work with live feeds given sufficient GPU throughput.

Q: Is SAM 2 backward compatible with SAM? A: SAM 2 handles images as single-frame videos and outperforms SAM v1 on image segmentation benchmarks.

Q: What video formats are supported? A: The model processes extracted frames (JPEG/PNG). Video decoding is handled separately before inference.
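
As an illustration of that pre-processing step, frames can be extracted with ffmpeg before calling the video predictor; the flags below are one reasonable choice, not a requirement:

```python
import os
import subprocess

os.makedirs("video_frames", exist_ok=True)
# Extract every frame as a quality-2 JPEG, numbered 00000.jpg, 00001.jpg, ...
subprocess.run(
    ["ffmpeg", "-i", "input.mp4", "-q:v", "2", "-start_number", "0",
     "video_frames/%05d.jpg"],
    check=True,
)
```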

Q: How long can processed videos be? A: There is no hard limit. The memory bank uses a fixed window, so arbitrarily long videos can be processed in streaming fashion.
