Introduction
SAM 2 (Segment Anything Model 2) extends Meta's original SAM from static images to streaming video. It introduces a memory mechanism that allows the model to track and segment objects across frames, handling occlusions, reappearances, and object deformation.
What SAM 2 Does
- Segments objects in both images and videos with point, box, or mask prompts (see the point-prompt sketch after this list)
- Tracks segmented objects across video frames with temporal consistency
- Handles occlusion and object reappearance using a memory bank
- Supports interactive refinement of masks on any frame during processing
- Provides the SA-V dataset with 642K masklets across 51K videos
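As a concrete example of point prompting on a single image, the sketch below follows the image predictor API published in the SAM 2 repository. The checkpoint path, config name, dummy image, and click coordinates are assumptions; `build_sam2` defaults to a CUDA device.

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Assumed local paths: download checkpoints from the SAM 2 repo first.
checkpoint = "./checkpoints/sam2_hiera_large.pt"
model_cfg = "sam2_hiera_l.yaml"
predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))

image = np.zeros((480, 640, 3), dtype=np.uint8)  # replace with a real RGB frame

with torch.inference_mode():
    predictor.set_image(image)
    # One foreground click at pixel (x=320, y=240); label 1 = positive point.
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[320, 240]]),
        point_labels=np.array([1]),
        multimask_output=True,  # return several candidate masks with scores
    )
best_mask = masks[np.argmax(scores)]
```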
Architecture Overview
SAM 2 uses a Hiera image encoder for per-frame feature extraction, a memory attention module that conditions current-frame predictions on past frames and prompted frames stored in a memory bank, and the same lightweight mask decoder from SAM. A memory encoder writes per-frame predictions back to the bank for future reference. This streaming architecture processes video frame by frame without requiring the full video in memory.
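The dataflow can be summarized in schematic form. The sketch below is an illustration only, with hypothetical function names standing in for the model's components; it is not the library's actual module API:

```python
from collections import deque

def segment_video(frames, encoder, memory_attention, mask_decoder,
                  memory_encoder, num_recent=7):
    """Schematic SAM 2 streaming loop (illustration, not the real API)."""
    recent_memories = deque(maxlen=num_recent)  # rolling window of past frames
    prompted_memories = []                      # memories from prompted frames
    for frame in frames:
        feats = encoder(frame)  # per-frame Hiera features
        # Condition current-frame features on the memory bank.
        conditioned = memory_attention(
            feats, prompted_memories + list(recent_memories)
        )
        mask = mask_decoder(conditioned)
        # Write this frame's prediction back into the bank for future frames.
        recent_memories.append(memory_encoder(feats, mask))
        yield mask
```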
Self-Hosting & Configuration
- Requires Python 3.10+ and PyTorch 2.3.1+
- Multiple checkpoint sizes: Hiera-T (38.9M params), Hiera-S (46M), Hiera-B+ (80.8M), Hiera-L (224.4M)
- A GPU with 8 GB of VRAM is sufficient for the base model (a loading sketch follows this list)
- Jupyter notebook demos included for both image and video workflows
- ONNX export for edge deployment is available through community exporter tools
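A minimal loading sketch, assuming the repository is installed (`pip install -e .`) and checkpoints have been downloaded into `./checkpoints` via the provided script; the paths are assumptions:

```python
import torch
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

device = "cuda" if torch.cuda.is_available() else "cpu"
# Hiera-B+ fits in 8 GB of VRAM; paths assume the repo's checkpoint
# download script was run into ./checkpoints.
checkpoint = "./checkpoints/sam2_hiera_base_plus.pt"
model_cfg = "sam2_hiera_b+.yaml"

predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint, device=device))
```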
Key Features
- Unified architecture handles both image and video segmentation
- 6x faster than SAM on images due to the more efficient Hiera backbone
- Memory mechanism enables real-time video object tracking
- SA-V contains 53x more masks than the largest prior video segmentation dataset
- Interactive prompting allows corrections at any video frame (see the video sketch after this list)
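The sketch below illustrates the interactive video workflow using the repository's video predictor API; the frame directory, click coordinates, and object id are assumptions, and a CUDA GPU is assumed:

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

checkpoint = "./checkpoints/sam2_hiera_large.pt"
model_cfg = "sam2_hiera_l.yaml"
predictor = build_sam2_video_predictor(model_cfg, checkpoint)

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    # Directory of JPEG frames extracted from the video (assumed path).
    state = predictor.init_state(video_path="./video_frames")

    # Click once on the object in frame 0 (label 1 = foreground).
    frame_idx, object_ids, masks = predictor.add_new_points(
        inference_state=state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[210, 350]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )

    # Propagate the prompt through the rest of the video.
    for frame_idx, object_ids, masks in predictor.propagate_in_video(state):
        pass  # masks[i] is the predicted mask for object_ids[i] at frame_idx
```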
Comparison with Similar Tools
- SAM (v1) — image-only segmentation; SAM 2 adds video tracking and a faster backbone
- XMem — strong video object segmentation baseline; SAM 2 adds promptable interaction and better generalization
- Cutie — semi-supervised video segmentation; SAM 2 supports zero-shot prompting without per-video training
- Track Anything Model (TAM) — combines SAM with tracking heuristics; SAM 2 integrates tracking natively
FAQ
Q: Can SAM 2 run on live camera feeds? A: The streaming architecture processes frames sequentially and can work with live feeds given sufficient GPU throughput.
Q: Is SAM 2 backward compatible with SAM? A: SAM 2 treats an image as a single-frame video, so it covers SAM's image use case, and it outperforms SAM v1 on image segmentation benchmarks.
Q: What video formats are supported? A: The model processes extracted frames (JPEG/PNG). Video decoding is handled separately before inference.
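For example, frames can be extracted ahead of time with ffmpeg, invoked here from Python; the flags mirror the frame-extraction command suggested in the repo's notebooks, and the file paths are assumptions:

```python
import subprocess
from pathlib import Path

Path("video_frames").mkdir(exist_ok=True)
# -q:v 2 keeps JPEG quality high; frames are numbered 00000.jpg, 00001.jpg, ...
subprocess.run([
    "ffmpeg", "-i", "input.mp4",
    "-q:v", "2", "-start_number", "0",
    "video_frames/%05d.jpg",
], check=True)
```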
Q: How long can processed videos be? A: There is no hard limit. The memory bank keeps only a fixed-size window of recent frames (plus prompted frames), so memory use stays constant and arbitrarily long videos can be processed in streaming fashion.