# SAM 2 — Segment Anything in Images and Videos

> Meta's next-generation Segment Anything Model that extends promptable segmentation from images to videos. SAM 2 tracks and segments objects across video frames in real time with a unified architecture.

## Install

Save the content below to `.claude/skills/` or append to your `CLAUDE.md`:

## Quick Use
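```bash
# Install from source, as documented in the official repository
git clone https://github.com/facebookresearch/sam2.git && cd sam2
pip install -e .
# Download the SAM 2.1 checkpoints into ./checkpoints
(cd checkpoints && ./download_ckpts.sh)
python -c "
import torch
from sam2.build_sam import build_sam2_video_predictor
# The config name is resolved by Hydra inside the sam2 package
predictor = build_sam2_video_predictor('configs/sam2.1/sam2.1_hiera_l.yaml',
    'checkpoints/sam2.1_hiera_large.pt', device='cuda')
"
```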

## Introduction
SAM 2 (Segment Anything Model 2) extends Meta's original SAM from static images to streaming video. It introduces a memory mechanism that allows the model to track and segment objects across frames, handling occlusions, reappearances, and object deformation.

## What SAM 2 Does
- Segments objects in both images and videos with point, box, or mask prompts
- Tracks segmented objects across video frames with temporal consistency
- Handles occlusion and object reappearance using a memory bank
- Supports interactive refinement of masks on any frame during processing
- Is released together with the SA-V dataset (642K masklets across 51K videos)
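
The prompt-then-track workflow above can be sketched as a small helper. This is a minimal sketch, not the project's reference code: `track_object` is a hypothetical name, and `predictor` is assumed to behave like the object returned by `build_sam2_video_predictor`, whose `init_state`, `add_new_points_or_box`, and `propagate_in_video` methods are used here.

```python
def track_object(predictor, video_path, frame_idx, obj_id, points, labels):
    """Prompt one object on one frame, then propagate it through the video.

    Hypothetical helper: `predictor` is assumed to expose the same methods as
    the SAM 2 video predictor.
    """
    # Load the frames and create the per-video tracking state.
    state = predictor.init_state(video_path=video_path)

    # Register positive (1) / negative (0) clicks for the object on one frame.
    predictor.add_new_points_or_box(
        inference_state=state,
        frame_idx=frame_idx,
        obj_id=obj_id,
        points=points,
        labels=labels,
    )

    # Propagation yields (frame index, object ids, mask logits) per frame.
    masks = {}
    for out_frame_idx, out_obj_ids, out_mask_logits in predictor.propagate_in_video(state):
        masks[out_frame_idx] = {
            oid: out_mask_logits[i] > 0.0  # threshold logits into a binary mask
            for i, oid in enumerate(out_obj_ids)
        }
    return masks
```

With a real predictor, `points` would typically be a float array of `(x, y)` pixel coordinates and `labels` an integer array of the same length, and the call would usually run inside `torch.inference_mode()`. Calling `add_new_points_or_box` again on a later frame is how a mask is corrected mid-video.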

## Architecture Overview
SAM 2 uses a Hiera image encoder for per-frame feature extraction, a memory attention module that conditions current-frame predictions on past frames and prompted frames stored in a memory bank, and a lightweight mask decoder that follows SAM's design. A memory encoder writes per-frame predictions back to the bank for future reference. This streaming architecture processes video frame by frame, so each prediction depends only on the current frame and the contents of the memory bank.
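
To make the memory bank idea concrete, here is a toy sketch of a bounded store: a rolling window of recent frames plus a separate set of prompted frames. `ToyMemoryBank` and its window size are illustrative assumptions, not the SAM 2 implementation, and strings stand in for real feature tensors.

```python
from collections import deque


class ToyMemoryBank:
    """Toy stand-in for a memory bank: a bounded FIFO of recent frames plus a
    separate store for prompted frames. Not the SAM 2 implementation."""

    def __init__(self, num_recent=6):
        # Oldest entries fall off automatically once the deque is full.
        self.recent = deque(maxlen=num_recent)
        # Prompted frames are kept outside the rolling window.
        self.prompted = {}

    def write(self, frame_idx, features, prompted=False):
        """Store the (stand-in) memory features for one processed frame."""
        if prompted:
            self.prompted[frame_idx] = features
        else:
            self.recent.append((frame_idx, features))

    def read(self):
        """Return the frame indices the current frame would attend to."""
        return sorted(self.prompted) + [idx for idx, _ in self.recent]


bank = ToyMemoryBank(num_recent=3)
bank.write(0, "f0", prompted=True)   # frame 0 carries the user's prompt
for t in range(1, 8):
    bank.write(t, f"f{t}")           # frames 1..7 are tracked automatically
print(bank.read())                   # [0, 5, 6, 7]
```

The prompted frame stays available while only the three most recent unprompted frames are retained, which is why memory use does not grow with video length in this sketch.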

## Self-Hosting & Configuration
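- Requires Python 3.10+; the current SAM 2.1 code base lists PyTorch 2.5.1+ and torchvision 0.20.1+ (the initial release required PyTorch 2.3.1+)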
- Four checkpoint sizes: Hiera-T (39M parameters), Hiera-S (46M), Hiera-B+ (81M), Hiera-L (224M)
- GPU with 8 GB VRAM sufficient for the base model
- Jupyter notebook demos included for both image and video workflows
- ONNX export for edge deployment is available through community projects rather than the official repository
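
Choosing a checkpoint usually comes down to a size budget. The sketch below pairs each SAM 2.1 model with its config and checkpoint file names and picks the largest one under a parameter limit. The names and parameter counts are taken from the repository's README as I understand it and should be verified against the current release; `largest_model_under` is a hypothetical helper.

```python
# (config name, checkpoint file, parameters in millions) per model size.
# Values assumed from the official README; verify before relying on them.
SAM2_1_MODELS = {
    "tiny": ("configs/sam2.1/sam2.1_hiera_t.yaml", "sam2.1_hiera_tiny.pt", 38.9),
    "small": ("configs/sam2.1/sam2.1_hiera_s.yaml", "sam2.1_hiera_small.pt", 46.0),
    "base_plus": ("configs/sam2.1/sam2.1_hiera_b+.yaml", "sam2.1_hiera_base_plus.pt", 80.8),
    "large": ("configs/sam2.1/sam2.1_hiera_l.yaml", "sam2.1_hiera_large.pt", 224.4),
}


def largest_model_under(max_params_m):
    """Return the name of the biggest model within a parameter budget (millions)."""
    fitting = {
        name: spec for name, spec in SAM2_1_MODELS.items() if spec[2] <= max_params_m
    }
    if not fitting:
        raise ValueError(f"no model has <= {max_params_m}M parameters")
    return max(fitting, key=lambda name: fitting[name][2])


config, checkpoint, _ = SAM2_1_MODELS[largest_model_under(100)]
print(config, checkpoint)
```

Parameter count is only a rough proxy for VRAM use and speed, so measuring on the target GPU is still worthwhile.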

## Key Features
- Unified architecture handles both image and video segmentation
- 6x faster than SAM on images due to the more efficient Hiera backbone
- Memory mechanism enables real-time video object tracking
- SA-V dataset contains 53x more masks than any prior video segmentation dataset
- Interactive prompting allows corrections at any video frame

## Comparison with Similar Tools
- **SAM (v1)**: image-only segmentation; SAM 2 adds video tracking and a faster backbone
- **XMem**: memory-based semi-supervised video object segmentation that propagates a given first-frame mask; SAM 2 adds point and box prompts and interactive correction
- **Cutie**: semi-supervised video object segmentation with object-level memory reading, also driven by a first-frame mask; SAM 2 can start from clicks or boxes on any frame
- **Track Anything Model (TAM)**: pairs SAM with XMem as a separate tracker; SAM 2 integrates segmentation and tracking in a single model

## FAQ
**Q: Can SAM 2 run on live camera feeds?**
A: The streaming architecture processes frames sequentially and can work with live feeds given sufficient GPU throughput.

**Q: Is SAM 2 backward compatible with SAM?**
A: SAM 2 handles images as single-frame videos and outperforms SAM v1 on image segmentation benchmarks.
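
For still images, the workflow resembles SAM's predictor: embed the image once, then query it with prompts. A minimal sketch, assuming `predictor` behaves like `SAM2ImagePredictor` with its `set_image` and `predict` methods; `best_mask_for_point` is a hypothetical helper.

```python
def best_mask_for_point(predictor, image, point_coords, point_labels):
    """Segment the object under the given clicks and keep the best mask.

    Hypothetical helper: `predictor` is assumed to expose the same methods as
    the SAM 2 image predictor.
    """
    # Compute the image embedding once; later prompts reuse it.
    predictor.set_image(image)

    # Ask for several candidate masks, each with a predicted quality score.
    masks, scores, _low_res_logits = predictor.predict(
        point_coords=point_coords,
        point_labels=point_labels,
        multimask_output=True,
    )

    # Keep the candidate with the highest score.
    best = max(range(len(scores)), key=lambda i: scores[i])
    return masks[best], scores[best]
```

With a real predictor, `image` would be an RGB array, `point_coords` an array of `(x, y)` pixel positions, and `point_labels` an array of 1 (foreground) or 0 (background) values.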

**Q: What video formats are supported?**
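A: The video predictor reads a directory of JPEG frames named by frame index (for example `00000.jpg`); recent versions of the official code can also load an MP4 file directly. Other formats need to be decoded into frames before inference.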
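
When frames have to be extracted ahead of inference, one option is to call ffmpeg from Python. The sketch below builds a command that writes zero-padded, index-named JPEG files; the helper names are hypothetical and the `ffmpeg` binary is assumed to be on `PATH`.

```python
import subprocess
from pathlib import Path


def ffmpeg_extract_cmd(video_file, out_dir):
    """Build an ffmpeg command that writes frames as 00000.jpg, 00001.jpg, ..."""
    return [
        "ffmpeg",
        "-i", str(video_file),
        "-q:v", "2",            # high JPEG quality
        "-start_number", "0",   # first frame is written as 00000.jpg
        str(Path(out_dir) / "%05d.jpg"),
    ]


def extract_frames(video_file, out_dir):
    """Run the extraction; raises CalledProcessError if ffmpeg fails."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(ffmpeg_extract_cmd(video_file, out_dir), check=True)
```

Calling `extract_frames("clip.mp4", "frames")` would fill `frames/` with one JPEG per video frame, and that directory can then be passed to the video predictor.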

**Q: How long can processed videos be?**
A: There is no hard limit. The memory bank uses a fixed window, so arbitrarily long videos can be processed in streaming fashion.

## Sources
- https://github.com/facebookresearch/sam2
- https://ai.meta.com/sam2/

---
Source: https://tokrepo.com/en/workflows/sam-2-segment-anything-images-videos-c9dc9efb
Author: AI Open Source