# SAM 2: Segment Anything in Images and Videos

> Meta's next-generation Segment Anything Model that extends promptable segmentation from images to videos. SAM 2 tracks and segments objects across video frames in real time with a unified architecture.

## Install

Save the content below to `.claude/skills/` or append to your `CLAUDE.md`:

## Quick Use

```bash
# Install from the official repository and fetch the checkpoints
git clone https://github.com/facebookresearch/sam2.git
cd sam2
pip install -e .
(cd checkpoints && ./download_ckpts.sh)
python -c "
import torch
from sam2.build_sam import build_sam2_video_predictor
predictor = build_sam2_video_predictor('configs/sam2.1/sam2.1_hiera_l.yaml', 'checkpoints/sam2.1_hiera_large.pt', device='cuda')
"
```

## Introduction

SAM 2 (Segment Anything Model 2) extends Meta's original SAM from static images to streaming video. It introduces a memory mechanism that lets the model track and segment objects across frames, handling occlusions, reappearances, and object deformation.

## What SAM 2 Does

- Segments objects in both images and videos with point, box, or mask prompts
- Tracks segmented objects across video frames with temporal consistency
- Handles occlusion and object reappearance using a memory bank
- Supports interactive refinement of masks on any frame during processing
- Ships alongside the SA-V dataset, which has 642K masklets across 51K videos

## Architecture Overview

SAM 2 has four main components:

- A Hiera image encoder that extracts features for each frame
- A memory attention module that conditions the current frame's prediction on past frames and prompted frames stored in a memory bank
- A lightweight mask decoder that largely follows SAM's design
- A memory encoder that writes each frame's prediction back to the bank for future reference

This streaming architecture processes video frame by frame without requiring the full video in memory.
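
The read-then-write loop described in the architecture overview can be illustrated with a toy sketch. This is not SAM 2's implementation: `MemoryBank`, `stream_video`, and the `max_recent` window size are hypothetical names chosen for illustration, and plain frame indices stand in for the encoded memory features. It only shows how a bounded window of recent frames plus retained prompted frames keeps the memory size constant as the video grows.

```python
from collections import deque
from typing import Dict, List, Set


class MemoryBank:
    """Toy memory bank: prompted frames plus a bounded FIFO of recent frames.

    Hypothetical illustration only; frame indices stand in for memory features.
    """

    def __init__(self, max_recent: int) -> None:
        # deque(maxlen=...) drops the oldest entry automatically when full.
        self.recent: deque = deque(maxlen=max_recent)
        self.prompted: List[int] = []

    def write(self, frame_idx: int, is_prompted: bool) -> None:
        # Stands in for the memory encoder writing a frame's prediction back.
        if is_prompted:
            self.prompted.append(frame_idx)
        else:
            self.recent.append(frame_idx)

    def read(self) -> List[int]:
        # Stands in for what memory attention would condition on.
        return self.prompted + list(self.recent)


def stream_video(num_frames: int, prompted_frames: Set[int],
                 max_recent: int = 3) -> Dict[int, List[int]]:
    """Process frames in order; return which memories each frame attended to."""
    bank = MemoryBank(max_recent=max_recent)
    attended: Dict[int, List[int]] = {}
    for t in range(num_frames):
        attended[t] = bank.read()                        # condition on memory
        bank.write(t, is_prompted=t in prompted_frames)  # then write back
    return attended


if __name__ == "__main__":
    result = stream_video(num_frames=6, prompted_frames={0}, max_recent=2)
    print(result[5])  # [0, 3, 4]
```

In this sketch the prompted frame 0 stays available to every later frame, while unprompted frames fall out of the window once `max_recent` newer frames have been written.
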

## Self-Hosting & Configuration

- Requires Python 3.10+ and a recent PyTorch (the SAM 2.1 release lists PyTorch 2.5.1+; the initial release listed 2.3.1+)
- Multiple checkpoint sizes: Hiera-T (39M), Hiera-S (46M), Hiera-B+ (81M), Hiera-L (224M)
- A GPU with 8 GB of VRAM is sufficient for the base model
- Jupyter notebook demos are included for both image and video workflows
- ONNX export for edge deployment is available through community tooling

## Key Features

- Unified architecture handles both image and video segmentation
- 6x faster than SAM on images, thanks to the more efficient Hiera backbone
- Memory mechanism enables real-time video object tracking
- SA-V dataset contains 53x more masks than any prior video segmentation dataset
- Interactive prompting allows corrections at any video frame

## Comparison with Similar Tools

- **SAM (v1)**: image-only segmentation; SAM 2 adds video tracking and a faster backbone
- **XMem**: strong video object segmentation baseline; SAM 2 adds promptable interaction and better generalization
- **Cutie**: semi-supervised video object segmentation driven by a first-frame mask; SAM 2 also accepts point and box prompts on any frame
- **Track Anything Model (TAM)**: combines SAM with tracking heuristics; SAM 2 integrates tracking natively

## FAQ

**Q: Can SAM 2 run on live camera feeds?**
A: The streaming architecture processes frames sequentially and can work with live feeds given sufficient GPU throughput.

**Q: Is SAM 2 backward compatible with SAM?**
A: SAM 2 treats an image as a single-frame video and outperforms SAM v1 on image segmentation benchmarks.

**Q: What video formats are supported?**
A: The model processes extracted frames (JPEG/PNG). Video decoding is handled separately before inference.

**Q: How long can processed videos be?**
A: There is no hard limit. The memory bank uses a fixed window, so arbitrarily long videos can be processed in streaming fashion.

## Sources

- https://github.com/facebookresearch/sam2
- https://ai.meta.com/sam2/

---

Source: https://tokrepo.com/en/workflows/sam-2-segment-anything-images-videos-c9dc9efb
Author: AI Open Source