May 2, 2026 · 3 min read

SAM 2 — Segment Anything in Images and Videos

Meta's next-generation Segment Anything Model that extends promptable segmentation from images to videos. SAM 2 tracks and segments objects across video frames in real time with a unified architecture.

Introduction

SAM 2 (Segment Anything Model 2) extends Meta's original SAM from static images to streaming video. It introduces a memory mechanism that allows the model to track and segment objects across frames, handling occlusions, reappearances, and object deformation.

What SAM 2 Does

  • Segments objects in both images and videos with point, box, or mask prompts (see the sketch after this list)
  • Tracks segmented objects across video frames with temporal consistency
  • Handles occlusion and object reappearance using a memory bank
  • Supports interactive refinement of masks on any frame during processing
  • Provides the SA-V dataset with 642K masklets across 51K videos
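
To make the prompting workflow concrete, here is a minimal image-segmentation sketch using the predictor class from the official repository. The Hub model id is one of several published variants, and the click coordinates are illustrative:

```python
import numpy as np
from PIL import Image
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Loads weights from the Hugging Face Hub; the model id may differ by release.
predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")

image = np.array(Image.open("photo.jpg").convert("RGB"))
predictor.set_image(image)

# One positive click (label 1) on the object of interest.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
best_mask = masks[np.argmax(scores)]  # (H, W) boolean array
```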

Architecture Overview

SAM 2 uses a Hiera image encoder for per-frame feature extraction, a memory attention module that conditions current-frame predictions on past frames and prompted frames stored in a memory bank, and the same lightweight mask decoder from SAM. A memory encoder writes per-frame predictions back to the bank for future reference. This streaming architecture processes video frame by frame without requiring the full video in memory.
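
The repository exposes this streaming design through a video predictor. A minimal sketch, assuming 2.1-style config and checkpoint filenames (they vary by release) and a directory of pre-extracted JPEG frames:

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Config/checkpoint paths are assumptions; match them to your download.
predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_l.yaml", "checkpoints/sam2.1_hiera_large.pt"
)

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    # init_state indexes a directory of JPEG frames (one file per frame)
    state = predictor.init_state(video_path="video_frames/")

    # Prompt object 1 with a single positive click on frame 0
    predictor.add_new_points_or_box(
        state, frame_idx=0, obj_id=1,
        points=np.array([[210, 350]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )

    # Memory attention carries the object forward frame by frame
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks = (mask_logits > 0.0).cpu().numpy()  # one mask per tracked object
```

Only the prompted frames and a bounded window of recent frames are kept in the memory bank, which is what keeps the per-frame cost constant regardless of video length.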

Self-Hosting & Configuration

  • Requires Python 3.10+ and PyTorch 2.3.1+
  • Multiple checkpoint sizes: Hiera-T (39M), Hiera-S (46M), Hiera-B+ (81M), Hiera-L (224M); a loading sketch follows this list
  • GPU with 8 GB VRAM sufficient for the base model
  • Jupyter notebook demos included for both image and video workflows
  • ONNX export for edge deployment is possible through community exporters
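
A minimal self-hosting sketch for picking a checkpoint size. The config and checkpoint filenames below are assumptions based on a 2.1-style release; adjust them to whatever you downloaded:

```python
import torch
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Filenames are assumptions; match them to the checkpoints you downloaded.
CHECKPOINTS = {
    "tiny":  ("configs/sam2.1/sam2.1_hiera_t.yaml", "checkpoints/sam2.1_hiera_tiny.pt"),
    "large": ("configs/sam2.1/sam2.1_hiera_l.yaml", "checkpoints/sam2.1_hiera_large.pt"),
}

cfg, ckpt = CHECKPOINTS["tiny"]  # smallest footprint; fits comfortably in 8 GB
device = "cuda" if torch.cuda.is_available() else "cpu"
predictor = SAM2ImagePredictor(build_sam2(cfg, ckpt, device=device))
```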

Key Features

  • Unified architecture handles both image and video segmentation
  • 6x faster than SAM on images due to the more efficient Hiera backbone
  • Memory mechanism enables real-time video object tracking
  • SA-V dataset is 53x larger than prior video segmentation datasets
  • Interactive prompting allows corrections at any video frame; a refinement sketch follows this list
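
Continuing the hypothetical video session from the architecture section (`predictor` and `state` carry over), a correction is just another prompt on whichever frame has drifted; re-propagating then pushes the fix through the memory bank. The frame index and click coordinates here are illustrative:

```python
with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    # A negative click (label 0) on a later frame carves the error out of the mask
    predictor.add_new_points_or_box(
        state, frame_idx=120, obj_id=1,
        points=np.array([[305, 240]], dtype=np.float32),
        labels=np.array([0], dtype=np.int32),
    )
    # Propagate again so the correction flows forward through the memory bank
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks = (mask_logits > 0.0).cpu().numpy()
```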

Comparison with Similar Tools

  • SAM (v1) — image-only segmentation; SAM 2 adds video tracking and a faster backbone
  • XMem — strong video object segmentation baseline; SAM 2 adds promptable interaction and better generalization
  • Cutie — semi-supervised video segmentation; SAM 2 supports zero-shot prompting without per-video training
  • Track Anything Model (TAM) — combines SAM with tracking heuristics; SAM 2 integrates tracking natively

FAQ

Q: Can SAM 2 run on live camera feeds? A: The streaming architecture processes frames sequentially and can work with live feeds given sufficient GPU throughput.

Q: Is SAM 2 backward compatible with SAM? A: SAM 2 handles images as single-frame videos and outperforms SAM v1 on image segmentation benchmarks.

Q: What video formats are supported? A: The model processes extracted frames (JPEG/PNG). Video decoding is handled separately before inference.
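
As an illustration of that pre-processing step, frames can be extracted with ffmpeg before calling the video predictor; the flags below are one reasonable choice, not a requirement:

```python
import os
import subprocess

os.makedirs("video_frames", exist_ok=True)
# Extract every frame as a quality-2 JPEG, numbered 00000.jpg, 00001.jpg, ...
subprocess.run(
    ["ffmpeg", "-i", "input.mp4", "-q:v", "2", "-start_number", "0",
     "video_frames/%05d.jpg"],
    check=True,
)
```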

Q: How long can processed videos be? A: There is no hard limit. The memory bank uses a fixed window, so arbitrarily long videos can be processed in streaming fashion.
