Skills · May 2, 2026 · 1 min read

SAM 2 — Segment Anything in Images and Videos

SAM 2 is Meta's next-generation Segment Anything Model, extending promptable segmentation from images to videos. It tracks and segments objects across video frames in real time with a unified architecture.

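Agent Ready

This asset can be read and installed directly by agents.

TokRepo provides a universal CLI command, an install contract, metadata JSON, adapter-specific install plans, and raw content links, so agents can assess fit, risk, and next steps.

Native · 98/100 · Policy: Allowed
Agent entry: Any MCP/CLI agent
Type: Skill
Install: Single
Trust level: Established
Entry: SAM 2 Overview
Universal CLI install command: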
npx tokrepo install c9dc9efb-45df-11f1-9bc6-00163e2b0d79

Introduction

SAM 2 (Segment Anything Model 2) extends Meta's original SAM from static images to streaming video. It introduces a memory mechanism that allows the model to track and segment objects across frames, handling occlusions, reappearances, and object deformation.

What SAM 2 Does

  • Segments objects in both images and videos with point, box, or mask prompts
  • Tracks segmented objects across video frames with temporal consistency
  • Handles occlusion and object reappearance using a memory bank
  • Supports interactive refinement of masks on any frame during processing
  • Provides the SA-V dataset with 642K masklets across 51K videos
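The prompting workflow in the list above can be sketched with the video predictor from the official `sam2` Python package. This is a minimal sketch, not a definitive recipe: it assumes the package is installed, that a checkpoint and its matching config are available locally, and that `frames_dir` holds extracted frames. The helper names `click_prompt` and `track_from_click` are illustrative and not part of SAM 2.

```python
def click_prompt(x, y, positive=True):
    """Build a single-click prompt as (points, labels).

    Label 1 marks a foreground click, label 0 a background click.
    """
    return [[float(x), float(y)]], [1 if positive else 0]


def track_from_click(frames_dir, x, y, model_cfg, checkpoint, frame_idx=0, obj_id=1):
    """Segment the object under one click and propagate it through the video.

    Heavy dependencies are imported lazily so this module can be loaded
    on a machine that does not have SAM 2 installed.
    """
    import numpy as np
    import torch
    from sam2.build_sam import build_sam2_video_predictor

    device = "cuda" if torch.cuda.is_available() else "cpu"
    predictor = build_sam2_video_predictor(model_cfg, checkpoint, device=device)
    points, labels = click_prompt(x, y)

    masks_by_frame = {}
    with torch.inference_mode():
        state = predictor.init_state(video_path=frames_dir)
        # Attach the click to one object on the chosen frame.
        predictor.add_new_points_or_box(
            inference_state=state,
            frame_idx=frame_idx,
            obj_id=obj_id,
            points=np.array(points, dtype=np.float32),
            labels=np.array(labels, dtype=np.int32),
        )
        # Each step yields mask logits for every tracked object on one frame.
        for idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
            masks_by_frame[idx] = (mask_logits > 0.0).cpu().numpy()
    return masks_by_frame
```

A call might look like `track_from_click("frames/", 210, 350, model_cfg, checkpoint)`, where the config and checkpoint paths depend on which release was downloaded. To refine a mask interactively, add further clicks on a later frame with another `add_new_points_or_box` call before propagating again.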

Architecture Overview

SAM 2 uses a Hiera image encoder for per-frame feature extraction and a memory attention module that conditions current-frame predictions on past frames and prompted frames stored in a memory bank. A lightweight mask decoder, which largely follows SAM's design, produces the mask for each frame. A memory encoder then writes per-frame predictions back to the bank for future reference. Because this streaming architecture processes video frame by frame, the full video never needs to be held in memory.
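The per-frame data flow described above can be illustrated with a toy sketch. Nothing here is a neural network: each stage is a hypothetical stand-in operating on short lists of floats, chosen only to show the order of operations (encode, attend to memory, decode, write memory). Memory is kept as a plain list for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStreamingSegmenter:
    """Toy stand-in for the per-frame data flow; not the real model."""

    memory: list = field(default_factory=list)

    def encode_image(self, frame):
        # Stand-in for the image encoder: features are the pixel values.
        return list(frame)

    def attend_to_memory(self, features):
        # Stand-in for memory attention: average features with stored memories.
        if not self.memory:
            return features
        stacked = self.memory + [features]
        return [sum(column) / len(stacked) for column in zip(*stacked)]

    def decode_mask(self, conditioned, threshold=0.5):
        # Stand-in for the mask decoder: threshold the conditioned features.
        return [1 if value >= threshold else 0 for value in conditioned]

    def encode_memory(self, features, mask):
        # Stand-in for the memory encoder: keep features where the mask is on.
        return [f * m for f, m in zip(features, mask)]

    def step(self, frame):
        features = self.encode_image(frame)
        conditioned = self.attend_to_memory(features)
        mask = self.decode_mask(conditioned)
        self.memory.append(self.encode_memory(features, mask))
        return mask


def segment_stream(frames):
    """Yield one mask per frame; `frames` may be any iterable or generator."""
    model = ToyStreamingSegmenter()
    for frame in frames:
        yield model.step(frame)
```

With the input `[[0.9, 0.1, 0.8], [0.4, 0.2, 0.9]]`, the first element of the second frame falls below the threshold on its own, but the stored memory of the first frame keeps it in the mask. This is the intuition behind carrying an object through a brief occlusion.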

Self-Hosting & Configuration

  • Requires Python 3.10+ and PyTorch 2.3.1+
  • Multiple checkpoint sizes: Hiera-T (39M), Hiera-S, Hiera-B+, Hiera-L (224M)
  • A GPU with 8 GB of VRAM is sufficient for the base model
  • Jupyter notebook demos included for both image and video workflows
  • Supports ONNX export for edge deployment
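The version requirements above can be checked before installation with a short script. This is a minimal sketch that uses only the standard library plus an optional `torch` import. The function names are illustrative, and the minimums are taken from the list above.

```python
import re
import sys

MIN_PYTHON = (3, 10)
MIN_TORCH = (2, 3, 1)


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple.

    Suffixes such as "+cu121" or ".dev2024" are ignored.
    """
    match = re.match(r"(\d+(?:\.\d+)*)", text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(version, minimum):
    """Compare two version tuples, padding the shorter one with zeros."""
    width = max(len(version), len(minimum))

    def pad(value):
        return tuple(value) + (0,) * (width - len(value))

    return pad(version) >= pad(minimum)


def check_environment():
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    if not meets_minimum(tuple(sys.version_info[:3]), MIN_PYTHON):
        problems.append("Python 3.10 or newer is required")
    try:
        import torch
    except ImportError:
        problems.append("PyTorch is not installed")
    else:
        if not meets_minimum(parse_version(torch.__version__), MIN_TORCH):
            problems.append("PyTorch 2.3.1 or newer is required")
    return problems
```

Running `print(check_environment())` lists anything that needs attention. The script does not check GPU memory, which depends on the chosen checkpoint and input resolution.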

Key Features

  • Unified architecture handles both image and video segmentation
  • 6x faster than SAM on images due to the more efficient Hiera backbone
  • Memory mechanism enables real-time video object tracking
  • SA-V dataset contains 53x more masks than any prior video segmentation dataset
  • Interactive prompting allows corrections at any video frame

Comparison with Similar Tools

  • SAM (v1) — image-only segmentation; SAM 2 adds video tracking and a faster backbone
  • XMem — strong video object segmentation baseline; SAM 2 adds promptable interaction and better generalization
  • Cutie — semi-supervised video segmentation; SAM 2 supports zero-shot prompting without per-video training
  • Track Anything Model (TAM) — combines SAM with tracking heuristics; SAM 2 integrates tracking natively

FAQ

Q: Can SAM 2 run on live camera feeds? A: The streaming architecture processes frames sequentially and can work with live feeds given sufficient GPU throughput.

Q: Is SAM 2 backward compatible with SAM? A: SAM 2 handles images as single-frame videos and outperforms SAM v1 on image segmentation benchmarks.

Q: What video formats are supported? A: The model processes extracted frames (JPEG/PNG). Video decoding is handled separately before inference.
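Since the model consumes extracted frames, frame order matters. The sketch below collects frames from a directory in numeric order. It assumes frames were exported with purely numeric file names such as `00000.jpg`. That naming convention and the `list_frames` helper are assumptions for illustration.

```python
from pathlib import Path

FRAME_SUFFIXES = {".jpg", ".jpeg", ".png"}


def list_frames(frames_dir):
    """Return frame paths sorted by the numeric index in the file name.

    Files with other extensions or non-numeric names are skipped.
    """
    frames = [
        path
        for path in Path(frames_dir).iterdir()
        if path.suffix.lower() in FRAME_SUFFIXES and path.stem.isdigit()
    ]
    # Sort by integer value so that "10" comes after "2".
    return sorted(frames, key=lambda path: int(path.stem))
```

Sorting by integer value rather than by string avoids misordered frames when file names are not zero-padded.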

Q: How long can processed videos be? A: There is no hard limit. The memory bank uses a fixed window, so arbitrarily long videos can be processed in streaming fashion.
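The fixed-window behavior can be sketched with bounded first-in, first-out queues. This toy `BoundedMemoryBank` is hypothetical and its window sizes are arbitrary. It only shows why memory use stays constant regardless of video length.

```python
from collections import deque


class BoundedMemoryBank:
    """Toy memory bank: FIFO windows for recent frames and prompted frames."""

    def __init__(self, max_recent=6, max_prompted=4):
        # A deque with maxlen drops its oldest entry when a new one arrives.
        self.recent = deque(maxlen=max_recent)
        self.prompted = deque(maxlen=max_prompted)

    def add(self, frame_idx, memory, prompted=False):
        target = self.prompted if prompted else self.recent
        target.append((frame_idx, memory))

    def frame_indices(self):
        """Return stored frame indices, prompted frames first."""
        return [idx for idx, _ in self.prompted] + [idx for idx, _ in self.recent]

    def __len__(self):
        return len(self.prompted) + len(self.recent)
```

After one prompted frame and thousands of ordinary frames, the bank still holds only the prompted frame plus the most recent window, so its size does not grow with the video.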
