Introduction
Segment Anything (SAM) is a foundation model for image segmentation released by Meta AI Research. It was trained on the SA-1B dataset of 11 million images and over 1.1 billion masks, and it can segment any object in an image without task-specific fine-tuning, making it a general-purpose building block for computer vision pipelines.
What Segment Anything Does
- Segments any object in an image given point, box, or mask prompts (text prompting was explored in the paper but is not part of the released model)
- Generates multiple valid masks with confidence scores for ambiguous prompts (see the sketch after this list)
- Runs zero-shot on new image domains without retraining
- Produces high-quality masks at interactive speeds on GPU
- Serves as a backbone for downstream tasks like video segmentation, medical imaging, and 3D reconstruction
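As a hedged illustration of the prompted mode, the sketch below uses SamPredictor with a single foreground click and multimask_output=True, which returns up to three candidate masks ranked by predicted IoU. The checkpoint filename, image path, and click coordinates are placeholders for this example.

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a checkpoint (filename is a placeholder; see the repository for downloads).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
sam.to(device="cuda")
predictor = SamPredictor(sam)

# SamPredictor expects an HxWx3 uint8 RGB array.
image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# One foreground click (label 1 = foreground, 0 = background).
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,  # return several candidates for an ambiguous click
)
best = masks[np.argmax(scores)]  # keep the mask with the highest predicted IoU
```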
Architecture Overview
SAM consists of three components: a ViT-based image encoder, a flexible prompt encoder that handles points, boxes, and coarse masks (free-form text was explored in the paper but is not included in the released model), and a lightweight mask decoder. The image encoder runs once per image, and the prompt encoder plus mask decoder run per query, enabling real-time interactive segmentation. The model outputs a predicted IoU score per mask for automatic quality ranking.
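Because the heavy image encoder runs once inside set_image, a single embedding can serve many cheap prompt queries. The sketch below, reusing the predictor and image from the earlier example with placeholder coordinates, issues a point prompt and then a box prompt against the same cached embedding.

```python
import numpy as np

# The expensive ViT forward pass happens once, inside set_image.
predictor.set_image(image)

# Each predict call runs only the prompt encoder and mask decoder,
# which takes milliseconds on a GPU.
point_masks, point_scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
box_masks, box_scores, _ = predictor.predict(
    box=np.array([100, 100, 400, 400]),  # XYXY pixel coordinates
    multimask_output=False,  # a box is usually unambiguous; one mask suffices
)
```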
Self-Hosting & Configuration
- Install via pip and download model checkpoints (ViT-H, ViT-L, or ViT-B) from the repository
- Requires Python 3.8+, PyTorch 1.7+, and torchvision 0.8+; a CUDA GPU is strongly recommended for interactive speeds
- The SamPredictor class provides a simple API for single-image segmentation
- SamAutomaticMaskGenerator generates masks for all objects in an image without prompts (see the end-to-end sketch after this list)
- Integrates with OpenCV and PIL for image I/O
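A minimal end-to-end sketch of the automatic mode mentioned above. The install command matches the repository README, and the checkpoint filename is the one published at the time of writing; treat exact filenames as assumptions to verify.

```python
# pip install git+https://github.com/facebookresearch/segment-anything.git
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to(device="cuda")

generator = SamAutomaticMaskGenerator(sam)  # samples a point grid over the image

image = cv2.cvtColor(cv2.imread("scene.jpg"), cv2.COLOR_BGR2RGB)
masks = generator.generate(image)

# Each entry is a dict with keys such as "segmentation" (bool HxW array),
# "area", "bbox" (XYWH), "predicted_iou", and "stability_score".
masks.sort(key=lambda m: m["area"], reverse=True)
print(f"{len(masks)} masks; largest covers {masks[0]['area']} pixels")
```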
Key Features
- Zero-shot generalization to unseen object types and visual domains
- Interactive prompting with points, bounding boxes, or masks
- Automatic mask generation mode for full-scene segmentation
- Three model sizes (ViT-B, ViT-L, ViT-H) trading speed for accuracy (see the loading sketch after this list)
- Permissive Apache 2.0 license for commercial use
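All three sizes share the same API and differ only in the registry key and checkpoint file. A sketch of switching between them; the filenames below are the ones published at the time of writing and may change.

```python
from segment_anything import sam_model_registry

# Registry key -> published checkpoint file (verify against the repository).
checkpoints = {
    "vit_b": "sam_vit_b_01ec64.pth",  # smallest and fastest
    "vit_l": "sam_vit_l_0b3195.pth",
    "vit_h": "sam_vit_h_4b8939.pth",  # largest and most accurate
}

variant = "vit_b"
sam = sam_model_registry[variant](checkpoint=checkpoints[variant])
```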
Comparison with Similar Tools
- SAM 2 — extends SAM to video with streaming memory; higher temporal consistency
- Grounding DINO — open-set object detection; often paired with SAM for text-prompted segmentation
- Detectron2 — full detection/segmentation framework; requires task-specific training
- YOLO — optimized for real-time detection; less precise per-pixel masks
- U-Net — classic medical segmentation; needs domain-specific labeled data
FAQ
Q: Can SAM run on CPU? A: Yes, but inference is significantly slower. GPU is recommended for interactive use.
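A small sketch of device selection: the model is an ordinary torch.nn.Module, so the usual .to() call applies (checkpoint filename as in the earlier examples).

```python
import torch
from segment_anything import sam_model_registry

# Fall back to CPU transparently when no GPU is available; inference still
# works, just much more slowly.
device = "cuda" if torch.cuda.is_available() else "cpu"
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
sam.to(device)
```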
Q: Does SAM understand semantic categories? A: SAM segments objects by spatial prompts, not semantic labels. Pair it with a classifier or Grounding DINO for labeled segmentation.
Q: What image formats are supported? A: Any format readable by PIL or OpenCV, including JPEG, PNG, TIFF, and BMP.
Q: Can I fine-tune SAM on custom data? A: Yes. The weights are ordinary PyTorch modules, and a common approach is to freeze the image encoder and fine-tune only the lightweight mask decoder, though zero-shot performance is strong for most use cases.
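A rough sketch of that decoder-only approach, assuming a hypothetical dataloader that yields precomputed image embeddings, box prompts, and ground-truth masks; the loss and learning rate are illustrative choices, not from the SAM paper.

```python
import torch
import torch.nn.functional as F

# Freeze the heavy image encoder; train only the lightweight mask decoder.
for p in sam.image_encoder.parameters():
    p.requires_grad_(False)
optimizer = torch.optim.Adam(sam.mask_decoder.parameters(), lr=1e-5)

for image_embedding, box, gt_mask in dataloader:  # hypothetical dataloader
    # Encode the box prompt; point or mask prompts would be passed the same way.
    sparse, dense = sam.prompt_encoder(points=None, boxes=box, masks=None)
    low_res_logits, iou_pred = sam.mask_decoder(
        image_embeddings=image_embedding,
        image_pe=sam.prompt_encoder.get_dense_pe(),
        sparse_prompt_embeddings=sparse,
        dense_prompt_embeddings=dense,
        multimask_output=False,
    )
    # SAM's low-res logits are 256x256; resize the ground truth to match.
    gt = F.interpolate(gt_mask.float(), size=low_res_logits.shape[-2:])
    loss = F.binary_cross_entropy_with_logits(low_res_logits, gt)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```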