Introduction
Segment Anything (SAM) is a foundation model for image segmentation released by Meta AI Research. It was trained on the SA-1B dataset of 11 million images and over 1.1 billion masks, and it can segment any object in an image without task-specific fine-tuning, making it a general-purpose building block for computer vision pipelines.
What Segment Anything Does
- Segments any object in an image given point, box, or mask prompts (text prompting was explored in the paper but is not part of the released model)
- Generates multiple valid masks with confidence scores for ambiguous prompts (see the sketch after this list)
- Runs zero-shot on new image domains without retraining
- Produces high-quality masks at interactive speeds on GPU
- Serves as a backbone for downstream tasks like video segmentation, medical imaging, and 3D reconstruction
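As a hedged illustration of the prompted mode, the sketch below uses SamPredictor with a single foreground click and multimask_output=True, which returns up to three candidate masks ranked by predicted IoU. The checkpoint filename, image path, and click coordinates are placeholders for this example.

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a checkpoint (filename is a placeholder; see the repository for downloads).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
sam.to(device="cuda")
predictor = SamPredictor(sam)

# SamPredictor expects an HxWx3 uint8 RGB array.
image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# One foreground click (label 1 = foreground, 0 = background).
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,  # return several candidates for an ambiguous click
)
best = masks[np.argmax(scores)]  # keep the mask with the highest predicted IoU
```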
Architecture Overview
SAM consists of three components: a ViT-based image encoder, a flexible prompt encoder that handles points, boxes, and coarse masks (free-form text was explored in the paper but is not included in the released model), and a lightweight mask decoder. The image encoder runs once per image, and the prompt encoder plus mask decoder run per query, enabling real-time interactive segmentation. The model outputs a predicted IoU score per mask for automatic quality ranking.
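Because the heavy image encoder runs once inside set_image, a single embedding can serve many cheap prompt queries. The sketch below, reusing the predictor and image from the earlier example with placeholder coordinates, issues a point prompt and then a box prompt against the same cached embedding.

```python
import numpy as np

# The expensive ViT forward pass happens once, inside set_image.
predictor.set_image(image)

# Each predict call runs only the prompt encoder and mask decoder,
# which takes milliseconds on a GPU.
point_masks, point_scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
box_masks, box_scores, _ = predictor.predict(
    box=np.array([100, 100, 400, 400]),  # XYXY pixel coordinates
    multimask_output=False,  # a box is usually unambiguous; one mask suffices
)
```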
Self-Hosting & Configuration
- Install via pip and download model checkpoints (ViT-H, ViT-L, or ViT-B) from the repository
- Requires Python 3.8+, PyTorch 1.7+, and torchvision 0.8+; a CUDA GPU is strongly recommended for interactive speeds
- The SamPredictor class provides a simple API for single-image segmentation
- SamAutomaticMaskGenerator generates masks for all objects in an image without prompts (see the end-to-end sketch after this list)
- Integrates with OpenCV and PIL for image I/O
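A minimal end-to-end sketch of the automatic mode mentioned above. The install command matches the repository README, and the checkpoint filename is the one published at the time of writing; treat exact filenames as assumptions to verify.

```python
# pip install git+https://github.com/facebookresearch/segment-anything.git
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to(device="cuda")

generator = SamAutomaticMaskGenerator(sam)  # samples a point grid over the image

image = cv2.cvtColor(cv2.imread("scene.jpg"), cv2.COLOR_BGR2RGB)
masks = generator.generate(image)

# Each entry is a dict with keys such as "segmentation" (bool HxW array),
# "area", "bbox" (XYWH), "predicted_iou", and "stability_score".
masks.sort(key=lambda m: m["area"], reverse=True)
print(f"{len(masks)} masks; largest covers {masks[0]['area']} pixels")
```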
Key Features
- Zero-shot generalization to unseen object types and visual domains
- Interactive prompting with points, bounding boxes, or masks
- Automatic mask generation mode for full-scene segmentation
- Three model sizes (ViT-B, ViT-L, ViT-H) trading speed for accuracy (see the loading sketch after this list)
- Permissive Apache 2.0 license for commercial use
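All three sizes share the same API and differ only in the registry key and checkpoint file. A sketch of switching between them; the filenames below are the ones published at the time of writing and may change.

```python
from segment_anything import sam_model_registry

# Registry key -> published checkpoint file (verify against the repository).
checkpoints = {
    "vit_b": "sam_vit_b_01ec64.pth",  # smallest and fastest
    "vit_l": "sam_vit_l_0b3195.pth",
    "vit_h": "sam_vit_h_4b8939.pth",  # largest and most accurate
}

variant = "vit_b"
sam = sam_model_registry[variant](checkpoint=checkpoints[variant])
```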
Comparison with Similar Tools
- SAM 2 — extends SAM to video with streaming memory; higher temporal consistency
- Grounding DINO — open-set object detection; often paired with SAM for text-prompted segmentation
- Detectron2 — full detection/segmentation framework; requires task-specific training
- YOLO — optimized for real-time detection; less precise per-pixel masks
- U-Net — classic medical segmentation; needs domain-specific labeled data
FAQ
Q: Can SAM run on CPU? A: Yes, but inference is significantly slower. GPU is recommended for interactive use.
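A small sketch of device selection: the model is an ordinary torch.nn.Module, so the usual .to() call applies (checkpoint filename as in the earlier examples).

```python
import torch
from segment_anything import sam_model_registry

# Fall back to CPU transparently when no GPU is available; inference still
# works, just much more slowly.
device = "cuda" if torch.cuda.is_available() else "cpu"
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
sam.to(device)
```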
Q: Does SAM understand semantic categories? A: SAM segments objects by spatial prompts, not semantic labels. Pair it with a classifier or Grounding DINO for labeled segmentation.
Q: What image formats are supported? A: Any format readable by PIL or OpenCV, including JPEG, PNG, TIFF, and BMP.
Q: Can I fine-tune SAM on custom data? A: Yes. The weights are ordinary PyTorch modules, and a common approach is to freeze the image encoder and fine-tune only the lightweight mask decoder, though zero-shot performance is strong for most use cases.
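A rough sketch of that decoder-only approach, assuming a hypothetical dataloader that yields precomputed image embeddings, box prompts, and ground-truth masks; the loss and learning rate are illustrative choices, not from the SAM paper.

```python
import torch
import torch.nn.functional as F

# Freeze the heavy image encoder; train only the lightweight mask decoder.
for p in sam.image_encoder.parameters():
    p.requires_grad_(False)
optimizer = torch.optim.Adam(sam.mask_decoder.parameters(), lr=1e-5)

for image_embedding, box, gt_mask in dataloader:  # hypothetical dataloader
    # Encode the box prompt; point or mask prompts would be passed the same way.
    sparse, dense = sam.prompt_encoder(points=None, boxes=box, masks=None)
    low_res_logits, iou_pred = sam.mask_decoder(
        image_embeddings=image_embedding,
        image_pe=sam.prompt_encoder.get_dense_pe(),
        sparse_prompt_embeddings=sparse,
        dense_prompt_embeddings=dense,
        multimask_output=False,
    )
    # SAM's low-res logits are 256x256; resize the ground truth to match.
    gt = F.interpolate(gt_mask.float(), size=low_res_logits.shape[-2:])
    loss = F.binary_cross_entropy_with_logits(low_res_logits, gt)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```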