A promptable foundation model that segments any object in any image.
Segment Anything Model (SAM) is a promptable image segmentation system developed by Meta AI Research and released in April 2023. Unlike conventional segmentation models trained to recognize a fixed set of object categories, SAM is designed as a foundation model for segmentation: a general-purpose system capable of isolating any object in an image given a minimal prompt such as a point, a bounding box, or a rough mask (the accompanying paper also explored free-form text prompts, though text support was not part of the public release). This zero-shot generalization is made possible by training on SA-1B, a dataset of 1.1 billion segmentation masks across 11 million images, the largest segmentation dataset assembled at the time of release.
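As a concrete illustration of this promptable interface, the sketch below uses the open-source segment-anything package that Meta released alongside the model. The checkpoint filename, image path, and prompt coordinates are placeholders rather than values from the original paper.

```python
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# Placeholder paths: the ViT-B checkpoint and test image must exist locally.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

# SamPredictor expects an RGB uint8 array of shape (H, W, 3).
image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Prompt 1: a single foreground click (label 1 = foreground, 0 = background).
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,  # three candidates, since one click is ambiguous
)

# Prompt 2: a bounding box in (x1, y1, x2, y2) pixel coordinates.
box_masks, box_scores, _ = predictor.predict(
    box=np.array([425, 600, 700, 875]),
    multimask_output=False,
)
print(masks.shape, scores)  # (3, H, W) boolean masks with quality scores
```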
SAM's architecture consists of three components: an image encoder based on a Vision Transformer (ViT) that computes a dense embedding of the input image, a prompt encoder that embeds sparse prompts (points and boxes) and dense prompts (rough masks), and a lightweight mask decoder that combines both representations to produce one or more candidate segmentation masks, each with a predicted quality score. Because the heavy image embedding is computed once and cached, the prompt encoder and mask decoder can answer new prompts in real time (roughly 50 ms per prompt), enabling interactive workflows in which users iteratively refine a segmentation with additional clicks or corrections.
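That interactive loop can be sketched with the same package: the expensive ViT forward pass runs once in set_image, and every subsequent predict call runs only the light prompt encoder and mask decoder. Feeding the previous call's low-resolution logits back through mask_input is the refinement mechanism demonstrated in the library's own example notebooks; the specific clicks below are illustrative.

```python
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# Setup as in the earlier sketch; paths are placeholders.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)
image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)

# The expensive image-encoder forward pass happens exactly once, here.
predictor.set_image(image)

# First click: one foreground point; ask for all three candidate masks.
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
best = int(np.argmax(scores))  # index of the highest-scoring candidate

# Second round: add a background click (label 0) to carve off a wrong region,
# and feed back the previous low-res logits so the decoder refines that mask.
refined, _, _ = predictor.predict(
    point_coords=np.array([[500, 375], [450, 300]]),
    point_labels=np.array([1, 0]),
    mask_input=logits[best][None, :, :],  # shape (1, 256, 256)
    multimask_output=False,  # prompt is now unambiguous; one mask suffices
)
```

Only the two small decoder-side modules run in the second call, which is why each corrective click feels instantaneous in the interactive demo.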
The model's design philosophy centers on "promptability": the idea that a single model should handle the full diversity of segmentation tasks by accepting flexible expressions of user intent rather than requiring task-specific fine-tuning. This mirrors how large language models handle diverse text tasks through prompting, and it positions SAM as an analogous foundation model for visual perception. The approach enables downstream applications ranging from medical image analysis and satellite imagery interpretation to augmented reality and robotic scene understanding.
SAM's release had broad practical and research impact, spurring extensions such as SAM 2, which generalizes the approach to video by tracking and segmenting objects across frames. Its open-source availability accelerated adoption across scientific and engineering domains, and it established a template for how foundation models could be applied to structured prediction tasks in computer vision beyond classification and detection.