Segment Anything: Promptable Segmentation at Web Scale

TL;DR

SAM reframed image segmentation as a promptable foundation-model task, backed by a large model and the SA-1B mask dataset.

What problem it solves

Segmentation systems were usually trained for specific categories, domains, or annotation styles. Segment Anything asks whether segmentation can become a general promptable capability: click a point, draw a box, provide a mask cue, and get a useful object mask.
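To make the prompt-in, mask-out interface concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the `Prompt` container and `toy_segment` function are hypothetical names, and the flood fill is a deliberately simple stand-in for a learned model, not how SAM computes masks. It only shows how a point and an optional box can jointly determine the returned mask.

```python
from collections import deque
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Prompt:
    """Hypothetical prompt container for this sketch; not SAM's API."""
    point: Tuple[int, int]                           # (row, col) foreground click
    box: Optional[Tuple[int, int, int, int]] = None  # (row0, col0, row1, col1), inclusive


def toy_segment(image: List[List[int]], prompt: Prompt) -> List[List[bool]]:
    """Toy stand-in for a learned model: flood-fill from the click.

    Grows a 4-connected region of pixels that share the clicked pixel's
    value, restricted to the box when one is given.
    """
    h, w = len(image), len(image[0])
    r0, c0, r1, c1 = prompt.box if prompt.box is not None else (0, 0, h - 1, w - 1)
    # Clamp the box to the image so indexing stays in bounds.
    r0, c0, r1, c1 = max(r0, 0), max(c0, 0), min(r1, h - 1), min(c1, w - 1)

    mask = [[False] * w for _ in range(h)]
    sr, sc = prompt.point
    if not (r0 <= sr <= r1 and c0 <= sc <= c1):
        return mask  # click outside the allowed region: empty mask

    target = image[sr][sc]
    mask[sr][sc] = True
    queue = deque([(sr, sc)])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = r0 <= nr <= r1 and c0 <= nc <= c1
            if inside and not mask[nr][nc] and image[nr][nc] == target:
                mask[nr][nc] = True
                queue.append((nr, nc))
    return mask


image = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 1, 0],
]
full = toy_segment(image, Prompt(point=(1, 1)))                      # point only
boxed = toy_segment(image, Prompt(point=(1, 1), box=(1, 1, 2, 2)))  # point + box
print(sum(map(sum, full)), sum(map(sum, boxed)))
```

With the point alone, the toy returns the whole six-pixel shape; adding the box narrows the result to the four pixels inside it. The same click can therefore yield different masks depending on the other cues supplied, which is the behavior a promptable model is asked to learn.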

The core method

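SAM combines three components: an image encoder that embeds the image, a prompt encoder that embeds points, boxes, or masks, and a lightweight mask decoder that combines the two embeddings to predict masks. Its training data comes from a model-in-the-loop data engine, in which the model assists annotation and the newly collected masks are used to retrain it. The result is SA-1B, a dataset of more than one billion masks on about 11 million images. This scale lets the model generalize to many objects and visual domains.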
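The division of labor between the three components can be sketched structurally. The class below is a hypothetical illustration, not the released implementation: each stage is a trivial placeholder, and the names `ToyPromptableSegmenter`, `set_image`, and `predict` are chosen for this sketch. The point it makes is that the expensive image encoding can be computed once per image and reused, while the prompt encoder and mask decoder run again for every prompt.

```python
from typing import Dict, List, Optional, Tuple

Image = List[List[int]]
Mask = List[List[bool]]


class ToyPromptableSegmenter:
    """Structural sketch only; every stage is a placeholder, not a network."""

    def __init__(self) -> None:
        self.encoder_calls = 0                    # counts image-encoder runs
        self._embedding: Optional[Image] = None   # cached per-image result

    def set_image(self, image: Image) -> None:
        # Image encoder: the heavy stage, run once per image.
        self.encoder_calls += 1
        self._embedding = [row[:] for row in image]  # placeholder "embedding"

    def _encode_prompt(self, point: Tuple[int, int]) -> Dict[str, int]:
        # Prompt encoder: cheap, run once per prompt.
        return {"row": point[0], "col": point[1]}

    def _decode_mask(self, prompt: Dict[str, int]) -> Mask:
        # Mask decoder placeholder: select every pixel whose embedding value
        # matches the value at the prompted location.
        target = self._embedding[prompt["row"]][prompt["col"]]
        return [[value == target for value in row] for row in self._embedding]

    def predict(self, point: Tuple[int, int]) -> Mask:
        if self._embedding is None:
            raise RuntimeError("call set_image() before predict()")
        return self._decode_mask(self._encode_prompt(point))


model = ToyPromptableSegmenter()
model.set_image([[1, 1, 0], [0, 2, 2]])
mask_a = model.predict((0, 0))   # first prompt
mask_b = model.predict((1, 1))   # second prompt reuses the cached embedding
print(model.encoder_calls)
```

Two prompts produce two different masks, yet the encoder counter stays at one. Under this arrangement the per-prompt cost is dominated by the small decoder, which is what makes interactive use practical.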

Key results

SAM generates masks from point, box, and mask prompts, and often transfers zero-shot to new images, that is, without task-specific retraining. The paper’s contribution is both the model and the dataset: the data engine is what makes promptable segmentation trainable at this scale.

Why it matters

SAM turned segmentation into infrastructure. Annotation tools, image editors, robotics systems, medical workflows, and data pipelines can all use a general mask proposal model as a starting point, then layer domain-specific checks on top.
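Layering domain-specific checks over generic mask proposals might look like the sketch below. All names and thresholds here are hypothetical: `Proposal` assumes each mask arrives with a pixel area and a model-reported quality score, and the two checks are placeholders for whatever rules a particular domain requires.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Proposal:
    """Hypothetical summary of one proposed mask."""
    area: int      # mask area in pixels
    score: float   # assumed model-reported quality estimate in [0, 1]


Check = Callable[[Proposal], bool]


def large_enough(p: Proposal) -> bool:
    return p.area >= 50      # illustrative threshold, not a recommended value


def confident(p: Proposal) -> bool:
    return p.score >= 0.8    # illustrative threshold, not a recommended value


def apply_checks(proposals: List[Proposal], checks: List[Check]) -> List[Proposal]:
    """Keep only the proposals that pass every domain-specific check."""
    return [p for p in proposals if all(check(p) for check in checks)]


proposals = [Proposal(10, 0.95), Proposal(400, 0.90), Proposal(900, 0.40)]
kept = apply_checks(proposals, [large_enough, confident])
print(kept)
```

Keeping the checks as separate functions means the general proposal model stays untouched while each application swaps in its own rules.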

Limits and open questions

SAM is not a full understanding system. It can produce plausible masks without knowing object identity, function, or safety relevance. Domain shift, tiny structures, medical imagery, transparent objects, and temporal consistency still require careful evaluation.
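One common ingredient of such an evaluation is intersection over union (IoU) between a predicted mask and a reference mask from the target domain. The function below is a minimal pure-Python sketch; the name `mask_iou` and the convention of returning 1.0 when both masks are empty are choices made for this example.

```python
from typing import List

Mask = List[List[bool]]


def mask_iou(pred: Mask, truth: Mask) -> float:
    """Intersection over union of two binary masks of identical shape."""
    if len(pred) != len(truth) or any(len(p) != len(t) for p, t in zip(pred, truth)):
        raise ValueError("masks must have the same shape")
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += bool(p and t)   # pixel set in both masks
            union += bool(p or t)    # pixel set in at least one mask
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


pred = [[True, True, False], [False, False, False]]
truth = [[True, False, False], [True, False, False]]
print(mask_iou(pred, truth))
```

Here the masks share one pixel out of three covered in total, so the IoU is one third. Averaging this score over a held-out set from the target domain, and inspecting the lowest-scoring cases, is one way to surface failures on tiny or transparent structures.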

One line: SAM made segmentation feel like an interactive foundation-model primitive.