Meta Unveils Its Third-Generation "Segment Anything" Model: Grasping Concepts Through Prompts

11/17/2025

Moments ago, a post on X (formerly Twitter) brought attention to a research paper detailing Meta's SAM 3. This paper has been submitted for consideration at ICLR 2026.

SAM, short for "Segment Anything Model," is a segmentation model developed by Meta. It was introduced to the public in April 2023 and has since shown remarkable capabilities in computer vision, particularly promptable image segmentation.

The newly launched SAM 3 is a unified model that detects, segments, and tracks objects in both images and videos, guided by concept prompts.

Through Promptable Concept Segmentation (PCS), the model yields segmentation masks along with unique identifiers for every matching object instance. To bolster PCS, the research team constructed a scalable data engine capable of producing a high-quality dataset with 4 million distinct concept labels, encompassing both images and videos.

On image and video PCS, SAM 3 delivers a twofold improvement over existing systems. It also improves on SAM 2 in interactive visual segmentation tasks.

SAM 3, together with the new "Segment Anything with Concepts" (SA-Co) benchmark, has now been released publicly.

SAM 3 builds upon SAM 2, supporting the newly introduced PCS task alongside the existing promptable visual segmentation (PVS) task. It accepts concept prompts (short noun phrases or image exemplars) or visual prompts (points, boxes, or masks) to specify which objects to segment individually, in space and, for video, over time.
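To make the prompt types and the PCS output concrete, the sketch below shows one way the inputs and outputs could be represented; the class and field names are illustrative assumptions, not SAM 3's actual interface.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple
import numpy as np

# Hypothetical containers for the two prompt families and the per-instance
# PCS output (a mask plus a stable identifier); names are assumptions, not
# taken from the SAM 3 codebase.

@dataclass
class ConceptPrompt:
    noun_phrase: Optional[str] = None                                   # e.g. "yellow school bus"
    exemplar_boxes: List[Tuple[int, int, int, int]] = field(default_factory=list)

@dataclass
class VisualPrompt:
    points: List[Tuple[int, int, bool]] = field(default_factory=list)   # (x, y, is_positive)
    box: Optional[Tuple[int, int, int, int]] = None
    mask: Optional[np.ndarray] = None                                    # H x W boolean mask

@dataclass
class InstancePrediction:
    instance_id: int    # unique per object; reused across frames in video
    mask: np.ndarray    # H x W boolean segmentation mask
    score: float        # model confidence for this instance

# A concept prompt asks for *every* matching instance; a visual prompt
# (points, box, or mask) pins down one particular object, as in SAM 2's PVS.
prompt = ConceptPrompt(noun_phrase="striped umbrella")
```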

SAM 3's architecture comprises a dual encoder-decoder transformer design: a detector coupled with a tracker and a video memory. Both the detector and the tracker receive vision-language input from an aligned Perception Encoder (PE) backbone.
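The following minimal sketch illustrates that data flow; the stub classes and method names are placeholders invented for exposition, not the paper's modules. A shared backbone encodes each frame together with the text prompt, the detector proposes per-frame instance masks, and the tracker's memory keeps identities consistent across frames.

```python
import numpy as np
from typing import Dict, List

# Illustrative stubs only: the real SAM 3 components are transformers, and
# these names and interfaces are assumptions made for this sketch.

class PerceptionEncoderStub:
    """Stand-in for the aligned vision-language (PE) backbone."""
    def encode(self, frame: np.ndarray, phrase: str) -> np.ndarray:
        return np.zeros(256, dtype=np.float32)   # real model: fused image-text features

class DetectorStub:
    """Stand-in for the encoder-decoder transformer detector."""
    def detect(self, features: np.ndarray) -> List[Dict]:
        return []                                # real model: per-frame masks and scores

class TrackerStub:
    """Stand-in for the tracker; its memory carries instance identities over time."""
    def __init__(self) -> None:
        self.memory: Dict[int, np.ndarray] = {}  # instance_id -> remembered features
    def associate(self, detections: List[Dict]) -> List[Dict]:
        return detections                        # real model: match to remembered identities

def segment_video(frames: List[np.ndarray], phrase: str) -> List[List[Dict]]:
    backbone, detector, tracker = PerceptionEncoderStub(), DetectorStub(), TrackerStub()
    per_frame = []
    for frame in frames:
        feats = backbone.encode(frame, phrase)   # shared features for detector and tracker
        per_frame.append(tracker.associate(detector.detect(feats)))
    return per_frame
```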

The team engineered an efficient data engine that iteratively generates annotated data through a feedback loop involving SAM 3 itself, human annotators, and AI annotators. It actively mines media-phrase pairs on which the current version of SAM 3 still fails, turning them into high-quality training data for further model refinement. This approach has more than doubled annotation throughput.
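A toy sketch of that feedback loop is shown below, under the assumption that proposals are checked first by an automated verifier and escalated to humans only when needed; the helper functions are placeholders, and only the loop structure reflects the description above.

```python
import random
from typing import Dict, List, Optional, Tuple

# Placeholder stand-ins for the loop's participants; the decision logic is
# invented for illustration.

def propose_masks(pair: Tuple[str, str]) -> Dict:
    """Current SAM 3 checkpoint proposes masks for a (media, phrase) pair."""
    return {"pair": pair, "confidence": random.random()}

def ai_verify(candidate: Dict) -> Optional[bool]:
    """Fine-tuned LLM verifier; returns None when it is too unsure to decide."""
    return random.choice([True, False, None])    # placeholder decision

def human_verify(candidate: Dict) -> bool:
    """Human annotators settle the cases the AI verifier cannot."""
    return True

def data_engine_round(pairs: List[Tuple[str, str]], hard_threshold: float = 0.5) -> List[Dict]:
    accepted = []
    for pair in pairs:
        cand = propose_masks(pair)
        # Spend annotation effort on pairs the current model handles poorly.
        if cand["confidence"] >= hard_threshold:
            continue
        verdict = ai_verify(cand)
        if verdict is None:                      # escalate only the ambiguous cases
            verdict = human_verify(cand)
        if verdict:
            accepted.append(cand)                # training data for the next round
    return accepted
```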

The data engine was built out in four stages:

Stage 1: Human Verification. Data mining begins with randomly sampled images, and noun-phrase proposals are generated by simple captioners and parsers.

Stage 2: Human-Machine Collaborative Verification. Using the labels that humans accepted or rejected for the MV and EV tasks in Stage 1, Llama 3.2 is fine-tuned into an AI verifier that performs MV and EV automatically.

Stage 3: Scaling and Domain Expansion. AI models are employed to mine increasingly complex cases, and the domain coverage of SA-Co/HQ is broadened to include 15 datasets.

Stage 4: Video Annotation. This stage extends the data engine's capabilities to the video domain.

The team benchmarked SAM 3 on image and video segmentation, few-shot adaptation for detection, and object counting, and evaluated segmentation from complex language queries by pairing SAM 3 with an MLLM.

The results indicate that zero-shot SAM 3 is highly competitive on closed-vocabulary COCO, COCO-O, and LVIS bounding boxes, and it performs even more impressively on LVIS masks.

On open-vocabulary SA-Co/Gold, SAM 3's CGF1 score is double that of the strongest baseline, OWLv2, and reaches 88% of the estimated lower bound of human performance.

Open-vocabulary semantic segmentation results on ADE-847, PASCAL Context-59, and Cityscapes show that SAM 3 surpasses APE, a strong specialized baseline.

In comparison to MLLMs, SAM 3 not only attains high object counting accuracy but also provides object segmentation capabilities that most MLLMs lack.

In benchmark tests for text-prompted video PCS, SAM 3 exhibits exceptional performance, particularly on benchmark sets containing a large number of noun phrases.

SAM 3 achieves notable improvements over SAM 2 in most benchmarks, especially on the challenging MOSEv2 dataset, where it outperforms previous studies by 6 points. For interactive image segmentation tasks, SAM 3 exceeds SAM 2 in terms of average mIoU.
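For reference, mIoU (mean intersection over union), the metric used in that interactive comparison, averages the overlap between predicted and ground-truth masks across examples; a minimal computation looks like this:

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0                        # both masks empty: count as a perfect match
    return float(np.logical_and(pred, gt).sum() / union)

def mean_iou(preds, gts) -> float:
    """Average IoU over a set of (prediction, ground truth) mask pairs."""
    return float(np.mean([iou(p, g) for p, g in zip(preds, gts)]))

# Example: two 4x4 masks of 4 pixels each that share 2 pixels -> IoU = 2/6.
a = np.zeros((4, 4), dtype=bool); a[0, :] = True
b = np.zeros((4, 4), dtype=bool); b[0, 2:] = True; b[1, 2:] = True
print(round(mean_iou([a], [b]), 3))       # 0.333
```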

The main contributions outlined in the paper include:

1. Introducing the PCS task and the SA-Co benchmark.

2. Proposing a decoupled recognition-localization architecture that enhances the PCS capabilities of SAM 2 while preserving its PVS functionalities.

3. Developing a high-quality, efficient data engine that incorporates both human and AI annotators.

Researchers assert that SAM 3 and the SA-Co benchmark will serve as significant milestones, laying the groundwork for future research and applications in the field of computer vision.

References:

https://openreview.net/pdf?id=r35clVtGzw
