What is Segment Anything
Segment Anything is a new AI model from Meta AI that can "cut out" any object, in any image, with a single click. It is a promptable segmentation system with zero-shot generalization to unfamiliar objects and images, without the need for additional training.
Features of Segment Anything
- Promptable segmentation system with zero-shot generalization to unfamiliar objects and images
- Can take input prompts from other systems; for example, a user's gaze from an AR/VR headset could be used in the future to select an object
- Extensible outputs: output masks can be used as inputs to other AI systems
- Zero-shot generalization: the model has learned a general notion of what objects are, so it can segment unfamiliar objects and images without additional training
How to Use Segment Anything
- Select a prompt type: foreground/background points, a bounding box, or a mask (text prompts are explored in the paper, but that capability is not released)
- Provide the prompt: click foreground/background points on the image or draw a bounding box
- Get the output: the model generates one or more masks for the selected object (see the Python sketch below)
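A minimal sketch of these steps using the official segment-anything Python package. The checkpoint and image file names are placeholders, and the point coordinates are arbitrary; see the GitHub repo for checkpoint download links.

```python
import cv2
import numpy as np
import torch
from segment_anything import SamPredictor, sam_model_registry

# Load the ViT-H SAM model from a locally downloaded checkpoint (placeholder path).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to("cuda" if torch.cuda.is_available() else "cpu")

# Read an image (OpenCV loads BGR; the predictor expects RGB) and compute its
# embedding once with the heavy image encoder.
predictor = SamPredictor(sam)
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Prompt with a single foreground point (label 1 = foreground, 0 = background).
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,  # return several candidate masks for an ambiguous prompt
)
```

Because the image embedding is computed once in set_image, additional prompts on the same image only re-run the lightweight decoder, which is what makes interactive clicking feel instant.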
Training the Model
The model was trained on a dataset of 11 million images and 1.1 billion segmentation masks. Training used a model-in-the-loop "data engine": annotators used the model to interactively annotate images, and the newly annotated data was then used to update the model.
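The fully automatic final stage of this pipeline is mirrored by the automatic mask generator shipped with the released code, which prompts the model with a grid of points and filters the candidate masks. A hedged sketch follows; the thresholds shown are the library's documented defaults, not necessarily the exact settings used to build the dataset, and the file paths are placeholders.

```python
import cv2
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # placeholder path

# Prompts the model with a regular grid of points and keeps high-confidence,
# stable, de-duplicated masks; these thresholds are the library defaults.
mask_generator = SamAutomaticMaskGenerator(
    sam,
    points_per_side=32,
    pred_iou_thresh=0.88,
    stability_score_thresh=0.95,
)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # list of dicts: "segmentation", "area", "bbox", ...
```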
Technical Details
- Model structure: a ViT-H image encoder, a prompt encoder, and a lightweight transformer-based mask decoder
- Platforms: the image encoder is implemented in PyTorch and requires a GPU for efficient inference; the prompt encoder and mask decoder can run directly in PyTorch or be exported to ONNX and run efficiently on CPU or GPU across a variety of platforms that support ONNX Runtime (see the sketch after this list)
- Model size: the image encoder has 632M parameters, while the prompt encoder and mask decoder have 4M parameters
- Inference time: the image encoder takes ~0.15 seconds on an NVIDIA A100 GPU, while the prompt encoder and mask decoder take ~50ms on CPU in the browser using multithreaded SIMD execution
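To illustrate the PyTorch/ONNX split, here is a hedged sketch: the export script and flags come from the facebookresearch/segment-anything repository's README, and the file names are placeholders, so check the README for the exact options in your version.

```python
# First, export the lightweight prompt encoder + mask decoder to ONNX from a
# checkout of facebookresearch/segment-anything (per the repo README):
#
#   python scripts/export_onnx_model.py --checkpoint sam_vit_h_4b8939.pth \
#       --model-type vit_h --output sam_decoder.onnx

import onnxruntime as ort

# The heavy ViT-H image encoder stays in PyTorch (ideally on a GPU); only the
# small prompt encoder + mask decoder runs here, so CPU execution is fast
# enough for interactive use.
session = ort.InferenceSession("sam_decoder.onnx", providers=["CPUExecutionProvider"])

# Inspect the inputs the exported decoder expects (e.g. the precomputed image
# embedding plus the prompt points/labels).
for inp in session.get_inputs():
    print(inp.name, inp.shape)
```

This split is what keeps the demo responsive: the image embedding is computed once per image, after which each new prompt needs only the ~50ms decoder pass.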
Helpful Tips
- Use the demo to try out the model and see how it works
- Read the paper and blog post to learn more about the model and its applications
- Explore the dataset and code on GitHub to learn more about the model and its training process
Frequently Asked Questions
- What type of prompts are supported? Foreground/background points, bounding boxes, and masks; text prompts are explored in the paper, but that capability is not released
- What is the structure of the model? A ViT-H image encoder, a prompt encoder, and a lightweight transformer-based mask decoder
- What platforms does the model use? PyTorch and ONNX
- How big is the model? The image encoder has 632M parameters, while the prompt encoder and mask decoder have 4M parameters
- How long does inference take? The image encoder takes ~0.15 seconds on an NVIDIA A100 GPU, while the prompt encoder and mask decoder take ~50ms on CPU in the browser using multithreaded SIMD execution