Meta’s SAM 2: The Next Generation of Video And Image Segmentation
In Line With Mark Zuckerberg’s Vision for Open Source AI
Meta has just announced the Meta Segment Anything Model 2 (SAM 2), the latest iteration of their cutting-edge object segmentation model. SAM 2 extends its impressive capabilities to both videos and images and is available under the Apache 2.0 license, making it accessible for use in various projects. In addition, Meta is releasing the SA-V dataset under a CC BY 4.0 license and offering a web-based demo for users to experience the model in action.
Advancing Object Segmentation
Building on the foundation laid by the original SAM, which focused on segmenting objects in images, SAM 2 takes a significant leap forward.
It is the first model to provide real-time, promptable object segmentation across both images and videos, enabling seamless integration into a wide range of applications. SAM 2 not only outperforms previous models in image segmentation accuracy but also achieves superior video segmentation performance, requiring only a third of the interaction time of prior approaches.
Its zero-shot generalization capabilities allow it to segment any object in any video or image without needing custom adaptations.
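To make “promptable” concrete, the sketch below shows how a single foreground click could drive SAM 2 on a still image. This is a minimal illustration assuming the image-predictor interface in Meta’s open-sourced sam2 code (build_sam2, SAM2ImagePredictor); the checkpoint and config paths are placeholders, so the exact names should be checked against the released repository.
```python
# Minimal sketch: point-prompted image segmentation with SAM 2.
# Assumes the sam2 package from Meta's released repository; the config and
# checkpoint paths below are placeholders, not canonical file names.
import numpy as np
import torch
from PIL import Image

from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

checkpoint = "checkpoints/sam2_large.pt"  # placeholder checkpoint path
model_cfg = "sam2_large.yaml"             # placeholder config name
predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))

image = np.array(Image.open("photo.jpg").convert("RGB"))

with torch.inference_mode():
    predictor.set_image(image)
    # A single positive click (x, y) is enough to prompt a mask;
    # multimask_output returns several candidates to resolve ambiguity.
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),
        multimask_output=True,
    )

best_mask = masks[np.argmax(scores)]  # keep the highest-scoring candidate
print(best_mask.shape, scores)
```
The same call accepts boxes or masks as prompts instead of points, which is what “promptable” means in practice: the prompt replaces any task-specific fine-tuning.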
Revolutionizing Segmentation with SAM
Before SAM’s release, creating accurate object segmentation models required technical specialists with access to advanced AI training infrastructure and large volumes of annotated data.
SAM revolutionized this process, enabling a broad range of real-world image segmentation tasks through prompting techniques.
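For context, the prompting workflow the original SAM introduced looks roughly like the sketch below, which assumes the segment-anything package and a placeholder checkpoint path: a rough bounding box stands in for any task-specific training.
```python
# Minimal sketch: prompting the original SAM with a rough bounding box,
# replacing task-specific training. Uses the segment-anything package;
# the checkpoint path is a placeholder.
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # placeholder
predictor = SamPredictor(sam)

image = np.array(Image.open("photo.jpg").convert("RGB"))
predictor.set_image(image)

# Prompt with an approximate box (x0, y0, x1, y1) around the object.
masks, scores, _ = predictor.predict(
    box=np.array([100, 100, 400, 300]),
    multimask_output=False,
)
print(masks.shape, scores)  # one binary mask and its predicted quality score
```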
Since its launch, SAM has significantly impacted multiple disciplines, inspiring new AI-powered features in Meta’s apps and diverse applications in science, medicine, and other fields.
Leading data annotation platforms now use SAM as the default tool for object segmentation, saving millions of hours of manual annotation work.
Mark Zuckerberg’s Vision for Open Source AI
In a recent open letter, Mark Zuckerberg emphasized the transformative potential of open source AI.
He highlighted that it can dramatically increase human productivity, creativity, and quality of life, while also driving economic growth and advancing groundbreaking medical and scientific research.
Meta is excited about the progress the AI community has made with SAM and believes SAM 2 will unlock even more exciting possibilities.
In line with their commitment to open science, Meta is sharing their research on SAM 2 with the community. The resources being released include:
SAM 2 code and weights: Open-sourced under the Apache 2.0 license, with evaluation code available under a BSD-3 license.
SA-V dataset: A dataset 4.5 times larger in videos and with 53 times more annotations than the previous largest video segmentation dataset, including around 51,000 real-world videos with over 600,000 masklets.
Web demo: Allows real-time interactive segmentation of short videos and applies video effects based on model predictions.
How SAM 2 Is Built

SAM 2 extends Meta’s object segmentation model to both images and videos. By treating an image as a single-frame video, it handles both input types with one architecture, using a memory mechanism to recall information from previously processed frames so segmentation stays accurate over time.
The model builds on SAM’s ability to take points, boxes, or masks as prompts for predicting segmentation masks, and lets those predictions be refined iteratively across video frames. Key design elements include promptable segmentation, a mask decoder, and a streaming architecture that processes frames one at a time, enabling real-time processing and practical applications such as robotics.
SAM 2 also handles ambiguity by predicting multiple candidate masks, and a dedicated “occlusion head” helps it cope with frames in which the target object is occluded or out of view.
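A hedged sketch of how this streaming, memory-based design might be driven with the released code is shown below: one frame is prompted with a click, and the model propagates the resulting masklet through the remaining frames. It assumes the video-predictor entry points in Meta’s sam2 repository (build_sam2_video_predictor, init_state, add_new_points, propagate_in_video) and placeholder paths; exact names should be verified against the released code.
```python
# Hedged sketch: interactive video segmentation with the SAM 2 video predictor.
# Assumes the sam2 package from Meta's released repository; all paths and the
# exact entry-point names should be checked against the released code.
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

checkpoint = "checkpoints/sam2_large.pt"  # placeholder checkpoint path
model_cfg = "sam2_large.yaml"             # placeholder config name
predictor = build_sam2_video_predictor(model_cfg, checkpoint)

with torch.inference_mode():
    # The video is read as a directory of frames; the inference state holds
    # the streaming memory the model uses to recall earlier frames.
    state = predictor.init_state(video_path="video_frames/")

    # Prompt frame 0 with one positive click on the target object.
    predictor.add_new_points(
        state, frame_idx=0, obj_id=1,
        points=np.array([[210, 350]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )

    # Propagation: the memory mechanism carries the object identity forward,
    # producing a masklet (a binary mask per frame) for the rest of the video.
    masklets = {}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masklets[frame_idx] = (mask_logits[0] > 0.0).cpu().numpy()
```
Because the predictor keeps its state around, additional clicks can be added on later frames to correct mistakes, which is the interactive refinement loop described above.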
To support SAM 2, Meta developed the SA-V dataset using an interactive model-in-the-loop setup with human annotators, resulting in a significantly larger dataset with around 51,000 videos and over 600,000 masklets.
Trained on the SA-1B image dataset, the SA-V dataset, and an additional internal licensed video dataset, SAM 2 excels in both image and video segmentation tasks, offering superior performance, faster processing, and better handling of complex scenarios compared to previous models.
Key highlights include:
Outperforming previous approaches on interactive video segmentation across 17 zero-shot video datasets.
Achieving better results than SAM on its 23-dataset zero-shot benchmark suite, while being six times faster.
Excelling at existing video object segmentation benchmarks.
Real-time inference at approximately 44 frames per second.
Fairness evaluation showing minimal performance discrepancy across demographic groups.
Limitations and Future Work
While SAM 2 demonstrates strong performance, there are areas for improvement, particularly in challenging scenarios like drastic camera viewpoint changes, long occlusions, crowded scenes, and extended videos.
In these cases, SAM 2’s interactive design allows manual intervention to recover and continue tracking target objects accurately.
Future improvements could include enhancing temporal smoothness and further automating the data annotation process.
Putting SAM 2 to Work
Meta’s collaboration with the Amazon SageMaker team enabled the successful release of SAM 2, pushing the boundaries of what’s possible on AWS AI infrastructure.
This partnership allows Meta to focus on building state-of-the-art AI models and creating unique AI demo experiences.
Conclusion
SAM 2 represents a significant advancement in object segmentation, applicable to both static images and dynamic video content.
By sharing this technology and the accompanying resources with the AI community, Meta hopes to accelerate open science and inspire new innovations and use cases that benefit people and society.
If you want more updates related to AI, subscribe to our Newsletter