Google Launches Multimodal Gemini Embedding 2
- Google DeepMind debuts Gemini Embedding 2, a natively multimodal model for unified media processing.
- The system maps text, images, video, and audio into a single, shared semantic embedding space.
- Matryoshka Representation Learning integration allows flexible dimension scaling to optimize performance and storage costs.
Google has introduced Gemini Embedding 2, a model that maps text, images, video, audio, and documents into a single, unified embedding space. Unlike previous systems that required a separate model for each media type, this natively multimodal approach captures relationships across media types and supports more than 100 languages. That simplifies AI pipelines and makes it easier for developers to build tools such as semantic search engines that search the content of a video alongside a related technical document.
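To make the shared embedding space concrete, here is a minimal sketch of cross-modal search in Python. It does not call the Gemini API: the four-dimensional vectors, file names, and the `search` helper are all hypothetical stand-ins for embeddings a model would return. The point is only that once every media type lives in the same space, one similarity function can rank all of them against a single query.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical hand-made vectors standing in for real embeddings.
# In a real pipeline each vector would come from the embedding model.
INDEX = [
    ("video", "install_walkthrough.mp4", [0.9, 0.1, 0.0, 0.2]),
    ("document", "install_guide.pdf", [0.8, 0.2, 0.1, 0.1]),
    ("audio", "quarterly_call.wav", [0.0, 0.1, 0.9, 0.3]),
]

def search(query_vec, index, top_k=2):
    """Rank every item, whatever its media type, by similarity to the query."""
    scored = [(cosine(query_vec, vec), kind, name) for kind, name, vec in index]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:top_k]

# Hypothetical embedding of a text query about installation.
query = [0.85, 0.15, 0.05, 0.15]
results = search(query, INDEX)
for score, kind, name in results:
    print(f"{score:.3f}  {kind:<8}  {name}")
```

With these toy values, the video and the document about installation rank above the unrelated audio clip, even though the three items are different media types.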
The model accepts up to 8,192 text tokens, 120 seconds of video, and multi-page documents. One of its standout features is Matryoshka Representation Learning (MRL), a technique that lets developers scale down the numerical representations, known as embeddings, from the default 3,072 dimensions to smaller sizes. This lets developers balance retrieval performance against storage costs, which matters most for large-scale applications that manage massive datasets.
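The storage trade-off is easy to see with a little arithmetic. The sketch below assumes that an MRL-trained embedding can be shortened by keeping its leading values, and that the shortened vector is rescaled to unit length before use, which is common practice rather than something the announcement spells out. The target size of 768, the 4-byte float storage, and the helper names are illustrative assumptions, and the input vector is placeholder data, not a real embedding.

```python
import math

def truncate_embedding(vec, dims):
    """Keep the first `dims` values and rescale the result to unit length."""
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        raise ValueError("cannot normalize an all-zero vector")
    return [x / norm for x in head]

def storage_bytes(num_vectors, dims, bytes_per_value=4):
    """Raw storage for the vectors alone, assuming 4-byte floats."""
    return num_vectors * dims * bytes_per_value

# Placeholder values standing in for a 3,072-dimension embedding.
full = [math.sin(i + 1) for i in range(3072)]
small = truncate_embedding(full, 768)  # 768 is an illustrative target size

print(len(small))                         # 768
print(storage_bytes(1_000_000, 3072))     # 12288000000 bytes, about 12.3 GB
print(storage_bytes(1_000_000, 768))      # 3072000000 bytes, about 3.1 GB
```

Under these assumptions, cutting the dimension count to a quarter cuts raw vector storage to a quarter as well. How much retrieval quality is lost at a given size depends on the model and the dataset, so it is worth measuring on your own data.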
Gemini Embedding 2 ingests audio directly, without needing text transcripts, and processes interleaved inputs, in which images and text are analyzed together, so it handles information closer to the way people encounter it. The model is currently available in public preview via the Gemini API and Vertex AI and is already being integrated into popular developer frameworks. The release is a step toward multimodal AI experiences that can work with the diverse, unstructured data of the real world.