TL;DR: Google has released Gemini Embedding 2 in public preview (since March 10, 2026) — the first natively multimodal embedding model that encodes text, images, video, audio, and PDF documents into a single unified vector space. It supports Matryoshka Representation Learning for flexible output dimensions (3072/1536/768), handles inputs up to 8,192 text tokens, six images, 120 seconds of video, and six-page PDFs in one call, and has already cut retrieval latency by up to 70% for early enterprise adopters. Available now via the Gemini API and Vertex AI.
Table of contents
- What changed: the shift from siloed to unified embeddings
- How the multimodal architecture works
- Technical specifications in full
- Matryoshka dimensions: flexible vectors by design
- Latency improvements: where the 70% number comes from
- How to access it: Gemini API and Vertex AI
- Enterprise use cases driving adoption
- How it compares to OpenAI, Cohere, and Voyage AI
- Developer integration guide
- What this means for search and RAG systems
- 15 frequently asked questions
What changed: the shift from siloed to unified embeddings
The embedding model market has operated on a simple assumption for years: different modalities need different models. You use one model to embed text, a separate CLIP-style model for images, a video encoder for video frames, and so on. Then you reconcile the resulting vectors — usually with some alignment layer or late-fusion trick — to do cross-modal retrieval.
Google's Gemini Embedding 2 challenges that assumption directly. Instead of using separate encoder towers per modality and attempting post-hoc alignment, this model was trained from the ground up to map all five modalities — text, images, video, audio, and documents — into a single shared vector space. You embed a photo and a paragraph about that photo, and the resulting vectors land close together in geometric terms without any special bridging logic.
That is not a trivial engineering accomplishment. It requires the training objective, the data mixture, and the architecture to jointly account for the fact that "a red car" as a text string and a JPEG of a red car are semantically the same thing, even though the raw input formats are completely different. Models trained on one modality at a time never learn that correspondence natively; they approximate it through projection layers trained afterward.
What Google has shipped here is the productized output of years of work on Gemini's multimodal foundation, applied to the specific problem of dense retrieval and semantic search. The bet is that enterprises running multimodal data pipelines — product catalogs, media archives, support ticket systems with screenshots, video content libraries — no longer need to maintain separate embedding infrastructure per content type.
The practical implication is significant: one model, one vector index, one similarity search over everything.
How the multimodal architecture works
Google has not published a detailed technical paper accompanying the Gemini Embedding 2 launch, but the architecture details disclosed through its blog and API documentation give a clear enough picture of how the system works.
The model is built on the Gemini family's core multimodal encoder, which processes interleaved sequences of tokens regardless of modality. Text is tokenized conventionally. Images are encoded as patch sequences using a vision encoder that feeds into the same transformer backbone. Video is handled as a sequence of sampled frames, with temporal ordering preserved in the embedding. Audio is processed through a spectrogram-based encoding pathway before entering the shared representation stack. PDFs are rendered as a combination of extracted text and page images, allowing the model to handle both text-heavy documents and visually structured layouts like tables and charts.
The key architectural decision is that all these modality encoders feed into a shared representation trunk — the same transformer layers that produce the final embedding vector. This is what makes the space genuinely unified rather than retrofitted. The final pooled vector reflects a common semantic geometry regardless of the input format.
Training used a contrastive objective across modalities: positive pairs consisted of semantically related content across different formats (a video clip and its transcript, an image and its caption, a document and a related search query), while negatives were drawn from unrelated content. This forces the model to learn alignment through training signal rather than post-processing.
The result is that a query expressed in natural language can retrieve images, video segments, audio clips, or documents — and vice versa. A product image can retrieve matching product descriptions. A PDF slide can retrieve related video presentations. Cross-modal semantic search becomes a standard vector similarity operation.
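Once everything lives in one space, cross-modal retrieval reduces to the same nearest-neighbor math used for text-only search. A minimal sketch with synthetic vectors standing in for real model outputs (the values here are illustrative random vectors, not actual Gemini Embedding 2 embeddings — in practice each one would come from a single embedContent call, whatever the input modality):

```python
import numpy as np

def cosine_sim(query: np.ndarray, matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity between a query vector and each row of a matrix."""
    q = query / np.linalg.norm(query)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return m @ q

rng = np.random.default_rng(0)
dim = 768

# Synthetic stand-ins for embeddings of mixed-modality items in one index.
index = {
    "troubleshooting_video.mp4": rng.normal(size=dim),
    "dashboard_screenshot.png": rng.normal(size=dim),
    "setup_guide.pdf": rng.normal(size=dim),
}

# Simulate a text query whose embedding lands near the screenshot's.
query_vec = index["dashboard_screenshot.png"] + 0.1 * rng.normal(size=dim)

names = list(index)
scores = cosine_sim(query_vec, np.stack([index[n] for n in names]))
best = names[int(np.argmax(scores))]
print(best)  # nearest item, regardless of its modality
```

The retrieval layer never inspects content types; ranking is a single similarity computation over one index.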
Technical specifications in full
The input constraints for Gemini Embedding 2 are specific and worth noting before you start building:
Text: Up to 8,192 tokens per input. This covers most document chunks, long-form queries, and extended paragraphs without needing to split aggressively.
Images: Up to 6 images per call, in PNG or JPEG format. Multiple images in a single embedding call are treated as a joint semantic unit — useful for product listings with multiple photos or multi-page visual documents.
Video: Up to 120 seconds (2 minutes) of MP4 or MOV video per call. The model samples frames at intervals and processes the temporal sequence. For longer video content, you would need to chunk into 2-minute segments and index those separately.
Audio: Native audio input is supported, though specific codec and duration constraints are documented separately in the API reference. This is notable — most multimodal embedding systems treat audio as a second-class citizen or require preprocessing into spectrograms or transcripts before embedding.
Documents: PDFs of up to 6 pages per call. A multi-page PDF is handled as a unified semantic unit, not as individual page embeddings concatenated. This makes it practical to embed standard business documents — contracts, reports, presentations — without pre-splitting them.
Output dimensions: 3072 (full), 1536 (medium), 768 (compact). Selectable at inference time via Matryoshka truncation (more on this below).
Availability: Public preview since March 10, 2026. Available via the Gemini Developer API and Google Cloud Vertex AI.
Matryoshka dimensions: flexible vectors by design
Matryoshka Representation Learning (MRL) is a training technique that embeds a nesting property into the vector output: the first N dimensions of a 3072-dimensional vector are themselves a valid, semantically coherent embedding at dimension N, for any N in a specified set.
This matters practically because vector databases have real storage and compute costs at scale. A 3072-dimensional vector takes up roughly 12KB at float32. At 100 million documents, that is 1.2 terabytes of vector storage before indexing overhead. Dropping to 768 dimensions reduces that to 300GB — a 75% reduction — with some precision loss but often acceptable recall at the application level.
With MRL, you get that flexibility without re-embedding. You train once at 3072 dimensions. At query time, you truncate to 1536 or 768 and run similarity search at the reduced dimension. The model was explicitly optimized for performance at all three checkpoints, so truncation does not cause the kind of catastrophic quality degradation you would see if you simply dropped the last dimensions of a standard embedding.
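Because the nesting property is baked in at training time, downsizing is just slicing plus renormalization on the client side. A sketch of that operation (renormalizing after truncation is standard practice for cosine search, not something specific to this model):

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Truncate a Matryoshka embedding to `dim` dimensions and
    re-normalize so cosine similarity stays well-behaved."""
    if dim > vec.shape[-1]:
        raise ValueError(f"cannot expand a {vec.shape[-1]}-d vector to {dim}")
    small = vec[..., :dim]
    return small / np.linalg.norm(small, axis=-1, keepdims=True)

# Stand-in for a real 3072-d embedding.
full = np.random.default_rng(1).normal(size=3072)

for d in (3072, 1536, 768):
    v = truncate_embedding(full, d)
    print(d, v.shape)  # each slice is a valid unit-length embedding
```

Store vectors once at 3072 and apply this at query or indexing time to move between the supported tiers without re-embedding.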
The practical workflow this enables: start with 768 dimensions in development or for less critical retrieval paths, then switch to 3072 for high-stakes production use cases like legal document search or financial report retrieval where recall quality is non-negotiable. No re-indexing required when you want to compare performance tiers — you already have the data.
For enterprises with existing vector infrastructure on a specific dimension count, the three options also increase compatibility with common index configurations without requiring padding or projection.
Latency improvements: where the 70% number comes from
Google has cited a 70% latency reduction for some customers, which is a significant claim. Understanding where it comes from requires thinking about what the alternative architecture looked like.
An enterprise with a multimodal data pipeline before Gemini Embedding 2 would typically run multiple embedding models: a text embedding model (perhaps Google's own text-embedding-004 or a third-party like Cohere or Voyage), an image embedding model (CLIP or a fine-tuned variant), and potentially a separate document parser that extracts text before embedding. Each of these is a separate inference call, often to separate endpoints, with separate model loading overhead and separate API round trips.
To embed a support ticket that includes a text description and three screenshots, you might make four separate embedding calls: one for text, three for images. Then you store four vectors and manage the multi-vector retrieval logic yourself.
Gemini Embedding 2 collapses that into a single API call. One call, one vector, one index entry. The latency saving is not primarily from the model being faster per token — it is from eliminating the serial or parallel overhead of multiple API calls and their associated network, queuing, and initialization costs.
For high-throughput pipelines processing thousands of mixed-modality documents per hour, this matters considerably. The total wall-clock time for embedding a batch of 10,000 mixed-content items can drop from multiple minutes to under a minute depending on the previous architecture.
The 70% figure likely reflects the most favorable comparison — customers who were running the most fragmented multi-model stacks before. Customers who were already using a reasonably optimized multimodal pipeline, or who are primarily text-heavy, will see smaller gains.
How to access it: Gemini API and Vertex AI
Gemini Embedding 2 is available through two routes:
Gemini Developer API: Accessible via https://generativelanguage.googleapis.com. The model identifier is gemini-embedding-2-0. Authentication uses standard Google API keys. Rate limits apply based on your Google Cloud project tier.
Vertex AI: For enterprise deployments with existing Google Cloud contracts, Vertex AI offers Gemini Embedding 2 with enterprise SLAs, private endpoints, VPC Service Controls, and CMEK (Customer-Managed Encryption Keys) support. Billing is through Google Cloud usage-based pricing.
The API interface follows the standard embedContent structure. For multimodal inputs, the request body accepts a parts array where each part specifies its content type — text string, inline image data, file URI for video or audio from Google Cloud Storage, or file data for PDFs. The response returns a single embedding vector at the requested dimension.
For production workloads, Vertex AI is the recommended path. The Developer API is primarily for experimentation and prototyping. The public preview status means the API surface may change before general availability, though Google's track record suggests the core interface is stable.
Enterprise use cases driving adoption
The design of Gemini Embedding 2 maps onto several enterprise pain points that have been difficult to address cleanly with separate-model approaches:
Multimodal RAG systems: Retrieval-Augmented Generation pipelines that need to pull context from mixed-format knowledge bases — text documents, PDF manuals, instructional videos, product images — can now use a single embedding index. A support chatbot that needs to retrieve relevant troubleshooting videos, product photos, and text documentation in response to a customer query no longer requires separate retrieval pipelines per content type.
E-commerce product search: Product catalogs contain images, text descriptions, specifications, and often video demonstrations. Cross-modal semantic search — where a shopper's text query retrieves relevant products based on image similarity as well as text match — has previously required complex multi-tower retrieval architectures. A unified embedding space simplifies this to standard approximate nearest neighbor search.
Media archive retrieval: News organizations, production studios, and media companies maintain large archives with video, audio, image, and text content. Unified embeddings allow semantic search across all formats simultaneously — finding all content related to a specific event, person, or topic regardless of whether it is a video clip, a transcript, an image, or an article.
Compliance and legal document search: Legal teams deal with documents that combine text, tables, charts, and scanned pages. Unified PDF embedding handles visually complex documents that pure text extraction pipelines would mangle.
Content moderation: Classifying content that combines text and images — social media posts, forum threads, product reviews with photos — into semantic categories is a natural fit for unified embeddings, since the model captures the joint meaning of the combined inputs.
How it compares to OpenAI, Cohere, and Voyage AI
The embedding model market has several serious competitors, each with distinct positioning:
OpenAI text-embedding-3-large: OpenAI's current flagship embedding model is text-only. It supports Matryoshka-style variable dimensions (introduced in early 2024) and performs well on text retrieval benchmarks. It does not support image, video, audio, or document inputs natively. For multimodal use cases, developers using OpenAI need to combine it with CLIP or a separate vision encoder.
Cohere embed-v3.5: Cohere's latest embedding model is primarily text-focused with some image support added recently. It is competitive on text retrieval benchmarks and is popular in enterprise deployments for its strong multilingual coverage and its integration with Cohere's reranking models. It does not offer native video, audio, or full document embedding in a single model.
Voyage AI voyage-multimodal-3: Voyage AI has been building specifically toward multimodal embeddings and is arguably the closest direct competitor to Gemini Embedding 2. voyage-multimodal-3 supports text and images in a unified space and has strong benchmark performance. It does not yet support video or audio natively, and document handling requires preprocessing.
The gap: Gemini Embedding 2 is currently the only commercially available embedding model that handles all five modalities — text, images, video, audio, and documents — in a single inference call with a unified vector space. That is a meaningful lead in terms of capability coverage. The question for enterprises is whether Google's infrastructure, pricing, and reliability at scale match the capability advantage. That is harder to evaluate in public preview.
OpenAI and Cohere will almost certainly close the multimodal gap within the year. The competitive dynamics here resemble the text embedding race of 2023-2024: Google or a specialized player ships a capability, and the hyperscale players follow within 6-12 months.
Developer integration guide
Getting started with Gemini Embedding 2 is straightforward if you have used Google's embedding APIs before. The following covers the key integration patterns.
Basic text embedding:
Use the gemini-embedding-2-0 model with a standard embedContent request. The output dimension defaults to 3072; specify outputDimensionality in the request to get 1536 or 768.
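A minimal sketch of a text-only request against the endpoint named earlier. The model identifier and field names reflect this article's description of the preview; verify them against the current API reference before relying on them (the request body is built and shown here, not sent):

```python
import json

# Endpoint and model id as described in this article; confirm before use.
API_URL = ("https://generativelanguage.googleapis.com/v1beta/models/"
           "gemini-embedding-2-0:embedContent")

def text_embed_body(text: str, dim: int = 3072) -> dict:
    """Request body for a text embedding at a chosen Matryoshka dimension."""
    if dim not in (3072, 1536, 768):
        raise ValueError("supported output dimensions: 3072, 1536, 768")
    return {
        "content": {"parts": [{"text": text}]},
        "outputDimensionality": dim,  # omit to get the 3072-d default
    }

body = text_embed_body("quarterly revenue by region", dim=1536)
print(json.dumps(body, indent=2))
# Send with e.g. requests.post(API_URL, params={"key": API_KEY}, json=body);
# the response carries one embedding vector at the requested dimension.
```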
Image embedding:
Pass image data as base64-encoded inline data or as a Google Cloud Storage URI in the parts array of your request. Set the MIME type to image/png or image/jpeg. Up to six images can be included in a single parts array for a joint embedding.
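A sketch of the inline-image shape described above — base64 payload plus MIME type, up to six per call for a joint embedding. The bytes below are placeholders and the exact field casing is an assumption to check against the API reference:

```python
import base64

def image_part(data: bytes, mime: str = "image/png") -> dict:
    """One image entry for the parts array: base64 inline data + MIME type."""
    if mime not in ("image/png", "image/jpeg"):
        raise ValueError("supported formats: PNG, JPEG")
    return {"inline_data": {"mime_type": mime,
                            "data": base64.b64encode(data).decode("ascii")}}

def joint_image_body(images: list[bytes]) -> dict:
    """Embed up to six images as one joint semantic unit (one vector out)."""
    if len(images) > 6:
        raise ValueError("at most 6 images per call")
    return {"content": {"parts": [image_part(b) for b in images]}}

# Placeholder bytes standing in for real PNG files.
body = joint_image_body([b"<front>", b"<back>", b"<side>"])
print(len(body["content"]["parts"]))
```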
Video embedding:
Video must be provided as a GCS URI pointing to an MP4 or MOV file. Direct upload of video bytes is not supported in the current preview. Maximum 120 seconds per call; for longer videos, segment and embed separately.
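For footage past the two-minute ceiling, a small helper can produce the segment boundaries to cut at (and later store as temporal metadata). This is ordinary client-side chunking, not part of the API:

```python
def video_segments(duration_s: float,
                   max_len_s: float = 120.0) -> list[tuple[float, float]]:
    """Split a video's duration into (start, end) windows of at most
    max_len_s seconds, for embedding each clip separately."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    segments = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_len_s, duration_s)
        segments.append((start, end))
        start = end
    return segments

print(video_segments(300))  # a 5-minute video becomes three clips
# [(0.0, 120.0), (120.0, 240.0), (240.0, 300.0)]
```

Store each (start, end) pair alongside its embedding so retrieval results can point back into the original video.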
PDF embedding:
Pass PDFs as file data with MIME type application/pdf. Up to six pages per call. For longer documents, chunk at page boundaries and embed each chunk separately, storing the page range in metadata for retrieval context.
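Page-range chunking for longer documents can be sketched the same way. The helper below only computes the ranges and the metadata labels; actual page extraction is left to whatever PDF library you already use:

```python
def pdf_page_chunks(num_pages: int, pages_per_chunk: int = 6) -> list[dict]:
    """1-indexed page ranges for splitting a long PDF into chunks of at
    most 6 pages, keeping the range as retrieval metadata."""
    if num_pages < 1:
        raise ValueError("num_pages must be at least 1")
    chunks = []
    for first in range(1, num_pages + 1, pages_per_chunk):
        last = min(first + pages_per_chunk - 1, num_pages)
        chunks.append({"pages": (first, last),
                       "label": f"pages {first}-{last}"})
    return chunks

print([c["pages"] for c in pdf_page_chunks(14)])
# [(1, 6), (7, 12), (13, 14)]
```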
Mixed-modality embedding:
The most powerful pattern is combining modalities in a single call. A parts array can include a text string, one or more images, and additional context to produce a single vector that captures the joint semantic meaning.
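A sketch of that combined call: text plus images in one parts array, yielding one vector for the joint meaning. Field names are assumptions drawn from the request shape this article describes, and the image bytes are placeholders:

```python
import base64

def mixed_parts_body(text: str, images: list[bytes]) -> dict:
    """One embedContent-style body combining a text string with images,
    so the returned vector captures their joint semantic meaning."""
    parts = [{"text": text}]
    parts += [{"inline_data": {"mime_type": "image/jpeg",
                               "data": base64.b64encode(b).decode("ascii")}}
              for b in images]
    return {"content": {"parts": parts}}

# A product listing: one description plus two photos -> one joint vector.
body = mixed_parts_body("Waterproof trail running shoe with reinforced toe",
                        [b"<photo-1>", b"<photo-2>"])  # placeholder bytes
print(len(body["content"]["parts"]))  # 3 parts in, one embedding back
```

Compare this with the siloed approach: the same listing would previously have produced three vectors from two different models, plus merge logic downstream.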
Choosing dimensions:
Start with 768 for development and cost sensitivity. Move to 1536 or 3072 when evaluating recall quality for your specific data distribution. Run offline recall benchmarks on a representative sample of your data before committing to an index dimension at scale.
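That offline recall check can run entirely client-side once you have a labeled sample: embed once at 3072, truncate, and compare tiers. A sketch with synthetic vectors standing in for real embeddings (synthetic data will not show realistic degradation; the point is the measurement harness):

```python
import numpy as np

def recall_at_k(queries, docs, relevant, k=5):
    """Fraction of queries whose labeled relevant doc appears among the
    top-k cosine neighbors. relevant[i] is the doc index for query i."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    topk = np.argsort(-(q @ d.T), axis=1)[:, :k]
    return float(np.mean([relevant[i] in topk[i] for i in range(len(q))]))

rng = np.random.default_rng(2)
docs = rng.normal(size=(200, 3072))                       # stand-in embeddings
queries = docs[:50] + 0.3 * rng.normal(size=(50, 3072))   # noisy relatives
relevant = np.arange(50)

for dim in (3072, 1536, 768):
    score = recall_at_k(queries[:, :dim], docs[:, :dim], relevant)
    print(dim, score)  # compare quality tiers before fixing an index dimension
```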
Vector database compatibility:
The output vectors are standard float32 arrays. All major vector databases — Pinecone, Weaviate, Qdrant, pgvector, Vertex AI Vector Search — accept float32 arrays directly. No special integration layer is required.
What this means for search and RAG systems
The architectural implications of a unified multimodal embedding model for search and RAG systems are substantial, and worth thinking through carefully rather than treating as a drop-in replacement for existing pipelines.
Single-index multimodal retrieval: The most immediate implication is that you can maintain one vector index for all content types. Previously, a RAG system over a mixed-content knowledge base required either multiple separate indexes with a query routing layer, or a multi-vector retrieval strategy that fetched from each index and merged results. A single unified index simplifies both the architecture and the operational overhead.
Cross-modal query understanding: When a user asks "show me examples of the new dashboard UI from last quarter's product updates," a system backed by unified embeddings can retrieve relevant screenshots, demo videos, and product documentation simultaneously, ranked by semantic similarity to the query text. The retrieval layer does not need to know in advance that the answer might be a video rather than a document.
Chunking strategy changes: With text-only RAG, chunking strategy is primarily about managing context length — splitting documents into 512 or 1024 token chunks for embedding. With multimodal embeddings, chunking must account for modality-specific units: page boundaries for PDFs, scene boundaries for video, image groups for multi-photo content. The optimal chunking strategy becomes more content-aware.
Reranking integration: Current reranking models (Cohere Rerank, Vertex AI Rank) are primarily text-based. As multimodal retrieval matures, reranking will need to evolve to score mixed-modality result sets. For now, the most practical approach is to rerank using a text representation of each retrieved item (caption, transcript, metadata) while using the multimodal embedding for initial retrieval.
Evaluation complexity: Recall@K benchmarks for text retrieval are well-established. Evaluating multimodal retrieval quality is harder — the ground truth for "which video is most relevant to this text query" is more subjective and harder to label at scale. Production deployments will need investment in offline evaluation datasets that reflect actual query distributions.
The broader trend Gemini Embedding 2 accelerates is the shift from retrieval as a text problem to retrieval as a semantic problem that is inherently modality-agnostic. The semantic meaning of content does not change based on whether it was expressed in a paragraph or a photograph. Embedding models that treat this as a first-class property will enable fundamentally different kinds of search and knowledge management applications than those built on text-first architectures.
15 frequently asked questions
1. What is Gemini Embedding 2?
Gemini Embedding 2 is Google's natively multimodal embedding model that encodes text, images, video, audio, and PDF documents into a single unified vector space. It entered public preview on March 10, 2026.
2. What makes it "natively" multimodal?
The model was trained from the start to produce semantically aligned embeddings across all five modalities. It does not use separate encoder towers with post-hoc alignment — the shared representation space is a product of the training objective itself.
3. What are the input limits per API call?
8,192 text tokens, up to 6 images (PNG/JPEG), up to 120 seconds of video (MP4/MOV), native audio, and PDFs of up to 6 pages. Multiple modalities can be combined in a single call.
4. What output vector dimensions are available?
3072 (full), 1536 (medium), and 768 (compact). All three are valid output dimensions under the Matryoshka Representation Learning framework, meaning the model was explicitly optimized for quality at each checkpoint.
5. What is Matryoshka Representation Learning?
MRL is a training technique that makes the first N dimensions of a vector a valid lower-dimensional embedding in their own right. This allows you to truncate a 3072-dimensional vector to 768 dimensions without re-embedding, enabling flexible storage and compute tradeoffs.
6. How does the 70% latency reduction work?
The reduction comes primarily from eliminating multiple separate embedding API calls for different modalities. Customers who previously made 3-4 calls per mixed-content item can now make one, cutting network overhead and inference orchestration time.
7. Is Gemini Embedding 2 available for free?
There is a free tier via the Gemini Developer API (rate limited). Enterprise usage via Vertex AI is billed per token/input, consistent with Google Cloud's usage-based pricing model.
8. How does it compare to OpenAI's text-embedding-3-large?
OpenAI's model is text-only. Gemini Embedding 2 covers five modalities. On text retrieval benchmarks, the two are competitive; Gemini Embedding 2's differentiation is multimodal coverage, not text-only retrieval performance.
9. Can I use it with Pinecone, Weaviate, or other vector databases?
Yes. The output is a standard float32 vector array compatible with all major vector databases. No special integration layer is required.
10. What happens with video longer than 120 seconds?
You need to segment the video into clips of 120 seconds or less and embed each segment separately. Store temporal metadata (start/end time) alongside each embedding for retrieval context.
11. Does it support multilingual text?
Yes. Gemini Embedding 2 inherits the multilingual capabilities of the Gemini model family. Specific language coverage details are documented in the API reference.
12. Is the model suitable for fine-tuning on domain-specific data?
Fine-tuning support for Gemini Embedding 2 has not been announced as part of the public preview. Domain adaptation would currently require prompt engineering or retrieval-time reranking rather than model fine-tuning.
13. What is the best use case for 768-dimensional output?
High-volume retrieval where storage and query cost matter more than marginal recall quality — product search, content deduplication, approximate similarity at scale. Use 3072 where recall precision is critical.
14. How do I handle PDFs with more than 6 pages?
Split into 6-page chunks and embed each chunk separately. Store page range metadata (e.g., pages 1-6, pages 7-12) for retrieval context so your RAG system can include page provenance in generated responses.
15. When will Gemini Embedding 2 reach general availability?
Google has not announced a GA date. Public preview typically precedes GA by 3-6 months on Google Cloud services, but the timeline for this specific model has not been disclosed.
Sources: Google Blog — Gemini Embedding 2, VentureBeat coverage, MarkTechPost analysis