Gemini Multimodal Embeddings
Author: Venkata Sudhakar
Gemini Multimodal Embeddings generate vector representations from both text and images. Unlike text-only embeddings, these vectors capture semantic meaning across modalities, enabling cross-modal search - for example, finding products from a photo or describing an image in words. ShopMax India uses this capability for its visual product search feature. The Gemini embedding API supports three task types for multimodal content: RETRIEVAL_DOCUMENT, RETRIEVAL_QUERY, and SEMANTIC_SIMILARITY. Images can be embedded directly as base64 data or referenced via URI, and combined with text in the same request. The example below shows how to generate embeddings for product images, the first step towards cross-modal search over the catalogue.
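A minimal sketch of how such a request might be assembled in Python. The payload layout, the field names, and the build_embedding_request helper are illustrative assumptions for this article, not the exact Gemini API schema - consult the official embedding API reference before sending real requests:

```python
import base64

def build_embedding_request(image_bytes: bytes, text: str,
                            task_type: str = "RETRIEVAL_DOCUMENT") -> dict:
    """Pair an image (base64-encoded) with text in one embedding request body.

    NOTE: the field names below are illustrative assumptions, not the
    exact Gemini API schema.
    """
    return {
        "task_type": task_type,
        "content": {
            "parts": [
                {"text": text},
                {"inline_data": {
                    "mime_type": "image/jpeg",
                    "data": base64.b64encode(image_bytes).decode("ascii"),
                }},
            ],
        },
    }

# Pair a product photo with its title. In a real pipeline the bytes would be
# read from the catalogue image file before encoding.
request = build_embedding_request(b"<jpeg bytes>", "Samsung 55 inch 4K TV")
print(request["task_type"])
```

One common pattern with these task types: each catalogue item is embedded once with RETRIEVAL_DOCUMENT and stored, while customer queries are embedded with RETRIEVAL_QUERY at search time.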
It produces the following output:
Embedding dimensions: 768
First 5 values: [ 0.0231 -0.0412 0.0187 0.0563 -0.0298]
The example below demonstrates cross-modal search: a customer types a text query and the system finds visually matching products in the catalogue.
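A minimal sketch of the search step, assuming product embeddings have already been generated and cached. The catalogue vectors here are random placeholders standing in for real API output, and the query vector is simulated by perturbing one catalogue entry; only the cosine-similarity ranking logic reflects how the matching actually works:

```python
import math
import random

# Toy catalogue of "precomputed" image embeddings. These random placeholder
# vectors stand in for real embeddings returned by the API, one per product
# photo in the catalogue.
random.seed(0)

def rand_vec(dim: int = 768) -> list[float]:
    return [random.gauss(0.0, 1.0) for _ in range(dim)]

catalogue = {
    "Samsung 55 inch 4K TV": rand_vec(),
    "LG 43 inch Full HD TV": rand_vec(),
    "Sony 2.1ch Soundbar": rand_vec(),
}

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_embedding: list[float], top_k: int = 1) -> list[tuple[str, float]]:
    """Rank catalogue items by cosine similarity to the query embedding."""
    scored = sorted(
        ((name, cosine_similarity(query_embedding, vec))
         for name, vec in catalogue.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return scored[:top_k]

# A real query vector would come from embedding the customer's text query
# (e.g. with task_type=RETRIEVAL_QUERY); here we simulate one by lightly
# perturbing a catalogue vector so the ranking has a clear winner.
query = [x + random.gauss(0.0, 0.1)
         for x in catalogue["Samsung 55 inch 4K TV"]]
for name, score in search(query):
    print(f"{name}: similarity = {score:.4f}")
```

In production, a brute-force scan like this would be replaced by an approximate nearest-neighbour index once the catalogue grows beyond a few thousand items.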
It produces the following output:
Samsung 55 inch 4K TV: similarity = 0.8724
ShopMax India built its visual search feature using Gemini Multimodal Embeddings. Customers in Mumbai and Bangalore can photograph a product they saw in a store and instantly find the closest match in the ShopMax catalogue - increasing product discovery by 35% and reducing search abandonment significantly.