
Gemini Multimodal RAG - Text, Image and Document Retrieval

Author: Venkata Sudhakar

Multimodal RAG extends traditional text-only retrieval by indexing and retrieving across multiple content types - product images, PDF manuals, specification sheets, and text descriptions - in a unified pipeline. With Gemini, retrieved content from different modalities can be passed together in a single prompt, enabling richer and more accurate answers than text-only RAG.

The pattern works in three steps: embed all content (text and images) using the multimodal embedding model, store vectors with metadata identifying the source type, then at query time retrieve top-k results across all modalities and pass them to Gemini for synthesis. This gives ShopMax India a single search interface that works across product descriptions, user manuals, and product photos.

The example below shows how ShopMax India builds a multimodal product assistant that retrieves across text descriptions and image captions before generating a response.


It gives the following output,

Embedding functions ready

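With the embedding helpers ready, the next step builds the in-memory index. The three-item catalogue below is illustrative (one entry per modality: a text description, a PDF manual chunk, and an image represented by its caption), and the hash-based `embed` helper again stands in for Gemini's embedding model so the snippet runs on its own.

```python
import hashlib
import math

def embed(text: str, dim: int = 16) -> list[float]:
    # Hash-based stand-in for the embedding step; a real pipeline would
    # call Gemini's embedding model here instead.
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    vec = [b / 255.0 for b in digest[:dim]]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# One illustrative entry per modality, each tagged with its source type.
CATALOGUE = [
    {"id": "tv-001-desc",   "type": "text",
     "content": "Samsung QLED 55-inch 4K TV with optical audio output."},
    {"id": "tv-001-manual", "type": "pdf",
     "content": "To use external speakers, connect them to the optical audio "
                "output port on the rear panel and enable Dolby Digital in "
                "the sound settings menu."},
    {"id": "tv-001-photo",  "type": "image",
     "content": "Rear panel of a Samsung QLED TV showing the optical audio port."},
]

# In-memory index: each vector is stored alongside metadata identifying
# its source type, so retrieved results can be attributed per modality.
index = [{**item, "embedding": embed(item["content"])} for item in CATALOGUE]
print(f"Index built: {len(index)} items")
```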
It gives the following output,

Index built: 3 items

With the index ready, queries retrieve the top-k most relevant chunks across all modalities and pass them to Gemini as context for the final answer.
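A sketch of the query step, assuming the small index built earlier (rebuilt inline here so the snippet is self-contained). Cosine similarity ranks the chunks; the final Gemini call is shown commented out with an assumed client setup, since the article's quoted answer comes from Gemini synthesising the retrieved context. Note the hash-based placeholder embeddings carry no semantics, so meaningful ranking depends on real Gemini embeddings.

```python
import hashlib
import math

def embed(text: str, dim: int = 16) -> list[float]:
    # Hash-based stand-in for Gemini's embedding model (placeholder only).
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    vec = [b / 255.0 for b in digest[:dim]]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

index = [
    {"id": "tv-001-desc",   "type": "text",
     "content": "Samsung QLED 55-inch 4K TV with optical audio output."},
    {"id": "tv-001-manual", "type": "pdf",
     "content": "Connect external speakers to the optical audio output port "
                "on the rear panel and enable Dolby Digital in sound settings."},
    {"id": "tv-001-photo",  "type": "image",
     "content": "Rear panel of a Samsung QLED TV showing the optical audio port."},
]
for item in index:
    item["embedding"] = embed(item["content"])

def retrieve(query: str, k: int = 2) -> list[dict]:
    # Vectors are unit-normalised, so the dot product is cosine similarity.
    q = embed(query)
    scored = sorted(
        index,
        key=lambda it: sum(a * b for a, b in zip(q, it["embedding"])),
        reverse=True,
    )
    return scored[:k]

query = "How do I connect my Samsung QLED TV to external speakers?"
chunks = retrieve(query)
context = "\n".join(f"[{c['type']}] {c['content']}" for c in chunks)

# The retrieved chunks become context for Gemini's final answer, e.g.
# (assumed client setup, not executed here):
# response = client.models.generate_content(
#     model="gemini-2.0-flash",
#     contents=f"Answer using only this context:\n{context}\n\nQuestion: {query}",
# )
# print(response.text)
```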


It gives the following output,

To connect your Samsung QLED TV to external speakers, use the optical audio output
port located on the rear panel of the TV. Once connected, go to the sound settings
menu and enable Dolby Digital to ensure the best audio output from your speaker system.

For production at ShopMax India, replace the in-memory index with Vertex AI Vector Search and store image captions alongside original image URLs so the agent can return both text answers and relevant product images. As the catalogue grows, the same pipeline handles thousands of indexed items with millisecond retrieval latency through the managed vector search endpoint.
