
Multi-Modal LLMs - Processing Text, Images and Audio Together

Author: Venkata Sudhakar

ShopMax India receives product queries with photos, voice messages, and text descriptions through the same customer support channel. Multi-modal LLMs like GPT-4o and Gemini 1.5 Flash process all these input types in a single API call, enabling unified pipelines without separate models for each modality.

Multi-modal LLM APIs accept content as an array of typed parts within a single message. Image parts encode the image as base64 or pass a URL. Text parts can accompany any combination of other modalities. The model reasons across all inputs simultaneously, producing a single coherent response that considers both the image and the text query together.
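As a minimal sketch of this structure (field names follow the OpenAI Chat Completions message format; the question and image URL are placeholders):

```python
# A single user message whose "content" is an array of typed parts:
# one text part plus one image part referenced by URL.
message = {
    "role": "user",
    "content": [
        {"type": "text",
         "text": "Is the damage on this product covered under warranty?"},
        {"type": "image_url",
         "image_url": {"url": "https://example.com/damage-photo.jpg"}},
    ],
}
```

The same array can instead carry a base64-encoded image as a `data:` URL, and more parts can be appended as needed.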

The example below shows how ShopMax India processes a customer support query that includes a product damage photo and a text question using GPT-4o.
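A minimal sketch of such a call, assuming the official `openai` Python SDK (v1.x) with `OPENAI_API_KEY` set; the file name `damage_photo.jpg` is hypothetical, and the order ID is taken from the sample output below:

```python
import base64


def encode_image(path: str) -> str:
    """Read an image file and return its base64 encoding for the API payload."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def build_support_messages(order_id: str, question: str, image_b64: str) -> list:
    """Combine a text query and a product photo into one multi-modal request."""
    return [
        {"role": "system",
         "content": "You are a ShopMax India support agent. Assess product "
                    "damage from photos and advise on warranty coverage."},
        {"role": "user",
         "content": [
             {"type": "text", "text": f"Order {order_id}: {question}"},
             {"type": "image_url",
              "image_url": {"url": f"data:image/jpeg;base64,{image_b64}",
                            "detail": "high"}},
         ]},
    ]


if __name__ == "__main__":
    from openai import OpenAI  # assumption: openai>=1.0 SDK installed

    client = OpenAI()
    messages = build_support_messages(
        "ORD-55321",
        "My Samsung TV arrived with a cracked panel. Is this covered?",
        encode_image("damage_photo.jpg"),  # hypothetical local file
    )
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    print(response.choices[0].message.content)
```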


It gives the following output:

The image shows clear physical damage - a cracked panel on the Samsung TV.
Order ORD-55321 assessment:

Damage type: Physical/impact damage
Warranty status: Physical damage is NOT covered under standard warranty.
Recommended action: This appears to be transit damage. Please escalate to
our Mumbai logistics team for a replacement claim under the delivery damage
policy within 48 hours of receipt.

Compress images before encoding to reduce token usage - GPT-4o rescales images internally, and you are billed image input tokens based on the resulting dimensions. Use the detail: "low" option for simple categorisation tasks to cut image token cost by up to 75%, and detail: "high" only when fine visual details matter. For ShopMax India, store multi-modal conversation history carefully - image tokens in history accumulate quickly and push context costs high in long conversations.
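To see why image size matters, the token cost can be estimated ahead of time. The sketch below assumes OpenAI's published image-token accounting (a flat 85 tokens at low detail; at high detail, 85 base tokens plus 170 per 512-px tile after the image is fit within 2048x2048 and its shortest side scaled down to 768):

```python
import math


def gpt4o_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate image input tokens, assuming OpenAI's published rules:
    low detail is a flat 85 tokens; high detail rescales the image, then
    charges 85 base tokens plus 170 per 512-px tile."""
    if detail == "low":
        return 85
    # Step 1: scale down to fit within a 2048 x 2048 square.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: scale down so the shortest side is at most 768 px.
    scale2 = min(1.0, 768 / min(w, h))
    w, h = w * scale2, h * scale2
    # Step 3: count 512-px tiles covering the rescaled image.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles
```

For example, a 1024x1024 photo at high detail rescales to 768x768 (4 tiles, 765 tokens), while the same photo at low detail costs 85 tokens - roughly the "up to 75%" saving mentioned above, and the gap widens for larger images.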

