Multi-Modal LLMs - Processing Text, Images and Audio Together
Author: Venkata Sudhakar
ShopMax India receives product queries with photos, voice messages, and text descriptions through the same customer support channel. Multi-modal LLMs like GPT-4o and Gemini 1.5 Flash process all these input types in a single API call, enabling unified pipelines without separate models for each modality.
Multi-modal LLM APIs accept content as an array of typed parts within a single message. Image parts encode the image as base64 or pass a URL. Text parts can accompany any combination of other modalities. The model reasons across all inputs simultaneously, producing a single coherent response that considers both the image and the text query together.
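The typed-parts structure described above can be sketched as follows. This is a minimal illustration: the helper name and sample values are hypothetical, and the payload shape follows the OpenAI chat-completions content-parts format.

```python
import base64

def build_multimodal_message(question: str, image_bytes: bytes) -> dict:
    """Build one chat message mixing a text part and an image part.

    The image is sent inline as a base64 data URL; a plain HTTPS URL
    would work in the same "image_url" slot.
    """
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
            },
        ],
    }

# Hypothetical sample values for illustration only
msg = build_multimodal_message("Is this item damaged?", b"\xff\xd8fake-jpeg-bytes")
print(msg["content"][0]["type"], msg["content"][1]["type"])
```

Because both parts live in one message, the model sees the text and the image as a single query rather than two separate turns.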
The example below shows how ShopMax India processes a customer support query that combines a product damage photo with a text question, using GPT-4o.
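A minimal sketch of such a call, assuming the `openai` Python SDK (v1.x) and an `OPENAI_API_KEY` in the environment; the file path, order ID, and prompt wording are illustrative placeholders.

```python
import base64

def encode_image(image_path: str) -> str:
    """Read an image file and return it base64-encoded for the API payload."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def assess_damage(image_path: str, order_id: str) -> str:
    """Send a damage photo plus a text question to GPT-4o in one call."""
    from openai import OpenAI  # imported here so encode_image works without the SDK

    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "You are a ShopMax India support assistant. "
                           "Assess product damage photos and advise on warranty coverage.",
            },
            {
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": f"Order {order_id}: is this damage covered under warranty?"},
                    {"type": "image_url",
                     "image_url": {
                         "url": f"data:image/jpeg;base64,{encode_image(image_path)}",
                         "detail": "high",  # fine visual detail matters for damage assessment
                     }},
                ],
            },
        ],
        max_tokens=300,
    )
    return response.choices[0].message.content

# Requires a real image file and API key to run:
# print(assess_damage("damage_photo.jpg", "ORD-55321"))
```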
For the damage photo, the model produces output like the following:
The image shows clear physical damage - a cracked panel on the Samsung TV.
Order ORD-55321 assessment:
Damage type: Physical/impact damage
Warranty status: Physical damage is NOT covered under standard warranty.
Recommended action: This appears to be transit damage. Please escalate to
our Mumbai logistics team for a replacement claim under the delivery damage
policy within 48 hours of receipt.
Compress images before encoding to reduce token usage - GPT-4o resizes images internally but you are billed for input tokens. Use the detail: "low" option for simple categorisation tasks to reduce cost by up to 75%, and detail: "high" only when fine visual details matter. For ShopMax India, store multi-modal conversation history carefully - image tokens in history accumulate quickly and push context costs high for long conversations.
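To see why detail: "low" is so much cheaper, the image token cost can be estimated from OpenAI's published tiling rules: "low" costs a flat 85 tokens, while "high" adds 170 tokens per 512 px tile after resizing. The helper below is an illustrative sketch of those rules as documented at the time of writing, not an official calculator, and the figures may change.

```python
import math

def image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate GPT-4o vision input tokens for one image.

    Approximates OpenAI's documented rules: "low" is a flat 85 tokens;
    "high" first scales the image to fit within 2048x2048, then scales
    the shortest side down to 768 px, then charges 85 base tokens plus
    170 tokens per 512 px tile. Small images are not upscaled here.
    """
    if detail == "low":
        return 85
    # Scale to fit within a 2048 x 2048 square
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    # Scale so the shortest side is at most 768 px
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles

# A 1024x1024 product photo: flat 85 tokens at "low" vs 765 at "high"
print(image_tokens(1024, 1024, "low"), image_tokens(1024, 1024, "high"))
```

Running the estimator on typical product-photo sizes makes the trade-off concrete: simple categorisation queries can stay at the flat low-detail cost, while damage assessment pays the per-tile premium only when it is needed.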