
Multi-Modal LLMs - Processing Text, Images and Audio Together

Author: Venkata Sudhakar

ShopMax India receives product queries with photos, voice messages, and text descriptions through the same customer support channel. Multi-modal LLMs like GPT-4o and Gemini 1.5 Flash process all these input types in a single API call, enabling unified pipelines without separate models for each modality.

Multi-modal LLM APIs accept content as an array of typed parts within a single message. Image parts encode the image as base64 or pass a URL. Text parts can accompany any combination of other modalities. The model reasons across all inputs simultaneously, producing a single coherent response that considers both the image and the text query together.
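As a minimal sketch of this structure (field names follow the OpenAI Chat Completions message format; the question and image URL are placeholders):

```python
# A single user message whose "content" is an array of typed parts:
# one text part plus one image part referenced by URL.
message = {
    "role": "user",
    "content": [
        {"type": "text",
         "text": "Is the damage on this product covered under warranty?"},
        {"type": "image_url",
         "image_url": {"url": "https://example.com/damage-photo.jpg"}},
    ],
}
```

The same array can instead carry a base64-encoded image as a `data:` URL, and more parts can be appended as needed.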

The example below shows how ShopMax India processes a customer support query that includes a product damage photo and a text question using GPT-4o.
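A minimal sketch of such a call, assuming the official `openai` Python SDK (v1.x) with `OPENAI_API_KEY` set; the file name `damage_photo.jpg` is hypothetical, and the order ID is taken from the sample output below:

```python
import base64


def encode_image(path: str) -> str:
    """Read an image file and return its base64 encoding for the API payload."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def build_support_messages(order_id: str, question: str, image_b64: str) -> list:
    """Combine a text query and a product photo into one multi-modal request."""
    return [
        {"role": "system",
         "content": "You are a ShopMax India support agent. Assess product "
                    "damage from photos and advise on warranty coverage."},
        {"role": "user",
         "content": [
             {"type": "text", "text": f"Order {order_id}: {question}"},
             {"type": "image_url",
              "image_url": {"url": f"data:image/jpeg;base64,{image_b64}",
                            "detail": "high"}},
         ]},
    ]


if __name__ == "__main__":
    from openai import OpenAI  # assumption: openai>=1.0 SDK installed

    client = OpenAI()
    messages = build_support_messages(
        "ORD-55321",
        "My Samsung TV arrived with a cracked panel. Is this covered?",
        encode_image("damage_photo.jpg"),  # hypothetical local file
    )
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    print(response.choices[0].message.content)
```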


It gives the following output:

The image shows clear physical damage - a cracked panel on the Samsung TV.
Order ORD-55321 assessment:

Damage type: Physical/impact damage
Warranty status: Physical damage is NOT covered under standard warranty.
Recommended action: This appears to be transit damage. Please escalate to
our Mumbai logistics team for a replacement claim under the delivery damage
policy within 48 hours of receipt.

Compress images before encoding to reduce token usage - GPT-4o rescales images internally, and you are billed image input tokens based on the resulting dimensions. Use the detail: "low" option for simple categorisation tasks to cut image token cost by up to 75%, and detail: "high" only when fine visual details matter. For ShopMax India, store multi-modal conversation history carefully - image tokens in history accumulate quickly and push context costs high in long conversations.
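To see why image size matters, the token cost can be estimated ahead of time. The sketch below assumes OpenAI's published image-token accounting (a flat 85 tokens at low detail; at high detail, 85 base tokens plus 170 per 512-px tile after the image is fit within 2048x2048 and its shortest side scaled down to 768):

```python
import math


def gpt4o_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate image input tokens, assuming OpenAI's published rules:
    low detail is a flat 85 tokens; high detail rescales the image, then
    charges 85 base tokens plus 170 per 512-px tile."""
    if detail == "low":
        return 85
    # Step 1: scale down to fit within a 2048 x 2048 square.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: scale down so the shortest side is at most 768 px.
    scale2 = min(1.0, 768 / min(w, h))
    w, h = w * scale2, h * scale2
    # Step 3: count 512-px tiles covering the rescaled image.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles
```

For example, a 1024x1024 photo at high detail rescales to 768x768 (4 tiles, 765 tokens), while the same photo at low detail costs 85 tokens - roughly the "up to 75%" saving mentioned above, and the gap widens for larger images.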

