OpenAI Realtime API - Voice Conversations

Author: Venkata Sudhakar

The OpenAI Realtime API enables low-latency, bidirectional voice conversations with GPT-4o over a WebSocket connection, supporting both audio and text modalities in the same session. ShopMax India uses the Realtime API to build a voice-based customer support agent that handles order status queries, return requests, and product questions over the phone without any human agent involvement.

Clients connect to the Realtime API over a WebSocket at wss://api.openai.com/v1/realtime, passing the model as a query parameter. A session is configured by sending a session.update event that carries instructions, a voice (for example alloy, echo, or shimmer), input_audio_format, and turn_detection settings. Input audio is streamed as base64-encoded PCM16 chunks in input_audio_buffer.append events. The server emits response.audio.delta events carrying output audio chunks, followed by a response.done event when the turn completes.
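As a rough illustration of the two client events described above, the sketch below builds a session.update payload and splits raw PCM16 audio into input_audio_buffer.append events. The helper names (session_update_event, audio_append_events) and the 4800-byte chunk size are assumptions made for this example, not part of the API; the chunk size corresponds to about 100 ms of audio if the input is 24 kHz mono PCM16.

```python
import base64
import json


def session_update_event(instructions, voice="alloy"):
    """Build a session.update event (hypothetical helper for illustration)."""
    return {
        "type": "session.update",
        "session": {
            "modalities": ["text", "audio"],
            "instructions": instructions,
            "voice": voice,
            "input_audio_format": "pcm16",
            "output_audio_format": "pcm16",
            # Let the server detect where the caller starts and stops speaking.
            "turn_detection": {"type": "server_vad"},
        },
    }


def audio_append_events(pcm16, chunk_bytes=4800):
    """Yield input_audio_buffer.append events for raw PCM16 bytes.

    chunk_bytes is an assumed value: 4800 bytes is roughly 100 ms of
    24 kHz, 16-bit, mono audio.
    """
    for start in range(0, len(pcm16), chunk_bytes):
        chunk = pcm16[start:start + chunk_bytes]
        yield {
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        }


if __name__ == "__main__":
    event = session_update_event("You are a ShopMax India support agent.")
    print(json.dumps(event, indent=2))
    silence = bytes(10000)  # placeholder audio: 10,000 zero bytes
    print(sum(1 for _ in audio_append_events(silence)), "append events")
```

Each dictionary would be serialized with json.dumps and sent over the open WebSocket, in the same way the full script sends its events.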

The example below shows a Python script that connects to the OpenAI Realtime API, configures a ShopMax India support agent, sends a text query, and receives a spoken response together with its transcript.

import asyncio
import base64
import json
import os

import websockets

# Prefer an environment variable over a hard-coded key.
API_KEY = os.environ.get("OPENAI_API_KEY", "your-openai-api-key")
URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"

async def run_support_agent():
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "OpenAI-Beta": "realtime=v1"
    }

    # Recent websockets releases take additional_headers;
    # older releases call this argument extra_headers.
    async with websockets.connect(URL, additional_headers=headers) as ws:
        print("Connected to OpenAI Realtime API")

        # Configure the session
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "instructions": "You are a helpful customer support agent for ShopMax India, an electronics retailer. Help customers with order status, returns, and product queries. Be concise and polite.",
                "voice": "alloy",
                "output_audio_format": "pcm16"
            }
        }))

        # Send a text message as if typed by the customer
        await ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "message",
                "role": "user",
                "content": [{"type": "input_text", "text": "Hi, I ordered a laptop from ShopMax India. Order ID ORD-DEL-7721. When will it arrive?"}]
            }
        }))
        # Ask the model to respond to the conversation so far
        await ws.send(json.dumps({"type": "response.create"}))

        audio_chunks = []
        # Print the label once; transcript deltas are appended to the same line
        print("Agent: ", end="", flush=True)
        async for message in ws:
            event = json.loads(message)
            if event["type"] == "response.audio_transcript.delta":
                print(event.get("delta", ""), end="", flush=True)
            elif event["type"] == "response.audio.delta":
                # Audio arrives as base64-encoded PCM16
                audio_chunks.append(base64.b64decode(event["delta"]))
            elif event["type"] == "error":
                print("\nError:", event.get("error"))
                break
            elif event["type"] == "response.done":
                print("\nResponse complete.")
                break

        # Save audio response as raw PCM16 (no file header)
        with open("support_response.pcm", "wb") as f:
            for chunk in audio_chunks:
                f.write(chunk)
        print(f"Audio saved: {len(audio_chunks)} chunks")

asyncio.run(run_support_agent())

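It produces output similar to the following. The wording and the number of audio chunks vary from run to run, and because this script gives the model no access to order data, the delivery details in the reply are invented by the model; a production agent would need to look up the order, for example through function calling.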

Connected to OpenAI Realtime API
Agent: Thank you for contacting ShopMax India support! For order ORD-DEL-7721, I can see your laptop is currently in transit from our Delhi warehouse. Based on standard delivery timelines, it should arrive within 2 business days. You will receive an SMS with the tracking link shortly.
Response complete.
Audio saved: 47 chunks
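The saved support_response.pcm file is raw audio with no header, so most players will not open it directly. A minimal sketch of wrapping it in a WAV container with the standard-library wave module follows. It assumes the pcm16 output is 24 kHz, mono, 16-bit little-endian; adjust sample_rate if your session uses a different format. The function name pcm16_to_wav is hypothetical.

```python
import io
import wave


def pcm16_to_wav(pcm, sample_rate=24000, channels=1):
    """Wrap raw 16-bit PCM bytes in a WAV container and return the WAV bytes.

    sample_rate=24000 and channels=1 are assumptions about the
    Realtime API's pcm16 output format.
    """
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    return buffer.getvalue()


if __name__ == "__main__":
    # Convert the file written by the script above into a playable WAV.
    with open("support_response.pcm", "rb") as src:
        wav_bytes = pcm16_to_wav(src.read())
    with open("support_response.wav", "wb") as dst:
        dst.write(wav_bytes)
```

The resulting support_response.wav can be opened in an ordinary audio player to check what the agent actually said.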

The Realtime API charges for both input and output audio tokens, so enable turn_detection in server_vad mode so that the server detects speech boundaries automatically and silence is not processed as part of a turn. Use the session instructions to restrict the agent to questions relevant to your business, which prevents scope creep. Add session timeouts and reconnect logic, because WebSocket connections can drop on mobile networks. For production telephony integration, convert the PCM16 output to the codec your telecom provider requires (typically G.711 mu-law). Log all conversation transcripts for quality monitoring and compliance auditing.
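The reconnect advice can be sketched as a retry loop with capped exponential backoff. This is one possible approach under stated assumptions, not a prescribed pattern: run_with_reconnect, backoff_delays, and the session argument are hypothetical names, and the delay values are arbitrary. With the websockets library you would probably also want to catch its ConnectionClosed exception, which is omitted here to keep the sketch free of third-party imports.

```python
import asyncio


def backoff_delays(attempts, base=0.5, factor=2.0, cap=10.0):
    """Return one delay (in seconds) per attempt, doubling up to a cap."""
    return [min(cap, base * factor ** i) for i in range(attempts)]


async def run_with_reconnect(session, attempts=5, sleep=asyncio.sleep):
    """Run session() and retry with backoff when the connection fails.

    session is a hypothetical coroutine function that opens the WebSocket
    and handles one conversation, such as run_support_agent.
    """
    last_error = None
    for delay in backoff_delays(attempts):
        try:
            return await session()
        except (OSError, asyncio.TimeoutError) as exc:
            # Network-level failure: wait, then try again.
            last_error = exc
            await sleep(delay)
    raise ConnectionError(f"gave up after {attempts} attempts") from last_error


if __name__ == "__main__":
    async def demo_session():
        # Stand-in for a real Realtime API session.
        return "session finished"

    print(asyncio.run(run_with_reconnect(demo_session)))
```

Passing sleep as a parameter keeps the loop easy to test without real waiting. In a real deployment the session state (instructions, voice, conversation context) would need to be sent again after each reconnect.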
