
OpenAI Realtime API - Voice Conversations

Author: Venkata Sudhakar

The OpenAI Realtime API enables low-latency, bidirectional voice conversations with GPT-4o over a WebSocket connection, supporting both audio and text modalities in the same session. ShopMax India uses the Realtime API to build a voice-based customer support agent that handles order status queries, return requests, and product questions over the phone without any human agent involvement.

The Realtime API connects over a WebSocket to wss://api.openai.com/v1/realtime, with the model passed as a query parameter. A session is configured by sending a session.update event containing instructions, a voice (for example alloy, echo, or shimmer), input_audio_format, and turn_detection settings. Input audio is streamed as base64-encoded PCM16 chunks in input_audio_buffer.append events. The server emits response.audio.delta events carrying output audio chunks, followed by response.done when the turn completes.
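The client-side events described above can be sketched with the standard library alone. This is a minimal illustration, not the article's original listing: the helper names (realtime_url, session_update_event, audio_append_events) and the 4800-byte chunk size are hypothetical, and the payload layout follows the event names given in the text. Actually sending the messages would need a WebSocket client and an API key, which are left out here.

```python
import base64
import json
from urllib.parse import urlencode

REALTIME_URL = "wss://api.openai.com/v1/realtime"


def realtime_url(model: str) -> str:
    """Return the endpoint with the model passed as a query parameter."""
    return f"{REALTIME_URL}?{urlencode({'model': model})}"


def session_update_event(instructions: str, voice: str = "alloy") -> str:
    """Build the JSON text of a session.update event (assumed payload shape)."""
    event = {
        "type": "session.update",
        "session": {
            "instructions": instructions,
            "voice": voice,
            "input_audio_format": "pcm16",
            "turn_detection": {"type": "server_vad"},
        },
    }
    return json.dumps(event)


def audio_append_events(pcm16: bytes, chunk_bytes: int = 4800):
    """Yield input_audio_buffer.append events, one per base64-encoded chunk."""
    for start in range(0, len(pcm16), chunk_bytes):
        chunk = pcm16[start:start + chunk_bytes]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })
```

Each returned string would be sent as one WebSocket text message. The model identifier passed to realtime_url depends on the models available to the account, so it is left as a parameter rather than hard-coded.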

The example below shows a Python script that connects to the OpenAI Realtime API, configures a ShopMax India support agent, sends a text query, and receives a voice response.


It produces the following output:

Connected to OpenAI Realtime API
Agent: Thank you for contacting ShopMax India support! For order ORD-DEL-7721, I can see your laptop is currently in transit from our Delhi warehouse. Based on standard delivery timelines, it should arrive within 2 business days. You will receive an SMS with the tracking link shortly.
Response complete.
Audio saved: 47 chunks
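The chunk count in the output suggests the script accumulates response.audio.delta payloads until response.done arrives. A minimal sketch of that receive side follows, under these assumptions: each delta event carries its base64 audio in a "delta" field, and the PCM16 output is 24 kHz mono. The function names are hypothetical, and the messages argument stands in for whatever iterable of JSON strings the WebSocket client yields.

```python
import base64
import io
import json
import wave


def collect_audio(messages):
    """Decode audio deltas from server messages until response.done is seen."""
    chunks = []
    for raw in messages:
        event = json.loads(raw)
        kind = event.get("type")
        if kind == "response.audio.delta":
            # Assumed field name: the base64 audio sits under "delta".
            chunks.append(base64.b64decode(event["delta"]))
        elif kind == "response.done":
            break
    return chunks


def to_wav_bytes(chunks, sample_rate: int = 24000) -> bytes:
    """Wrap raw PCM16 mono chunks in a WAV container (sample rate assumed)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(b"".join(chunks))
    return buffer.getvalue()
```

The bytes returned by to_wav_bytes can be written to disk as a .wav file for playback, and len(chunks) gives the figure reported in an "Audio saved" style log line.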

The Realtime API charges for both input and output audio tokens, so set turn_detection to server_vad mode to detect speech boundaries automatically and avoid processing silence. Write session instructions that restrict the agent to questions relevant to your business, so it does not drift into unrelated topics. Use session timeouts and reconnect logic, as WebSocket connections can drop on mobile networks. For production telephony integration, convert the PCM16 output to the codec your telecom provider requires (typically G.711 mu-law). Log all conversation transcripts for quality monitoring and compliance auditing.
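The PCM16 to G.711 mu-law step can be illustrated in pure Python. This is a sketch of the standard mu-law companding algorithm (bias 0x84, clip 32635), not a production transcoder: the function names are hypothetical, the input is assumed to be little-endian 16-bit samples, and resampling to the 8 kHz rate that telephony normally expects is a separate step not shown here.

```python
import struct

BIAS = 0x84    # standard mu-law bias added before finding the segment
CLIP = 32635   # largest magnitude that still fits after adding the bias


def linear_to_ulaw(sample: int) -> int:
    """Encode one signed 16-bit sample as an 8-bit mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    # Find the segment (exponent): position of the highest set bit, 7 down to 0.
    exponent = 7
    mask = 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    # mu-law bytes are stored bit-inverted.
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def pcm16_to_ulaw(pcm16: bytes) -> bytes:
    """Encode little-endian PCM16 bytes to mu-law, one byte per sample."""
    return bytes(linear_to_ulaw(s) for (s,) in struct.iter_unpack("<h", pcm16))
```

Silence (sample value 0) encodes to 0xFF, which is a quick sanity check when wiring the output into a telephony stream.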


 
  


  