OpenAI Audio API - Text-to-Speech and Whisper Transcription
Author: Venkata Sudhakar
The OpenAI Audio API offers two core capabilities - Text-to-Speech (TTS) for converting text into natural-sounding audio, and Whisper for transcribing speech into text. ShopMax India uses TTS to send automated voice notifications to customers for order confirmations and delivery updates, and Whisper to convert inbound voice complaints from the support hotline into text for automated ticket creation.
TTS is accessed via client.audio.speech.create(), accepting model (tts-1 for low latency, tts-1-hd for high quality), voice (alloy, echo, fable, onyx, nova, shimmer), and input text. The response is binary audio that can be streamed to disk, typically as an MP3 file. Whisper is accessed via client.audio.transcriptions.create(), accepting an audio file (MP3, WAV, M4A, and other common formats) and model=whisper-1. Setting the language parameter to the ISO-639-1 code improves accuracy for regional accents.
The example below shows ShopMax India generating a TTS order-confirmation message for a Mumbai customer and transcribing an inbound voice complaint into a support ticket.
It gives the following output:
Audio saved: order_confirmation.mp3
Transcript: I received a damaged TV remote with my order. The buttons are not working and the packaging was torn. Please send a replacement.
Support ticket created: TKT-00421
Use tts-1 for real-time notifications (lower latency) and tts-1-hd for pre-recorded content like IVR menus. Cache common TTS phrases such as shipping updates to avoid repeated API calls. For Whisper, pass language="hi" for Hindi audio or language="ta" for Tamil to improve accuracy on Indian accents. Whisper accepts files up to 25 MB; split longer recordings before sending. Store transcripts alongside ticket IDs in the database for complete audit trails.
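The phrase-caching advice above can be sketched as a small on-disk cache keyed by a hash of the phrase and voice, so repeated shipping-update notifications never hit the API twice. The directory name tts_cache and the get_or_synthesize helper are assumptions for illustration.

```python
# Hedged sketch: on-disk cache for common TTS phrases. Only cache misses
# trigger an API call; hits reuse the previously saved MP3.
import hashlib
from pathlib import Path

CACHE_DIR = Path("tts_cache")  # assumed location, not part of the API


def cache_path(phrase: str, voice: str = "alloy") -> Path:
    """Deterministic file name for a (phrase, voice) pair."""
    key = hashlib.sha256(f"{voice}:{phrase}".encode()).hexdigest()[:16]
    return CACHE_DIR / f"{key}.mp3"


def get_or_synthesize(client, phrase: str, voice: str = "alloy") -> Path:
    path = cache_path(phrase, voice)
    if path.exists():
        return path  # cache hit: no API call needed
    CACHE_DIR.mkdir(exist_ok=True)
    with client.audio.speech.with_streaming_response.create(
        model="tts-1", voice=voice, input=phrase
    ) as response:
        response.stream_to_file(path)
    return path
```

Because the cache key includes the voice, the same phrase rendered in alloy and nova is stored as two separate files.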