Gemini Audio Understanding

Author: Venkata Sudhakar

The Gemini API supports audio input natively. You can upload audio files and ask Gemini to transcribe speech, identify speakers, summarise conversations, and extract structured data from recordings. ShopMax India uses this to process customer service call recordings automatically.

Supported audio formats include MP3, WAV, FLAC, AAC, OGG, and WEBM. A single prompt can carry up to 9.5 hours of audio, and files of up to 2 GB each can be uploaded through the Files API. Audio can also be inlined as base64 directly in the request, provided the total request size stays under 20 MB.
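The choice between inlining and the Files API can be sketched as a small helper. This is only an illustration built on the 20 MB figure above: the names `audio_mime_type` and `delivery_method`, the MIME strings in the mapping, and the 1 MB of headroom reserved for the prompt are assumptions, not part of the Gemini API.

```python
from pathlib import Path

# Inline limit quoted in the text above (applies to the whole request).
INLINE_LIMIT_BYTES = 20 * 1024 * 1024

# Assumed extension-to-MIME mapping for the formats listed above.
AUDIO_MIME_TYPES = {
    ".mp3": "audio/mp3",
    ".wav": "audio/wav",
    ".flac": "audio/flac",
    ".aac": "audio/aac",
    ".ogg": "audio/ogg",
    ".webm": "audio/webm",
}


def audio_mime_type(path: str) -> str:
    """Map a file extension to a MIME type, rejecting unknown formats."""
    suffix = Path(path).suffix.lower()
    if suffix not in AUDIO_MIME_TYPES:
        raise ValueError(f"Unsupported audio format: {suffix or path}")
    return AUDIO_MIME_TYPES[suffix]


def delivery_method(size_bytes: int, headroom_bytes: int = 1024 * 1024) -> str:
    """Return "inline" if the audio should fit in one request, else "files_api".

    Conservative assumption: the base64-encoded size (4 bytes for every
    3 raw bytes) plus some headroom for the prompt must stay under the limit.
    """
    encoded_size = (size_bytes + 2) // 3 * 4
    if encoded_size + headroom_bytes < INLINE_LIMIT_BYTES:
        return "inline"
    return "files_api"
```

For a file on disk, `delivery_method(Path("call_recording.wav").stat().st_size)` gives the suggested route; anything the helper rejects for inlining goes through the Files API instead.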

The following example shows how to upload a WAV file and transcribe the speech using Gemini.
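A minimal sketch of this upload-and-transcribe flow, assuming the `google-genai` Python SDK (`pip install google-genai`) and an API key in the `GEMINI_API_KEY` environment variable. The model name, the prompt wording, the file name `call_recording.wav`, and the helper names `make_client` and `transcribe_call` are illustrative assumptions.

```python
import os

# Assumed model name; any audio-capable Gemini model should work here.
MODEL = "gemini-2.5-flash"

# Assumed prompt wording; adjust the speaker labels to your own calls.
TRANSCRIBE_PROMPT = (
    "Transcribe this customer service call. "
    "Label each turn with the speaker, either 'Agent:' or 'Customer:'."
)


def make_client():
    """Create a google-genai client from the GEMINI_API_KEY variable."""
    # Imported lazily so the helpers below load even without the SDK.
    from google import genai

    return genai.Client(api_key=os.environ["GEMINI_API_KEY"])


def transcribe_call(client, wav_path: str, model: str = MODEL) -> str:
    """Upload the recording via the Files API and ask for a transcript."""
    audio_file = client.files.upload(file=wav_path)
    response = client.models.generate_content(
        model=model,
        contents=[TRANSCRIBE_PROMPT, audio_file],
    )
    return response.text
```

With the SDK installed and the key set, `print(transcribe_call(make_client(), "call_recording.wav"))` prints the transcript. The exact wording of the result depends on the model and the recording.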

It gives the following output:

Agent: Thank you for calling ShopMax India. How can I help you today?
Customer: Hi, I ordered a Samsung TV last week but it has not arrived yet.
Agent: I am sorry to hear that. Can I get your order number please?
Customer: Yes, it is SM-2024-98765.
Agent: Thank you. Let me check the delivery status for you right away.

The following example shows how to extract structured information from the same recording using Gemini structured output.
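A minimal sketch of structured extraction, again assuming the `google-genai` Python SDK, where `client` is a `genai.Client` instance. The schema fields mirror the output shown in this article; the `sentiment` enum values, the prompt wording, the model name, and the helper name `extract_call_details` are assumptions.

```python
import json

# Assumed model name; any audio-capable Gemini model should work here.
MODEL = "gemini-2.5-flash"

# Response schema describing the fields to extract from the call.
CALL_SCHEMA = {
    "type": "OBJECT",
    "properties": {
        "customer_issue": {"type": "STRING"},
        "order_number": {"type": "STRING"},
        "sentiment": {
            "type": "STRING",
            "enum": ["positive", "neutral", "negative"],  # assumed labels
        },
        "resolution": {"type": "STRING"},
        "follow_up_needed": {"type": "BOOLEAN"},
    },
    "required": [
        "customer_issue",
        "order_number",
        "sentiment",
        "resolution",
        "follow_up_needed",
    ],
}


def extract_call_details(client, wav_path: str, model: str = MODEL) -> dict:
    """Upload the recording and request JSON that follows CALL_SCHEMA."""
    audio_file = client.files.upload(file=wav_path)
    response = client.models.generate_content(
        model=model,
        contents=[
            "Extract the key details of this customer service call.",
            audio_file,
        ],
        config={
            "response_mime_type": "application/json",
            "response_schema": CALL_SCHEMA,
        },
    )
    # The model returns a JSON string; parse it into a Python dict.
    return json.loads(response.text)
```

Calling `extract_call_details(client, "call_recording.wav")` returns a dictionary that can be printed with `json.dumps(result, indent=2)`. Constraining `sentiment` to an enum keeps the values consistent across calls, which helps with later aggregation.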

It gives the following output:

{
  "customer_issue": "TV order not delivered after one week",
  "order_number": "SM-2024-98765",
  "sentiment": "neutral",
  "resolution": "Agent checking delivery status",
  "follow_up_needed": true
}

ShopMax India processes hundreds of customer calls daily across Mumbai, Bangalore, and Hyderabad. Using Gemini audio understanding, the team automatically tags unresolved issues, tracks order complaints by region, and routes priority cases to senior agents, saving hours of manual review per shift.
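The tagging and routing step could look roughly like the sketch below, which works on records shaped like the structured output shown earlier. The `region` field (taken from call metadata rather than from Gemini), the priority rule, and the function names are hypothetical; the article does not describe ShopMax India's actual rules.

```python
from collections import Counter


def needs_senior_agent(record: dict) -> bool:
    """Assumed priority rule: unresolved calls with negative sentiment."""
    return record["follow_up_needed"] and record["sentiment"] == "negative"


def summarise_calls(records: list[dict]) -> dict:
    """Count unresolved calls per region and list priority order numbers."""
    unresolved = [r for r in records if r["follow_up_needed"]]
    # "region" is a hypothetical field attached from call metadata.
    by_region = Counter(r["region"] for r in unresolved)
    priority = [r["order_number"] for r in records if needs_senior_agent(r)]
    return {
        "unresolved": len(unresolved),
        "by_region": dict(by_region),
        "priority": priority,
    }
```

Each record here is the extracted JSON for one call with a `region` key added before aggregation.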
