Gemini API Cost Optimisation
Author: Venkata Sudhakar
Gemini API costs scale with token usage: input tokens, output tokens, and context caching all affect your bill. Understanding and optimising token consumption is essential for running AI features profitably at scale. ShopMax India reduced its monthly Gemini spend by 40% after applying a set of targeted optimisation techniques across its agent and batch pipelines.

The four most impactful optimisations are:

- choosing the right model tier (gemini-2.0-flash vs gemini-2.0-pro)
- capping output tokens
- using context caching for repeated system prompts
- routing simple queries away from Gemini entirely

Each technique applies in different scenarios; the key is measuring before optimising. The example below shows how to measure token usage and estimate the cost of each request before optimising.
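A minimal sketch of per-request measurement, assuming the google-genai Python SDK (`pip install google-genai`) and a GEMINI_API_KEY environment variable. The pricing constants and the example prompt are illustrative assumptions, not official figures; always check the current Gemini price list before relying on the estimate.

```python
import os

# Assumed flash-tier pricing; verify against the current Gemini price list.
PRICE_PER_M_INPUT = 0.10   # USD per 1M input tokens (assumption)
PRICE_PER_M_OUTPUT = 0.40  # USD per 1M output tokens (assumption)

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request from its token counts."""
    return (input_tokens * PRICE_PER_M_INPUT
            + output_tokens * PRICE_PER_M_OUTPUT) / 1_000_000

if os.environ.get("GEMINI_API_KEY"):
    from google import genai

    client = genai.Client()  # picks up GEMINI_API_KEY from the environment
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents="What is the return policy for electronics?",  # example prompt
    )
    # usage_metadata reports the exact token counts billed for this request.
    usage = response.usage_metadata
    cost = estimate_cost(usage.prompt_token_count, usage.candidates_token_count)
    print(f"Input tokens: {usage.prompt_token_count}")
    print(f"Output tokens: {usage.candidates_token_count}")
    print(f"Est. cost: ${cost:.6f} USD")
    print(f"Response: {response.text}")
```

Logging `estimate_cost` per request is what makes the later optimisations measurable: you can compare cost per request before and after each change rather than guessing.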
It gives the following output:
Input tokens: 18
Output tokens: 24
Est. cost: $0.000009 USD
Response: ShopMax India accepts returns within 10 days for unused electronics in original packaging.
The example below shows context caching: uploading a large system prompt once and reusing it across requests, so you avoid paying full price for the same input tokens on every call.
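A sketch of explicit context caching with the google-genai SDK, under the same assumptions as before (SDK installed, GEMINI_API_KEY set). The model version, placeholder catalogue prompt, and the `cached_input_saving` helper are illustrative assumptions; note also that caching has a model-specific minimum token count and is billed at a discounted cached-token rate rather than being free.

```python
import os

def cached_input_saving(cached_tokens: int, requests: int,
                        price_per_m_input: float = 0.10) -> float:
    """Rough USD saved by not resending `cached_tokens` input tokens on
    every request (ignores the discounted rate billed for cached tokens)."""
    return cached_tokens * requests * price_per_m_input / 1_000_000

if os.environ.get("GEMINI_API_KEY"):
    from google import genai
    from google.genai import types

    client = genai.Client()
    # Create the cache once; caching typically requires an explicitly
    # versioned model name and a minimum number of cached tokens.
    cache = client.caches.create(
        model="gemini-2.0-flash-001",
        config=types.CreateCachedContentConfig(
            system_instruction="You are the product assistant. "
                               "Catalogue: ...",  # the real 2,000-token prompt
        ),
    )
    print(f"Cache created: {cache.name}")
    print(f"Cached tokens: {cache.usage_metadata.total_token_count}")

    # Each request references the cache instead of resending the prompt.
    for question in ("Which TVs are on sale?", "What laptops are available?"):
        response = client.models.generate_content(
            model="gemini-2.0-flash-001",
            contents=question,
            config=types.GenerateContentConfig(cached_content=cache.name),
        )
        print(f"Q: {question}")
        print(f"A: {response.text}")
```

The helper gives a back-of-the-envelope upper bound: a 2,000-token prompt reused across 1,000 requests at the assumed $0.10 per 1M input tokens avoids roughly $0.20 of full-price input, before subtracting the cached-token and storage charges.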
It gives the following output:
Cache created: cachedContents/shopmax_catalogue_cache_abc123
Cached tokens: 1024
Q: Which TVs are on sale?
A: ShopMax India currently has Samsung, LG, and Sony TVs on sale with discounts up to 20%.
Q: What laptops are available?
A: ShopMax India stocks laptops from Dell, HP, Lenovo, and Apple across all price ranges.
ShopMax India applied three optimisations: it switched batch jobs from gemini-2.0-pro to gemini-2.0-flash (a 70% cost reduction on those jobs), capped max_output_tokens at 200 for customer support responses (a 30% reduction in output tokens), and cached the 2,000-token system prompt shared by all support agents (saving Rs 8,000 per month at current traffic). Combined, these changes cut the monthly AI infrastructure bill from Rs 45,000 to Rs 27,000, a 40% reduction, with no loss in response quality.