Gemini Token Counting and Cost Optimisation
Author: Venkata Sudhakar
Every Gemini API call is billed by token: input tokens for what you send and output tokens for what the model generates. Understanding token counts before they become invoices is essential for production cost management. Gemini provides a count_tokens API that returns the exact token count for any content before you send it, letting you gate expensive calls, trim prompts that exceed limits, and forecast monthly costs from observed usage patterns. A 10,000-token prompt costs dramatically more than a 500-token one, so optimising prompt length is the single highest-leverage cost reduction available.

The key optimisation techniques are: count tokens before sending to gate calls above a cost threshold; trim context to include only the most recent N turns rather than the full history; use cheaper models (gemini-2.0-flash instead of gemini-1.5-pro) for simple tasks; cache repeated large contexts with context caching (Tutorial 312); use the Batch API (Tutorial 300) for non-real-time work at a 50 percent discount; and monitor per-session token usage via the usage_metadata field in every response to catch runaway conversations before they blow the budget.

The example below builds a cost-aware wrapper that counts tokens before every API call, trims conversation history when the context grows too long, and tracks cumulative spend per session against a budget alert threshold.
Cost-aware chat function with token counting and budget tracking:
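A minimal sketch of such a wrapper follows. The per-million-token prices, the MAX_INPUT_TOKENS and budget thresholds, and the helper names (call_cost, trim_history, CostAwareChat) are illustrative assumptions, not published rates or part of the SDK; the count_tokens and generate_content calls assume the google-genai client interface.

```python
# Sketch of a cost-aware Gemini chat wrapper. Prices, thresholds, and helper
# names are illustrative assumptions; substitute current published pricing.
from dataclasses import dataclass, field

PRICE_IN_PER_M = 0.10    # placeholder $ per million input tokens
PRICE_OUT_PER_M = 0.40   # placeholder $ per million output tokens
MAX_INPUT_TOKENS = 2000  # trim history when the prompt exceeds this
BUDGET_ALERT_USD = 0.01  # alert when session spend crosses this

def call_cost(in_tokens: int, out_tokens: int) -> float:
    """Dollar cost of one call at the placeholder rates above."""
    return (in_tokens * PRICE_IN_PER_M + out_tokens * PRICE_OUT_PER_M) / 1e6

def trim_history(history: list, max_tokens: int, count_fn) -> list:
    """Drop the oldest turns until count_fn(history) fits under max_tokens."""
    while len(history) > 1 and count_fn(history) > max_tokens:
        history = history[1:]
    return history

@dataclass
class CostAwareChat:
    client: object                      # google-genai client (assumed API)
    model: str = "gemini-2.0-flash"
    history: list = field(default_factory=list)
    session_spend: float = 0.0

    def count(self, contents) -> int:
        # Count tokens server-side before committing to a billed call.
        return self.client.models.count_tokens(
            model=self.model, contents=contents).total_tokens

    def send(self, user_msg: str) -> str:
        self.history.append(user_msg)
        print(f"Token count before trim: {self.count(self.history)}")
        self.history = trim_history(self.history, MAX_INPUT_TOKENS, self.count)
        resp = self.client.models.generate_content(
            model=self.model, contents=self.history)
        usage = resp.usage_metadata
        cost = call_cost(usage.prompt_token_count, usage.candidates_token_count)
        self.session_spend += cost
        print(f"Tokens in/out: {usage.prompt_token_count} / "
              f"{usage.candidates_token_count} | Call: ${cost:.5f} "
              f"| Session: ${self.session_spend:.4f}")
        if self.session_spend > BUDGET_ALERT_USD:
            print("BUDGET ALERT: session spend over threshold")
        self.history.append(resp.text)
        return resp.text
```

The pure helpers (call_cost, trim_history) are kept separate from the API calls so the trimming and budget logic can be tested without a network round trip.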
It produces the following output, with full token and cost visibility per call:
Token count before trim: 42
Reply: I checked order ORD-88421 and it is out for delivery, expected today.
Tokens in/out: 42 / 38 | Call: $0.00001 | Session: $0.0000
Token count before trim: 112
Reply: Electronics can be returned within 7 days if unused and in original packaging.
Tokens in/out: 112 / 32 | Call: $0.00002 | Session: $0.0000
Token count before trim: 198
Reply: Yes, you can exchange a TV purchased within the last 7 days at any ShopMax store.
Tokens in/out: 198 / 28 | Call: $0.00002 | Session: $0.0001
# Token count grows with history - trim kicks in when it exceeds MAX_INPUT_TOKENS
# At 1,000 conversations/day this visibility prevents surprise monthly bills
# Budget alert fires immediately when session spend crosses threshold
Cost optimisation priority order: first, switch from gemini-1.5-pro to gemini-2.0-flash for tasks that do not need maximum reasoning; this alone cuts cost by roughly 10x. Second, trim conversation history to keep context under 2,000 tokens for most customer service interactions. Third, enable context caching for any system prompt or knowledge base content sent with every request. Fourth, use the Batch API for any non-real-time processing, such as nightly report generation. Fifth, monitor per-session token counts in Cloud Logging and set a billing alert in GCP at 80 percent of your monthly budget. Together, these five steps typically reduce Gemini API costs by 60 to 80 percent compared to a naive implementation.
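As a back-of-envelope check on the first step, the forecast arithmetic can be sketched as below. All rates and volumes here are illustrative placeholders, not published Gemini prices; substitute the current rate card before relying on the numbers.

```python
# Back-of-envelope monthly cost forecast. All rates below are illustrative
# placeholders; look up the current published Gemini pricing before using.
def monthly_cost(calls_per_day, in_tokens, out_tokens,
                 price_in_per_m, price_out_per_m, days=30):
    """Projected monthly spend from observed per-call token averages."""
    per_call = (in_tokens * price_in_per_m + out_tokens * price_out_per_m) / 1e6
    return calls_per_day * per_call * days

# 1,000 calls/day at ~500 input and ~100 output tokens per call.
pro = monthly_cost(1000, 500, 100, price_in_per_m=1.25, price_out_per_m=5.00)
flash = monthly_cost(1000, 500, 100, price_in_per_m=0.10, price_out_per_m=0.40)
print(f"pro: ${pro:.2f}/month, flash: ${flash:.2f}/month, "
      f"ratio: {pro / flash:.1f}x")
```

At these placeholder rates the cheaper model is an order of magnitude less expensive for the same traffic, which is why the model switch sits at the top of the priority list.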