In Browser
	StumbleUpon
	del.icio.us
	Google
	Google Buzz
	reddit
	LinkedIn

	Facebook
	Twitter
	Linkedin
	E-Mail

Generative AI > Google Gemini API > Gemini Token Counting and Cost Optimisation

Gemini Token Counting and Cost Optimisation

Author: Venkata Sudhakar

Every Gemini API call is billed by token ï¿½ input tokens for what you send and output tokens for what the model generates. Understanding token counts before they become invoices is essential for production cost management. Gemini provides a count_tokens API that returns the exact token count for any content before you send it, letting you gate expensive calls, trim prompts that exceed limits, and forecast monthly costs based on observed usage patterns. A 10,000-token prompt costs dramatically more than a 500-token prompt ï¿½ optimising prompt length is the single highest-leverage cost reduction available.

The key optimisation techniques are: count tokens before sending to gate calls above a cost threshold; trim context to include only the most recent N turns rather than full history; use cheaper models (gemini-2.0-flash instead of gemini-1.5-pro) for simple tasks; cache repeated large contexts with context caching (Tutorial 312); use the Batch API (Tutorial 300) for non-real-time work at 50 percent discount; and monitor per-session token usage via usage_metadata in every response to catch runaway conversations before they blow the budget.

The below example builds a cost-aware wrapper that counts tokens before every API call, trims conversation history when context grows too long, and tracks cumulative spend per session with a budget alert threshold.

from google import genai
from google.genai import types

client = genai.Client(api_key="your-gemini-api-key")

# Pricing per million tokens (gemini-2.0-flash as of 2025)
PRICE = {"input": 0.075, "output": 0.30}  # USD per million tokens
MAX_INPUT_TOKENS = 4000   # trim history if context exceeds this
BUDGET_ALERT_USD = 0.05   # warn when session spend exceeds this

def count_tokens(messages: list, system: str = "") -> int:
    # Count tokens for a given message list before sending
    contents = []
    if system:
        contents.append(types.Content(
            role="user", parts=[types.Part(text="[SYSTEM] " + system)]
        ))
    for m in messages:
        contents.append(types.Content(
            role=m["role"],
            parts=[types.Part(text=m["content"])]
        ))
    resp = client.models.count_tokens(
        model="gemini-2.0-flash",
        contents=contents
    )
    return resp.total_tokens

def trim_history(messages: list, system: str, max_tokens: int) -> list:
    # Remove oldest turns until context fits within token budget
    while len(messages) > 2:
        token_count = count_tokens(messages, system)
        if token_count <= max_tokens:
            break
        messages = messages[2:]  # remove oldest user+assistant pair
    return messages

def calculate_cost(input_tokens: int, output_tokens: int) -> float:
    return ((input_tokens  * PRICE["input"] +
             output_tokens * PRICE["output"]) / 1_000_000)

Cost-aware chat function with token counting and budget tracking,

SYSTEM = "You are a ShopMax India customer service agent. Answer concisely."
session_history = []
session_cost_usd = 0.0

def cost_aware_chat(user_message: str) -> dict:
    global session_history, session_cost_usd

# Add user message
    session_history.append({"role": "user", "content": user_message})

# Count tokens BEFORE sending
    raw_count = count_tokens(session_history, SYSTEM)
    print("Token count before trim:", raw_count)

# Trim if too long
    if raw_count > MAX_INPUT_TOKENS:
        session_history = trim_history(session_history, SYSTEM, MAX_INPUT_TOKENS)
        trimmed_count = count_tokens(session_history, SYSTEM)
        print("Trimmed to:", trimmed_count, "tokens")

# Send to Gemini
    contents = [types.Content(role=m["role"],
                              parts=[types.Part(text=m["content"])])
                for m in session_history]
    resp = client.models.generate_content(
        model="gemini-2.0-flash",
        config=types.GenerateContentConfig(
            system_instruction=SYSTEM, max_output_tokens=200
        ),
        contents=contents
    )
    reply = resp.text
    session_history.append({"role": "model", "content": reply})

# Track cost from usage metadata
    u = resp.usage_metadata
    call_cost = calculate_cost(u.prompt_token_count or 0,
                               u.candidates_token_count or 0)
    session_cost_usd += call_cost

if session_cost_usd > BUDGET_ALERT_USD:
        print("BUDGET ALERT: session spend $" + str(round(session_cost_usd, 4)))

return {
        "reply":        reply,
        "input_tokens": u.prompt_token_count,
        "output_tokens":u.candidates_token_count,
        "call_cost_usd":round(call_cost, 5),
        "session_total": round(session_cost_usd, 4)
    }

for msg in ["Hi, where is order ORD-88421?",
            "What is your return policy for electronics?",
            "Can I exchange a TV I bought last week?"]:
    result = cost_aware_chat(msg)
    print("Reply:", result["reply"][:100])
    print("Tokens in/out:", result["input_tokens"], "/", result["output_tokens"],
          "| Call: $" + str(result["call_cost_usd"]),
          "| Session: $" + str(result["session_total"]))
    print()

It gives the following output with full token and cost visibility per call,

Token count before trim: 42
Reply: I checked order ORD-88421 and it is out for delivery, expected today.
Tokens in/out: 42 / 38 | Call: $0.00001 | Session: $0.0000

Token count before trim: 112
Reply: Electronics can be returned within 7 days if unused and in original packaging.
Tokens in/out: 112 / 32 | Call: $0.00002 | Session: $0.0000

Token count before trim: 198
Reply: Yes, you can exchange a TV purchased within the last 7 days at any ShopMax store.
Tokens in/out: 198 / 28 | Call: $0.00002 | Session: $0.0001

# Token count grows with history - trim kicks in when it exceeds MAX_INPUT_TOKENS
# At 1,000 conversations/day this visibility prevents surprise monthly bills
# Budget alert fires immediately when session spend crosses threshold

Cost optimisation priority order: first switch from gemini-1.5-pro to gemini-2.0-flash for tasks that do not need maximum reasoning ï¿½ this alone cuts cost by 10x. Second, trim conversation history to keep context under 2000 tokens for most customer service interactions. Third, enable context caching for any system prompt or knowledge base content sent with every request. Fourth, use the Batch API for any non-real-time processing like nightly report generation. Fifth, monitor per-session token counts in Cloud Logging and set a billing alert in GCP at 80 percent of your monthly budget. These five steps together typically reduce Gemini API costs by 60 to 80 percent compared to a naive implementation.

Send your comments, suggestions or queries regarding this site to [email protected].