
Gemini API Prompt Caching Strategies

Author: Venkata Sudhakar

When multiple API calls share the same large context, such as a 500-page product catalogue, a lengthy system prompt, or a legal document, you pay to process that context on every call. Gemini context caching lets you process it once and reuse the cached result for up to 24 hours. ShopMax India caches its 200-page product knowledge base and saves about 60% on API costs for customer support queries.

You create a cache using genai.caching.CachedContent.create(), passing the content and a TTL. Subsequent calls reference the cached content by name instead of resending it. Cached tokens count towards your storage quota but are billed at a much lower rate than reprocessing the same tokens on every request.

The example below shows how ShopMax India caches a large product knowledge base and runs multiple queries against it efficiently.
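A minimal sketch of the cache-creation step with the google-generativeai SDK. The knowledge base file path, display name, and system instruction are illustrative assumptions; you also need genai.configure(api_key=...) called beforehand, and caching requires an explicitly versioned model such as gemini-1.5-flash-001.

```python
import datetime

# 4-hour TTL matches the ShopMax scenario described above.
CACHE_TTL = datetime.timedelta(hours=4)

def create_kb_cache(kb_text: str, ttl: datetime.timedelta = CACHE_TTL):
    """Create a Gemini context cache holding the product knowledge base.

    Needs `pip install google-generativeai` and genai.configure(api_key=...)
    to have been called.
    """
    import google.generativeai as genai  # deferred import keeps the sketch importable
    return genai.caching.CachedContent.create(
        model="models/gemini-1.5-flash-001",  # caching needs an explicitly versioned model
        display_name="shopmax-kb",            # illustrative name
        system_instruction=(
            "You are ShopMax India's support assistant. Answer only from "
            "the product knowledge base provided."
        ),
        contents=[kb_text],
        ttl=ttl,
    )

# Example (requires real credentials and the KB file on disk):
# cache = create_kb_cache(open("shopmax_product_kb.txt", encoding="utf-8").read())
# print(f"Cache created: {cache.name}")
# print(f"Cached tokens: {cache.usage_metadata.total_token_count}")
# print(f"Expires: {cache.expire_time}")
```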


Running multiple queries against the cached context,
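A sketch of the query loop, assuming an existing CachedContent object: the model is built from the cached content with GenerativeModel.from_cached_content(), so each request sends only the short question. The token printout assumes prompt_token_count includes the cached tokens, so the non-cached portion is the difference; the final update() call refreshes the cache's TTL.

```python
import datetime

# Questions from the ShopMax example above.
QUESTIONS = [
    "What is the warranty period for Samsung televisions?",
    "Does ShopMax India offer installation service for ACs?",
]

def run_queries(cache, questions=QUESTIONS):
    """Answer each question against an existing CachedContent object."""
    import google.generativeai as genai  # requires `pip install google-generativeai`
    model = genai.GenerativeModel.from_cached_content(cached_content=cache)
    for q in questions:
        response = model.generate_content(q)
        usage = response.usage_metadata
        print(f"Q: {q}")
        print(f"A: {response.text}")
        # Non-cached prompt tokens = total prompt tokens minus the cached portion.
        print(f"   Tokens - prompt: "
              f"{usage.prompt_token_count - usage.cached_content_token_count}, "
              f"cached: {usage.cached_content_token_count}")
    # Extend the cache's lifetime so follow-up traffic keeps hitting it.
    cache.update(ttl=datetime.timedelta(hours=4))
    print("Cache TTL refreshed")

# Example (requires real credentials and a cache created earlier):
# run_queries(cache)
```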


It gives the following output,

Cache created: cachedContents/shopmax-kb-abc123
Cached tokens: 84,320
Expires: 2024-11-15 18:00:00 UTC

Q: What is the warranty period for Samsung televisions?
A: Samsung televisions at ShopMax India come with a 1-year manufacturer warranty...
   Tokens - prompt: 12, cached: 84320

Q: Does ShopMax India offer installation service for ACs?
A: Yes, ShopMax India offers free installation for air conditioners above Rs 30,000...
   Tokens - prompt: 14, cached: 84320

Cache TTL refreshed

The key metric is cached_content_token_count: those tokens are billed at the discounted cached-token rate plus a per-hour storage charge, rather than at the full inference rate. For ShopMax India, caching the 84,000-token product KB for 4 hours and running 500 queries against it saves approximately Rs 8,000 per day in API costs compared with sending the full context on every call.
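The arithmetic behind a savings estimate like this can be sketched as full-rate cost minus discounted-rate cost plus the storage charge for keeping the cache alive. The rates below are placeholders for illustration, not Gemini's actual price list:

```python
def daily_context_cost(queries: int, ctx_tokens: int, rate_per_mtok: float) -> float:
    """Cost of sending ctx_tokens of context with every query, per day."""
    return queries * ctx_tokens / 1e6 * rate_per_mtok

def caching_savings(queries: int, ctx_tokens: int,
                    input_rate: float, cached_rate: float,
                    storage_rate_per_hour: float, hours: float) -> float:
    """Daily savings from caching: full-rate cost minus the sum of the
    discounted-rate cost and the per-hour storage charge for the cache."""
    without_cache = daily_context_cost(queries, ctx_tokens, input_rate)
    with_cache = (daily_context_cost(queries, ctx_tokens, cached_rate)
                  + ctx_tokens / 1e6 * storage_rate_per_hour * hours)
    return without_cache - with_cache

# Illustrative rates in Rs per million tokens (placeholders, not real pricing):
savings = caching_savings(queries=500, ctx_tokens=84_000,
                          input_rate=100.0, cached_rate=25.0,
                          storage_rate_per_hour=10.0, hours=4)
print(f"Estimated daily savings: Rs {savings:.2f}")
```

The structure makes the break-even visible: caching pays off once the discounted per-query cost plus storage undercuts resending the full context, which happens quickly at high query volume.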


 
  


  