In Browser
	StumbleUpon
	del.icio.us
	Google
	Google Buzz
	reddit
	LinkedIn

	Facebook
	Twitter
	Linkedin
	E-Mail

Generative AI > OpenAI API > OpenAI Prompt Caching - Reducing Costs on Repeated Context

OpenAI Prompt Caching - Reducing Costs on Repeated Context

Author: Venkata Sudhakar

OpenAI Prompt Caching automatically reduces the cost of API calls when the same large context prefix is sent repeatedly. When a prompt begins with a cached prefix, OpenAI charges a discounted rate (50 percent off for gpt-4o) for the cached tokens instead of full price. ShopMax India uses prompt caching to serve hundreds of daily customer queries against a large product catalogue system prompt, reducing monthly API costs significantly without any code changes.

Prompt caching is automatic on gpt-4o, gpt-4o-mini, and o-series models - no explicit configuration is needed. The cache is keyed on the exact text of the prompt prefix (minimum 1024 tokens). To benefit from caching, always place stable, reusable content at the start of the prompt (system instructions, product catalogue, policy documents) and append the variable user query at the end. The usage object in the response includes a prompt_tokens_details field with cached_tokens showing how many tokens were served from cache.

The below example shows ShopMax India sending multiple customer queries against a large product catalogue prompt and observing cache hits in the usage statistics.

import openai

client = openai.OpenAI(api_key="your-openai-api-key")

# Large stable system prompt - placed first so it gets cached
PRODUCT_CATALOGUE = """
You are a product advisor for ShopMax India, an electronics retailer.

PRODUCT CATALOGUE:
1. Samsung 65-inch LED TV UA65 - Rs 89,999 - 4K, HDR, Smart TV, 3 HDMI ports
2. OnePlus 13 - Rs 69,999 - 50MP camera, 5400mAh battery, 100W charging
3. Dell Inspiron 15 Laptop - Rs 58,499 - Intel i5, 16GB RAM, 512GB SSD
4. Sony WH-1000XM5 Headphones - Rs 29,990 - ANC, 30hr battery, multipoint
5. Apple iPad Air M2 - Rs 74,900 - 11-inch, 256GB, Wi-Fi + Cellular
6. LG 1.5-Ton AC - Rs 42,500 - 5-star, inverter, Wi-Fi control
7. Canon EOS R50 Camera - Rs 68,995 - 24MP, 4K video, kit lens included
8. Bosch 8kg Washing Machine - Rs 38,999 - front load, auto clean

POLICIES:
- Free delivery on orders above Rs 999 to Mumbai, Bangalore, Delhi, Hyderabad, Chennai
- 10-day return policy on all electronics with original packaging
- EMI available: 3/6/12 months via all major credit cards at 0 percent interest
- Warranty: 1 year manufacturer warranty on all products
"""

queries = [
    "What is the price of the Sony headphones and do they support noise cancellation?",
    "Is EMI available on the Samsung TV? What are the options?",
    "Which laptop do you have under Rs 60000 and what are the specs?"
]

for i, query in enumerate(queries, 1):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": PRODUCT_CATALOGUE},
            {"role": "user", "content": query}
        ]
    )
    usage = response.usage
    cached = usage.prompt_tokens_details.cached_tokens if usage.prompt_tokens_details else 0
    print(f"Query {i}: {query[:50]}...")
    print(f"  Answer: {response.choices[0].message.content[:100]}...")
    print(f"  Tokens: {usage.prompt_tokens} prompt ({cached} cached), {usage.completion_tokens} completion\n")

It gives the following output,

Query 1: What is the price of the Sony headphones and do they suppor...
  Answer: The Sony WH-1000XM5 Headphones are priced at Rs 29,990. Yes, they feature industry-leading Active Noise Cancellation...
  Tokens: 312 prompt (0 cached), 58 completion

Query 2: Is EMI available on the Samsung TV? What are the options?...
  Answer: Yes, EMI is available on the Samsung 65-inch LED TV (Rs 89,999). You can choose from 3, 6, or 12-month plans at 0 percent interest...
  Tokens: 312 prompt (256 cached), 62 completion

Query 3: Which laptop do you have under Rs 60000 and what are the sp...
  Answer: ShopMax India offers the Dell Inspiron 15 Laptop at Rs 58,499 - Intel i5 processor, 16GB RAM, and 512GB SSD...
  Tokens: 312 prompt (256 cached), 55 completion

To maximise cache hits, always structure prompts with the static prefix first and the dynamic user content last - even a single character change in the cached prefix invalidates it. Cache entries persist for 5-10 minutes of inactivity, so high-traffic applications will see near-100 percent cache hit rates. For very large static contexts (full product catalogues, legal documents), consider using a single shared system prompt across all users rather than personalising it, to maximise cache reuse. Monitor cached_tokens in production to measure actual savings - a 300-token system prompt with 1000 daily queries saves roughly 270,000 tokens at the 50 percent discount rate. Prompt caching also applies to multi-turn conversations where previous turns are re-sent as history.

Send your comments, suggestions or queries regarding this site to [email protected].