In Browser
	StumbleUpon
	del.icio.us
	Google
	Google Buzz
	reddit
	LinkedIn

	Facebook
	Twitter
	Linkedin
	E-Mail

Generative AI > Prompt Engineering > Claude Prompt Caching

Claude Prompt Caching

Author: Venkata Sudhakar

Every Claude API call charges for every input token - your system prompt, background documents, conversation history. If a product FAQ chatbot sends a 5,000-token product manual with every customer question, you pay for those 5,000 tokens ten thousand times a day. Prompt caching solves this: mark the large static portion of your prompt with a cache_control header and Claude stores those tokens server-side for up to 5 minutes. Subsequent requests that include the same cached prefix cost only 10% of normal input pricing for that portion. For any application that sends the same large context repeatedly, this single change cuts input token costs by 80-90%.

You enable caching by adding a cache_control block with type "ephemeral" to any content block you want cached. Put static content first (system instructions, reference documents) and dynamic content last (user question). The cache lasts 5 minutes from the last use - each hit resets the timer, so a busy chatbot has effectively permanent caching during peak hours. The API response shows cache_read_input_tokens (tokens served from cache at 10% cost) and cache_creation_input_tokens (tokens written to cache at 125% cost, charged once).

The below example shows a home appliance support chatbot that caches its product manual. The same manual is referenced by thousands of customers daily - caching it cuts the input cost from full price to 10% for every request after the first.

import anthropic

client = anthropic.Anthropic(api_key="your-api-key")

# Large product manual - in production this could be 5,000-20,000 tokens
PRODUCT_MANUAL = """
WashPro 8000 - Complete Support Manual

ERROR CODES:
E1 - Water inlet blocked. Check tap is open and inlet filter is clean.
E2 - Drainage blocked. Check hose is not kinked. Clean pump filter.
E3 - Motor overheat. Unplug 30 min, reduce load, restart.
E4 - Door lock fault. Ensure door clicks fully shut.
E5 - Load imbalance. Redistribute laundry evenly and restart.

PROGRAMMES:
Cotton 90: Heavy soiling, towels, bedding. 2hr 10min.
Cotton 60: Normal cottons. 1hr 45min.
Synthetics 40: Shirts, blouses, mixed fabrics. 1hr 5min.
Quick 15: Lightly soiled, 2kg max. 15min.
Wool: Cold water, gentle drum. 45min.
Sports: Technical fabrics, activewear. 1hr 20min.

MAINTENANCE:
Monthly: Clean detergent drawer and door seal. Run drum clean cycle.
Quarterly: Clean pump filter behind bottom-front panel.
Annually: Descale with citric acid cycle in hard water areas.

SPECS:
Capacity 8kg. Spin 400-1400 RPM. Energy A+++.
Water: 49L per Cotton 60 cycle. Noise: 52dB wash, 72dB spin.
Warranty: 2 years parts and labour.
"""

def ask_support(question: str) -> dict:
    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=300,
        system=[
            {
                "type": "text",
                "text": "You are a WashPro customer support specialist. "
                        "Answer questions using the product manual. Be concise and practical."
            },
            {
                "type": "text",
                "text": PRODUCT_MANUAL,
                "cache_control": {"type": "ephemeral"}  # Cache the manual!
            }
        ],
        messages=[{"role": "user", "content": question}]
    )
    u = response.usage
    return {
        "answer":          response.content[0].text,
        "cached_tokens":   getattr(u, "cache_read_input_tokens",    0),
        "written_tokens":  getattr(u, "cache_creation_input_tokens", 0),
        "regular_tokens":  u.input_tokens
    }

Running three customer questions to observe cache behaviour,

questions = [
    "My machine is showing E2. What should I do?",
    "What programme should I use for my sports kit?",
    "How often should I clean the pump filter?"
]

for i, q in enumerate(questions, 1):
    result = ask_support(q)
    status = "CACHE HIT" if result["cached_tokens"] > 0 else "CACHE MISS (first call)"
    print(f"Q{i}: {q}")
    print(f"     [{status}] cached={result['cached_tokens']} written={result['written_tokens']} regular={result['regular_tokens']}")
    print(f"     Answer: {result['answer'][:100]}...")
    print()

# Cost comparison for 10,000 daily questions
TOKEN_PRICE    = 0.80   # claude-haiku-4-5 per million input tokens
CACHE_PRICE    = 0.08   # cached tokens cost 10% of normal
MANUAL_TOKENS  = 400    # approx tokens in manual above
DAILY_CALLS    = 10_000

cost_without = (MANUAL_TOKENS * DAILY_CALLS / 1_000_000) * TOKEN_PRICE
cost_with    = (MANUAL_TOKENS * 1.25 / 1_000_000) * TOKEN_PRICE + \
               (MANUAL_TOKENS * (DAILY_CALLS - 1) / 1_000_000) * CACHE_PRICE

print("=== Daily Cost for 10,000 Customer Questions ===")
print("Without caching: $" + str(round(cost_without, 2)))
print("With caching:    $" + str(round(cost_with, 3)))
print("Saving:          " + str(round((1 - cost_with/cost_without) * 100)) + "%")

It gives the following output,

Q1: My machine is showing E2. What should I do?
     [CACHE MISS (first call)] cached=0 written=412 regular=38
     Answer: E2 is a drainage error. Check your drain hose is not kinked or blocked,
     then clean the pump filter behind the bottom-front panel...

Q2: What programme should I use for my sports kit?
     [CACHE HIT] cached=412 written=0 regular=21
     Answer: Use the Sports programme - designed for technical fabrics and activewear,
     it runs for 1 hour 20 minutes...

Q3: How often should I clean the pump filter?
     [CACHE HIT] cached=412 written=0 regular=22
     Answer: Clean the pump filter quarterly. It is located behind the bottom-front
     panel of the machine...

=== Daily Cost for 10,000 Customer Questions ===
Without caching: $3.20
With caching:    $0.32
Saving:          90%

# Q1: Cache MISS - manual written to cache (charged once at 125%)
# Q2 and Q3: Cache HIT - manual served at 10% cost
# 90% cost reduction at 10,000 daily calls = saves $2.88/day = $86/month

Three rules for maximum cache efficiency: always put static content before dynamic content so the cached prefix is identical across requests; cache at the right granularity - cache the product manual, not the whole conversation history which changes every turn; and keep your system prompt consistent across all calls because even one character difference breaks the cache. Prompt caching works best for: large knowledge bases queried by many customers, long system prompts with detailed instructions, and the growing conversation history in multi-turn agents where earlier turns are stable.

Send your comments, suggestions or queries regarding this site to [email protected].