Prompt Caching
Prompt caching lets API providers reuse the KV (key-value) cache for repeated prompt prefixes — slashing latency and cost for requests that share a large system prompt or context.
How It Works
Without caching, every request re-processes the entire prompt from scratch:
Request 1: [system: 5000 tokens][user: "Q1"] → process 5000+1 = 5001 tokens
Request 2: [system: 5000 tokens][user: "Q2"] → process 5000+1 = 5001 tokens ← waste!With caching, the repeated prefix is computed once and reused:
Request 1: [system: 5000 tokens][user: "Q1"] → compute 5001 tokens, cache 5000
Request 2: [CACHED: 5000 tokens][user: "Q2"] → compute 1 token + cache read ← 5000 tokens free1. Anthropic Prompt Caching
How to Use
Mark the parts of your prompt you want cached with "cache_control": {"type": "ephemeral"}.
pythonimport anthropic
client = anthropic.Anthropic()
# Large document you'll query many times
with open("large_codebase.txt") as f:
codebase = f.read() # ~50,000 tokens
def ask_about_code(question: str) -> str:
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
system=[
{
"type": "text",
"text": "You are an expert code reviewer. Analyze the codebase thoroughly.",
},
{
"type": "text",
"text": codebase,
"cache_control": {"type": "ephemeral"}, # ← cache this block
},
],
messages=[{"role": "user", "content": question}],
)
usage = response.usage
print(f"Input tokens: {usage.input_tokens}")
print(f"Cache write tokens: {usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {usage.cache_read_input_tokens}") # ← free (90% off)
# Cost: cache_write=full price, cache_read=0.1× price
return response.content[0].text
# First call: writes cache (full price for 50K tokens)
answer1 = ask_about_code("What are the main modules in this codebase?")
# Second call: reads from cache (10% price for 50K tokens)
answer2 = ask_about_code("Are there any security vulnerabilities?")Cache Rules
- Minimum cacheable block: 1024 tokens (shorter blocks are ignored)
- Cache TTL: 5 minutes (ephemeral) — resets with each cache hit
- Cache write cost: 1.25× normal input price
- Cache read cost: 0.1× normal input price (90% discount)
- Max cache blocks: 4 per request (place
cache_controlon up to 4 blocks)
Multi-Turn with Cached System Prompt
pythonconversation_history = []
SYSTEM_WITH_CACHE = [
{"type": "text", "text": "You are a helpful coding assistant."},
{
"type": "text",
"text": large_documentation,
"cache_control": {"type": "ephemeral"}, # cached across turns
},
]
def chat(user_message: str) -> str:
conversation_history.append({"role": "user", "content": user_message})
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
system=SYSTEM_WITH_CACHE, # same cached system every turn
messages=conversation_history,
)
assistant_message = response.content[0].text
conversation_history.append({"role": "assistant", "content": assistant_message})
return assistant_messageCache Multi-Turn History
For long conversations, cache the history too:
pythondef chat_with_cached_history(user_message: str, history: list) -> str:
# Mark older history for caching, keep latest turns uncached
messages = []
for i, msg in enumerate(history):
if i < len(history) - 4: # cache all but last 4 turns
messages.append({
"role": msg["role"],
"content": [
{"type": "text", "text": msg["content"],
"cache_control": {"type": "ephemeral"}}
],
})
else:
messages.append(msg)
messages.append({"role": "user", "content": user_message})
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
messages=messages,
)
return response.content[0].text2. OpenAI Prompt Caching
OpenAI caches automatically — no opt-in needed. Any prompt prefix ≥ 1024 tokens is eligible.
pythonfrom openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": large_system_prompt}, # ← auto-cached if ≥1024 tokens
{"role": "user", "content": "Question 1"},
],
)
# Check cache usage
usage = response.usage
print(f"Total input tokens: {usage.prompt_tokens}")
print(f"Cached tokens: {usage.prompt_tokens_details.cached_tokens}") # free
# Cached tokens: 0 cost — full discount (unlike Anthropic's 90%)OpenAI cache rules:
- Automatic — no opt-in, no
cache_controlheaders needed - Minimum: 1024 tokens prefix
- Cache TTL: 5–10 minutes of inactivity
- Cache read cost: $0 — completely free
- Cache write cost: normal pricing (no surcharge unlike Anthropic)
- Granularity: 128-token increments
3. Cost Comparison
Scenario: 100 requests/hour, each with 10K token system prompt + 200 token user message
Without caching (gpt-4o @ $2.50/1M input):
100 × 10,200 tokens = 1,020,000 tokens/hr
Cost: $2.55/hr = $61.20/dayWith OpenAI caching (first request full price, rest free for cached prefix):
First request: 10,200 tokens = $0.026
Next 99 requests: 200 tokens each = $0.005 total
Total: ~$0.031/hr = $0.74/day ← 98% savingWith Anthropic caching (claude-sonnet-4-6 @ $3/1M input, $0.30/1M cache read):
Write (once per 5 min, 12×/hr): 12 × 10,000 × $3.75/1M = $0.45/hr
Read (remaining 88 requests): 88 × 10,000 × $0.30/1M = $0.26/hr
User tokens: 100 × 200 × $3/1M = $0.06/hr
Total: $0.77/hr vs $30.60/hr without cache = 97.5% saving4. Best Practices
Structure your prompt for maximum cache hits
python# ✅ GOOD: Static, large content first — small dynamic content last
messages = [
{"role": "system", "content": static_system_prompt}, # large, never changes
# Cached by both OpenAI and Anthropic
{"role": "user", "content": large_doc_for_analysis}, # large, changes per doc
# Can be cached with Anthropic cache_control
{"role": "user", "content": user_question}, # small, unique per request
]
# ❌ BAD: Dynamic content interspersed — breaks cache prefix
messages = [
{"role": "system", "content": f"Today is {datetime.now()}. " + static_prompt}, # changes every second!
{"role": "user", "content": user_question},
]Don't put timestamps or request IDs in cacheable blocks
python# ❌ Breaks caching — unique per request
system = f"Request ID: {uuid4()}. You are a helpful assistant..."
# ✅ Put dynamic info in the user message only
system = "You are a helpful assistant..." # static → cached
user = f"[Request: {request_id}] {user_question}" # dynamic → not cachedWarm the cache before traffic spike
pythonimport asyncio
async def warm_cache(questions: list[str]):
"""Send a dummy request to pre-populate the cache before load hits."""
await ask_about_code("warm up cache") # populates cache for large system prompt
print("Cache warmed — subsequent requests will use cached prefix")
# Call this at server startup
asyncio.run(warm_cache([]))5. When Caching Doesn't Help
- Short system prompts (< 1024 tokens) — below the minimum threshold
- Unique prompts per request — if every request has a different document, cache miss rate is 100%
- Low request rate — if you get < 1 request per 5 minutes, cache expires before it's useful
- Streaming first token — caching reduces total cost but doesn't improve TTFT (first token latency) unless the cache hit eliminates GPU processing time
Interview Q&A
Q: What's the difference between semantic caching and prompt caching?
- Prompt caching (Anthropic/OpenAI): Provider-side KV cache for repeated prompt prefixes. Works at the token level — the exact prefix bytes must match. Zero code required for OpenAI;
cache_controlheader for Anthropic. - Semantic caching (GPTCache, your own): Application-side cache that stores past (query → answer) pairs and returns cached answers for semantically similar new queries. Doesn't reduce per-token cost but eliminates the LLM call entirely for near-duplicate questions.
Use both: prompt caching reduces cost on every request; semantic caching eliminates the LLM call entirely for repeated questions.