What is Prompt Caching?
Prompt caching lets you cache frequently reused content (system prompts, tool definitions, large documents) across API requests. Instead of re-processing the same tokens every time, Claude reads them from cache at 1/10th the cost. For applications with long system prompts or RAG context, this can reduce costs by 90%.
Answer-Ready: Anthropic's prompt caching reduces Claude API costs up to 90%. Cache system prompts, tool definitions, and documents across requests. Cached tokens cost 1/10th of input tokens. 5-minute TTL, auto-extended on hit. Essential for production AI applications with recurring context.
Best for: Teams running Claude API at scale with recurring system prompts. Works with: Claude Sonnet, Opus, Haiku via Anthropic API. Setup time: Add one field to existing code.
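The "one field" is `cache_control`. Here is a minimal sketch of what an existing request looks like with caching turned on — built as a plain dict (the kwargs you would pass to the Anthropic SDK's `client.messages.create`) so it runs offline; the model id and prompt text are illustrative:

```python
# The same request you already send, plus one cache_control field.
request = {
    "model": "claude-sonnet-4",  # illustrative model id; use your own
    "max_tokens": 1024,
    "system": [{
        "type": "text",
        "text": "Your long, stable system prompt...",  # must meet the minimum token count
        "cache_control": {"type": "ephemeral"},  # <- the one new field
    }],
    "messages": [{"role": "user", "content": "How do I reset my password?"}],
}

# Everything up to and including the block carrying cache_control is cached;
# content after it is processed fresh on every request.
assert request["system"][-1]["cache_control"] == {"type": "ephemeral"}
```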
How It Works
Pricing Impact
| Token Type | Cost (Sonnet) | Savings |
|---|---|---|
| Input (no cache) | $3/M tokens | Baseline |
| Cache write | $3.75/M tokens | -25% (one-time write premium) |
| Cache read | $0.30/M tokens | 90% savings |
| Output | $15/M tokens | No change |
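Given these rates, the one-time write premium is recovered on the very first cache hit. A small sketch of the break-even arithmetic, using the Sonnet rates from the table above:

```python
# Sonnet rates from the table above ($ per million tokens).
INPUT, WRITE, READ = 3.00, 3.75, 0.30

def cost(tokens: int, requests: int, cached: bool) -> float:
    """Dollar cost of sending the same prefix on every request."""
    m = tokens / 1_000_000
    if not cached:
        return requests * m * INPUT             # every request pays full price
    return m * (WRITE + (requests - 1) * READ)  # one write, then cheap reads

# The 25% write premium is recovered on the very first cache hit:
assert cost(4_000, 1, True) > cost(4_000, 1, False)  # single request: slightly more
assert cost(4_000, 2, True) < cost(4_000, 2, False)  # two requests: already cheaper
```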
Cache TTL
- Default: 5 minutes
- Auto-extended: Each cache hit resets the 5-minute timer
- Minimum cacheable: 1,024 tokens (Sonnet, Opus), 2,048 tokens (Haiku)
Cacheable Content
1. System Prompts
```python
system=[{
    "type": "text",
    "text": "Your 2000-token system prompt here...",
    "cache_control": {"type": "ephemeral"},
}]
```
2. Tool Definitions
```python
tools = [
    {"name": "search", "description": "...", "input_schema": {...},
     "cache_control": {"type": "ephemeral"}},
]
```
3. Long Documents (RAG Context)
```python
messages = [
    {"role": "user", "content": [
        {"type": "text", "text": "Here is the full codebase:\n" + large_document,
         "cache_control": {"type": "ephemeral"}},
        {"type": "text", "text": "Now answer: what does the auth module do?"},
    ]},
]
```
4. Multi-Turn with Cached Prefix
```python
# Turn 1: cache the document
msg1 = {"role": "user", "content": [
    {"type": "text", "text": document, "cache_control": {"type": "ephemeral"}},
    {"type": "text", "text": "Summarize this document"},
]}

# Turn 2: resend the conversation as usual; the document prefix is read from
# cache, so only the new tokens are billed at the full input rate
msg2 = {"role": "user", "content": "Now list the key findings"}
```
Cost Calculator
| Scenario | Without Cache | With Cache | Savings |
|---|---|---|---|
| 4K system prompt, 100 requests | $1.20 | $0.13 | 89% |
| 10K RAG context, 50 requests | $1.50 | $0.18 | 88% |
| 20K tools, 200 requests | $12.00 | $1.27 | 89% |
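These rows follow directly from the Sonnet rates above: one cache write plus N−1 cache reads, versus N full-price reads. A quick sanity check in plain arithmetic:

```python
def scenario(prefix_tokens: int, requests: int) -> tuple[float, float, int]:
    """Return (cost without cache, cost with cache, % savings) in dollars."""
    m = prefix_tokens / 1_000_000
    without = requests * m * 3.00                    # every request full price
    with_cache = m * (3.75 + (requests - 1) * 0.30)  # one write + N-1 reads
    return without, with_cache, round(100 * (1 - with_cache / without))

print(scenario(4_000, 100))   # ≈ (1.20, 0.13, 89)
print(scenario(10_000, 50))   # ≈ (1.50, 0.18, 88)
print(scenario(20_000, 200))  # ≈ (12.00, 1.27, 89)
```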
Best Practices
- Cache the longest, most stable content first — system prompts and tool definitions rarely change
- Order matters — cached content must form a prefix, so put cached content before dynamic content
- Monitor cache hits — check `usage.cache_creation_input_tokens` and `usage.cache_read_input_tokens` in the response
- Minimum size — content must be at least 1,024 tokens to be cacheable
- Keep cache warm — if requests arrive more than 5 minutes apart, the cache expires and the next request pays the write cost again
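The monitoring point above can be wired into simple logging. A minimal sketch, using the `usage` field names from the bullet list; the response object is faked with a dataclass here so the logic is self-contained and runnable offline:

```python
from dataclasses import dataclass

@dataclass
class Usage:  # stand-in for response.usage from the Anthropic SDK
    input_tokens: int
    cache_creation_input_tokens: int
    cache_read_input_tokens: int

def cache_hit_rate(usage: Usage) -> float:
    """Fraction of prompt tokens that were served from cache."""
    total = (usage.input_tokens
             + usage.cache_creation_input_tokens
             + usage.cache_read_input_tokens)
    return usage.cache_read_input_tokens / total if total else 0.0

# First request writes the cache...
first = Usage(input_tokens=50, cache_creation_input_tokens=4_000,
              cache_read_input_tokens=0)
# ...later requests read it.
later = Usage(input_tokens=60, cache_creation_input_tokens=0,
              cache_read_input_tokens=4_000)
assert cache_hit_rate(first) == 0.0
assert cache_hit_rate(later) > 0.98
```

A persistently low hit rate usually means the cached prefix is changing between requests, or requests are spaced beyond the 5-minute TTL.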
FAQ
Q: Does caching affect response quality? A: No. Caching is purely an optimization — the model sees identical input.
Q: Can I cache across different conversations? A: Yes, if the cached prefix is identical (same system prompt + tools), it hits the cache regardless of the user message.
Q: Does Claude Code use prompt caching? A: Yes, Claude Code automatically uses prompt caching for CLAUDE.md content, tool definitions, and conversation history.