# Anthropic Prompt Caching — Cut AI API Costs 90% > Use Anthropic's prompt caching to reduce Claude API costs by up to 90%. Cache system prompts, tool definitions, and long documents across requests for massive savings. ## Install Paste the prompt below into your AI tool: ## Quick Use ```python import anthropic client = anthropic.Anthropic() response = client.messages.create( model="claude-sonnet-4-20250514", max_tokens=1024, system=[{ "type": "text", "text": "You are an expert code reviewer... (long system prompt)", "cache_control": {"type": "ephemeral"}, # Cache this! }], messages=[{"role": "user", "content": "Review this code..."}], ) # First request: full price # Subsequent requests: 90% cheaper for cached portion ``` ## What is Prompt Caching? Prompt caching lets you cache frequently reused content (system prompts, tool definitions, large documents) across API requests. Instead of re-processing the same tokens every time, Claude reads them from cache at 1/10th the cost. For applications with long system prompts or RAG context, this can reduce costs by 90%. **Answer-Ready**: Anthropic's prompt caching reduces Claude API costs up to 90%. Cache system prompts, tool definitions, and documents across requests. Cached tokens cost 1/10th of input tokens. 5-minute TTL, auto-extended on hit. Essential for production AI applications with recurring context. **Best for**: Teams running Claude API at scale with recurring system prompts. **Works with**: Claude Sonnet, Opus, Haiku via Anthropic API. **Setup time**: Add one field to existing code. ## How It Works ### Pricing Impact | Token Type | Cost (Sonnet) | Savings | |-----------|---------------|---------| | Input (no cache) | $3/M tokens | Baseline | | Cache write | $3.75/M tokens | -25% first time | | Cache read | $0.30/M tokens | **90% savings** | | Output | $15/M tokens | No change | ### Cache TTL - Default: 5 minutes - Auto-extended: Each cache hit resets the 5-minute timer - Minimum cacheable: 1,024 tokens (Sonnet), 2,048 tokens (Opus) ## Cacheable Content ### 1. System Prompts ```python system=[{ "type": "text", "text": "Your 2000-token system prompt here...", "cache_control": {"type": "ephemeral"}, }] ``` ### 2. Tool Definitions ```python tools = [ {"name": "search", "description": "...", "input_schema": {...}, "cache_control": {"type": "ephemeral"}}, ] ``` ### 3. Long Documents (RAG Context) ```python messages = [ {"role": "user", "content": [ {"type": "text", "text": "Here is the full codebase:\ " + large_document, "cache_control": {"type": "ephemeral"}}, {"type": "text", "text": "Now answer: what does the auth module do?"}, ]}, ] ``` ### 4. Multi-Turn with Cached Prefix ```python # Turn 1: Cache the document msg1 = {"role": "user", "content": [ {"type": "text", "text": document, "cache_control": {"type": "ephemeral"}}, {"type": "text", "text": "Summarize this document"}, ]} # Turn 2: Document is already cached, only new question costs full price msg2 = {"role": "user", "content": "Now list the key findings"} ``` ## Cost Calculator | Scenario | Without Cache | With Cache | Savings | |----------|--------------|------------|---------| | 4K system prompt, 100 requests | $1.20 | $0.24 | 80% | | 10K RAG context, 50 requests | $1.50 | $0.19 | 87% | | 20K tools, 200 requests | $12.00 | $1.35 | 89% | ## Best Practices 1. **Cache the longest, most stable content first** — System prompts and tool definitions change rarely 2. **Order matters** — Cached content must be a prefix. Put cached content before dynamic content 3. **Monitor cache hits** — Check `usage.cache_creation_input_tokens` and `usage.cache_read_input_tokens` in response 4. **Minimum size** — Content must be >= 1,024 tokens to be cacheable 5. **Keep cache warm** — If requests are >5 minutes apart, cache expires ## FAQ **Q: Does caching affect response quality?** A: No. Caching is purely an optimization — the model sees identical input. **Q: Can I cache across different conversations?** A: Yes, if the cached prefix is identical (same system prompt + tools), it hits the cache regardless of the user message. **Q: Does Claude Code use prompt caching?** A: Yes, Claude Code automatically uses prompt caching for CLAUDE.md content, tool definitions, and conversation history. ## Source & Thanks > - [Anthropic Prompt Caching Docs](https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching) > - [Pricing page](https://anthropic.com/pricing) ## Quick Use Add a `cache_control` field to the system prompt — subsequent requests save 90% on cost. ## What is Prompt Caching? Anthropic's prompt caching caches repeated content (system prompts, tool definitions, long documents) for reuse across requests; cache reads cost only 1/10. **TL;DR**: Anthropic prompt caching. Cache system prompts / tools / documents. Read cost is 1/10. 5-minute TTL auto-renews. Must-use for production Claude apps. Up to 90% savings. ## Cacheable Content 1. System prompts — most common case 2. Tool definitions — big wins with many tools 3. RAG documents — multi-turn Q&A over the same doc 4. Multi-turn conversation prefix — cache early context ## Best Practices - Cache the longest, most stable content first - Cached content must be a prefix - Monitor cache_read_input_tokens to confirm hits - Minimum 1024 tokens ## FAQ **Q: Does it affect quality?** A: No — the model sees identical input either way. **Q: Does Claude Code use it?** A: Yes — CLAUDE.md and tool definitions are cached automatically. ## Source & Thanks > [Anthropic Prompt Caching](https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching) --- Source: https://tokrepo.com/en/workflows/anthropic-prompt-caching-cut-ai-api-costs-90-ed25d3cb Author: Anthropic