# Anthropic Prompt Caching — Cut AI API Costs 90%

> Use Anthropic's prompt caching to reduce Claude API costs by up to 90%. Cache system prompts, tool definitions, and long documents across requests for massive savings.

## Quick Use

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": "You are an expert code reviewer... (long system prompt)",
        "cache_control": {"type": "ephemeral"},  # Cache this!
    }],
    messages=[{"role": "user", "content": "Review this code..."}],
)
# First request: full price
# Subsequent requests: 90% cheaper for the cached portion
```

## What is Prompt Caching?

Prompt caching lets you cache frequently reused content (system prompts, tool definitions, large documents) across API requests. Instead of reprocessing the same tokens on every request, Claude reads them from cache at 1/10th the cost. For applications with long system prompts or RAG context, this can reduce costs by up to 90%.

**Answer-Ready**: Anthropic's prompt caching reduces Claude API costs by up to 90%. Cache system prompts, tool definitions, and documents across requests. Cached tokens cost 1/10th of regular input tokens. 5-minute TTL, auto-extended on each hit. Essential for production AI applications with recurring context.

**Best for**: Teams running the Claude API at scale with recurring system prompts.

**Works with**: Claude Sonnet, Opus, and Haiku via the Anthropic API.

**Setup time**: Add one field to existing code.
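The headline figures follow directly from the per-token rates. A minimal sketch of the arithmetic, assuming Sonnet pricing ($3/M uncached input, $3.75/M cache write, $0.30/M cache read) and a hypothetical 2,000-token system prompt reused across 100 requests:

```python
# Illustrative cost arithmetic for caching a 2,000-token system prompt
# across 100 requests, using Sonnet per-million-token rates.
INPUT_RATE = 3.00        # $/M tokens, uncached input
CACHE_WRITE_RATE = 3.75  # $/M tokens, first write (25% premium)
CACHE_READ_RATE = 0.30   # $/M tokens, cache hit (1/10th of input)

def cost_without_cache(prompt_tokens: int, requests: int) -> float:
    """Every request pays the full input rate for the prompt."""
    return prompt_tokens * requests * INPUT_RATE / 1_000_000

def cost_with_cache(prompt_tokens: int, requests: int) -> float:
    """The first request writes the cache; the rest read from it."""
    write = prompt_tokens * CACHE_WRITE_RATE / 1_000_000
    reads = prompt_tokens * (requests - 1) * CACHE_READ_RATE / 1_000_000
    return write + reads

baseline = cost_without_cache(2_000, 100)  # $0.60
cached = cost_with_cache(2_000, 100)       # ~$0.067
savings = 1 - cached / baseline            # ~89%
print(f"baseline=${baseline:.3f} cached=${cached:.3f} savings={savings:.0%}")
```

The savings land just under 90% because the one-time write premium is amortized over many cheap reads; the more requests reuse the prefix, the closer you get to the full 90%.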
## How It Works

### Pricing Impact

| Token Type | Cost (Sonnet) | Savings |
|------------------|----------------|--------------------------------|
| Input (no cache) | $3/M tokens | Baseline |
| Cache write | $3.75/M tokens | +25% premium, first write only |
| Cache read | $0.30/M tokens | **90% savings** |
| Output | $15/M tokens | No change |

### Cache TTL

- Default: 5 minutes
- Auto-extended: each cache hit resets the 5-minute timer
- Minimum cacheable: 1,024 tokens (Sonnet and Opus), 2,048 tokens (Haiku)

## Cacheable Content

### 1. System Prompts

```python
system=[{
    "type": "text",
    "text": "Your 2000-token system prompt here...",
    "cache_control": {"type": "ephemeral"},
}]
```

### 2. Tool Definitions

```python
tools = [
    {
        "name": "search",
        "description": "...",
        "input_schema": {...},
        "cache_control": {"type": "ephemeral"},
    },
]
```

### 3. Long Documents (RAG Context)

```python
messages = [
    {"role": "user", "content": [
        {"type": "text",
         "text": "Here is the full codebase:\n" + large_document,
         "cache_control": {"type": "ephemeral"}},
        {"type": "text", "text": "Now answer: what does the auth module do?"},
    ]},
]
```

### 4. Multi-Turn with Cached Prefix

```python
# Turn 1: cache the document
msg1 = {"role": "user", "content": [
    {"type": "text", "text": document, "cache_control": {"type": "ephemeral"}},
    {"type": "text", "text": "Summarize this document"},
]}

# Turn 2: the document prefix is read from cache;
# only the new question is billed at the full input rate
msg2 = {"role": "user", "content": "Now list the key findings"}
```

## Cost Calculator

| Scenario | Without Cache | With Cache | Savings |
|--------------------------------|---------------|------------|---------|
| 4K system prompt, 100 requests | $1.20 | $0.24 | 80% |
| 10K RAG context, 50 requests | $1.50 | $0.19 | 87% |
| 20K tools, 200 requests | $12.00 | $1.35 | 89% |

## Best Practices

1. **Cache the longest, most stable content first** — System prompts and tool definitions change rarely
2. **Order matters** — Cached content must be a prefix; put cached content before dynamic content
3. **Monitor cache hits** — Check `usage.cache_creation_input_tokens` and `usage.cache_read_input_tokens` in the response
4. **Minimum size** — Content must be at least 1,024 tokens to be cacheable
5. **Keep cache warm** — If requests arrive more than 5 minutes apart, the cache expires

## FAQ

**Q: Does caching affect response quality?**

A: No. Caching is purely an optimization — the model sees identical input.

**Q: Can I cache across different conversations?**

A: Yes. If the cached prefix is identical (same system prompt and tools), it hits the cache regardless of the user message.

**Q: Does Claude Code use prompt caching?**

A: Yes. Claude Code automatically uses prompt caching for CLAUDE.md content, tool definitions, and conversation history.

## Source & Thanks

> - [Anthropic Prompt Caching Docs](https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching)
> - [Pricing page](https://anthropic.com/pricing)

---

Source: https://tokrepo.com/en/workflows/ed25d3cb-413d-40e5-8c36-a063e8a5ca99
Author: Prompt Lab