Anthropic Prompt Caching — Cut AI API Costs 90%
Use Anthropic's prompt caching to reduce Claude API costs by up to 90%. Cache system prompts, tool definitions, and long documents across requests for massive savings.
Review-first install path
This asset needs a review step. The copied prompt tells the agent to dry-run, show the writes, then proceed only after confirmation.
npx -y tokrepo@latest install ed25d3cb-413d-40e5-8c36-a063e8a5ca99 --target codex
Dry-run first, confirm the writes, then run this command.
What it is
Anthropic prompt caching is an API feature that lets you cache frequently reused content (system prompts, tool definitions, long documents) across multiple Claude API requests. Cached tokens are read at a fraction of the cost of uncached input tokens, reducing total API spending significantly for applications that reuse the same context.
This feature targets developers building applications that send the same system prompt, tool definitions, or reference documents with every request. Chatbots, code assistants, and RAG pipelines benefit the most because they repeat large context blocks across conversations.
How it saves time or tokens
Without caching, every API request processes the full system prompt and context from scratch. With caching, the first request pays the regular input price plus a 25% cache write premium, and later requests that arrive before the cache expires read the cached tokens at about a 90% discount. For a 10,000-token system prompt sent across 100 requests, the uncached cost is 1,000,000 input tokens. With caching, you pay one write at 1.25x (12,500 token-equivalents) plus 99 reads at 0.1x (99,000 token-equivalents), about 111,500 in total, or roughly 89% less. Cached content also reduces latency because the model does not need to reprocess it.
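To estimate savings for your own workload, the arithmetic can be sketched in a few lines of Python. This is a rough estimate, not a billing tool: the multipliers are the ones quoted in this article (writes at 1.25x, reads at 0.1x), the function name is made up for illustration, and it assumes every follow-up request arrives before the cache entry expires.

```python
# Multipliers relative to the regular input-token price, as quoted in this
# article: cache writes cost 25% more, cache reads cost about 90% less.
WRITE_MULTIPLIER = 1.25
READ_MULTIPLIER = 0.10


def billed_token_equivalents(prompt_tokens: int, requests: int, cached: bool) -> float:
    """Return the prompt cost in units of regular input tokens."""
    if requests <= 0:
        return 0.0
    if not cached:
        return float(prompt_tokens * requests)
    # One cache write, then (requests - 1) cache reads. Assumes no request
    # arrives after the cache entry has expired.
    write_cost = prompt_tokens * WRITE_MULTIPLIER
    read_cost = prompt_tokens * READ_MULTIPLIER * (requests - 1)
    return write_cost + read_cost


uncached = billed_token_equivalents(10_000, 100, cached=False)
cached = billed_token_equivalents(10_000, 100, cached=True)
savings = 1 - cached / uncached

print(f'uncached: {uncached:,.0f} token-equivalents')
print(f'cached:   {cached:,.0f} token-equivalents')
print(f'savings:  {savings:.1%}')
```

With these multipliers, caching already comes out ahead on the second request, since one write plus one read (1.35x) costs less than two uncached requests (2x).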
How to use
- Add cache_control to content blocks you want cached:
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model='claude-sonnet-4-20250514',
    max_tokens=1024,
    system=[{
        'type': 'text',
        'text': 'You are an expert code reviewer... (long system prompt)',
        'cache_control': {'type': 'ephemeral'}
    }],
    messages=[{'role': 'user', 'content': 'Review this function.'}]
)
- Cache tool definitions:
# Reuses the client created in the previous step.
response = client.messages.create(
    model='claude-sonnet-4-20250514',
    max_tokens=1024,
    tools=[{
        'name': 'search_codebase',
        'description': 'Search the codebase for patterns',
        'input_schema': {'type': 'object', 'properties': {'query': {'type': 'string'}}},
        # With several tools, put cache_control on the last one to cache them all.
        'cache_control': {'type': 'ephemeral'}
    }],
    messages=[{'role': 'user', 'content': 'Find all TODO comments.'}]
)
- Check cache usage in the response:
print(response.usage.cache_creation_input_tokens)  # Tokens written to the cache by this request
print(response.usage.cache_read_input_tokens)  # Tokens read from the cache by this request
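To make those two counters easier to read in logs, a small helper can classify each response. This is only a sketch: describe_cache_usage is a hypothetical name, and the SimpleNamespace objects stand in for response.usage so the snippet runs without an API call. The token counts are made-up sample values.

```python
from types import SimpleNamespace


def describe_cache_usage(usage) -> str:
    """Classify a response's usage as a cache hit, a cache write, or neither."""
    # getattr with a default keeps the sketch tolerant of missing or None fields.
    written = getattr(usage, 'cache_creation_input_tokens', 0) or 0
    read = getattr(usage, 'cache_read_input_tokens', 0) or 0
    if read > 0:
        return f'cache hit: {read} tokens read from cache'
    if written > 0:
        return f'cache write: {written} tokens stored'
    return 'no caching: content may be below the minimum token threshold'


# Stand-ins for response.usage (sample values, not real API output).
first_call = SimpleNamespace(cache_creation_input_tokens=2048, cache_read_input_tokens=0)
second_call = SimpleNamespace(cache_creation_input_tokens=0, cache_read_input_tokens=2048)

print(describe_cache_usage(first_call))   # cache write: 2048 tokens stored
print(describe_cache_usage(second_call))  # cache hit: 2048 tokens read from cache
```

In a real application you would pass response.usage directly, for example describe_cache_usage(response.usage).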
Example
# Caching a long document for RAG
import anthropic

client = anthropic.Anthropic()

# Read the reference document once; it becomes the cached prefix.
with open('docs/api-reference.md', encoding='utf-8') as f:
    long_document = f.read()

response = client.messages.create(
    model='claude-sonnet-4-20250514',
    max_tokens=2048,
    system=[{
        'type': 'text',
        'text': f'Reference document:\n\n{long_document}',
        'cache_control': {'type': 'ephemeral'}
    }],
    messages=[{'role': 'user', 'content': 'What authentication methods does the API support?'}]
)
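The savings in this pattern come from asking several questions against the same document, and a cache hit requires the cached prefix to match exactly between requests. One way to keep it stable is to build every request from a single function. The sketch below assumes a hypothetical helper named build_request, a placeholder document string, and an invented second question; it only constructs the arguments and does not call the API.

```python
def build_request(document: str, question: str) -> dict:
    """Build keyword arguments for client.messages.create().

    The system block is the same for every question, so it can be served
    from the cache; only the user message changes.
    """
    return {
        'model': 'claude-sonnet-4-20250514',
        'max_tokens': 2048,
        'system': [{
            'type': 'text',
            'text': f'Reference document:\n\n{document}',
            'cache_control': {'type': 'ephemeral'},
        }],
        'messages': [{'role': 'user', 'content': question}],
    }


document = '(contents of docs/api-reference.md)'  # placeholder text
questions = [
    'What authentication methods does the API support?',
    'What are the rate limits?',  # hypothetical follow-up question
]
request_batch = [build_request(document, q) for q in questions]

# The cached prefix is identical across requests; only the question differs.
print(request_batch[0]['system'] == request_batch[1]['system'])
print(request_batch[0]['messages'] != request_batch[1]['messages'])
```

Each entry can then be sent with client.messages.create(**request_batch[i]). Anything that changes the system text between calls, such as an embedded timestamp, would produce a new cache write instead of a read.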
Related on TokRepo
- AI Tools for API — Tools for working with AI APIs efficiently
- Prompt Library — Reusable prompts and templates
Common pitfalls
- Caching has a minimum token threshold (currently 1024 tokens for Claude Sonnet); content below this threshold is processed without caching and no error is raised, so check the usage fields to confirm that a cache write actually happened.
- Cached content expires after a TTL (time-to-live) period; for ephemeral caching, the cache lasts approximately 5 minutes of inactivity. Plan your request frequency accordingly.
- Cache write tokens cost 25% more than regular input tokens; caching only saves money when the same content is reused across multiple requests.
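The interaction between request frequency and the TTL can be explored with a small simulation. This is a simplified model, not the service's actual behavior: simulate_cache is a hypothetical function, it assumes a 300-second TTL that restarts on every hit, and it treats any request arriving at or after expiry as a fresh cache write.

```python
def simulate_cache(request_times: list[float], ttl_seconds: float = 300.0) -> tuple[int, int]:
    """Return (writes, reads) for requests sent at the given times in seconds."""
    writes = 0
    reads = 0
    expires_at = None  # no cache entry exists yet
    for t in sorted(request_times):
        if expires_at is not None and t < expires_at:
            reads += 1   # entry still alive: billed as a cache read
        else:
            writes += 1  # missing or expired entry: billed as a cache write
        expires_at = t + ttl_seconds  # both hits and writes restart the TTL
    return writes, reads


# Requests every 4 minutes keep the cache warm.
print(simulate_cache([0, 240, 480, 720]))  # (1, 3)

# A 10-minute gap lets the entry expire and forces a second write.
print(simulate_cache([0, 240, 840, 900]))  # (2, 2)
```

Feeding in timestamps from your own request logs gives a rough idea of how many requests would pay the write premium rather than the read discount.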
Frequently Asked Questions
How much can prompt caching save?
Cache read tokens cost approximately 90% less than regular input tokens. For applications that reuse the same system prompt or context across many requests, this translates to significant cost reduction. The exact savings depend on your cache hit rate and the size of cached content.
What content can be cached?
You can cache system prompts, tool definitions, and content within message blocks. Add a cache_control field with type 'ephemeral' to any content block you want cached. The content must meet the minimum token threshold.
How long does the cache last?
Ephemeral caches last approximately 5 minutes of inactivity. Each cache hit refreshes the TTL. If no requests use the cached content within the TTL window, it expires and the next request incurs a cache write fee.
Does caching affect response quality?
No. Caching only affects how input tokens are processed and billed. The model produces identical responses whether content is cached or not. Caching is a performance and cost optimization, not a quality tradeoff.
Which models support prompt caching?
Prompt caching is available on Claude Sonnet, Claude Opus, and Claude Haiku via the Anthropic API. Check the Anthropic documentation for the latest model support and minimum token thresholds.
Related Assets
Anthropic Prompt Engineering Guide — Official Best Practices
Official prompting guide from Anthropic for Claude. System prompts, chain-of-thought, few-shot, XML tags, tool use, and advanced techniques.
Anthropic Cookbook — Official Claude Recipes
Official collection of notebooks and recipes for building with Claude. Prompt engineering, tool use, RAG, agents, multimodal, and enterprise patterns. By Anthropic. 37K+ stars.
api-relay-audit — Audit AI API Relays for Prompt Attacks
Local 13-step audit for AI API relays/proxies: injection/leakage, context truncation, tool rewriting; verified 419★, pushed 2026-05-11.
AI Prompt Engineering Best Practices Guide
Comprehensive guide to writing effective prompts for Claude, GPT, and Gemini. Covers system prompts, few-shot learning, chain-of-thought, and structured output techniques.