Anthropic Prompt Caching — Cut AI API Costs by Up to 90%
Use Anthropic's prompt caching to reduce Claude API costs by up to 90%. Cache system prompts, tool definitions, and long documents across requests for massive savings.
What it is
Anthropic prompt caching is an API feature that lets you cache frequently reused content (system prompts, tool definitions, long documents) across multiple Claude API requests. Cached tokens are read at a fraction of the cost of uncached input tokens, reducing total API spending significantly for applications that reuse the same context.
This feature targets developers building applications that send the same system prompt, tool definitions, or reference documents with every request. Chatbots, code assistants, and RAG pipelines benefit the most because they repeat large context blocks across conversations.
How it saves time or tokens
Without caching, every API request processes the full system prompt and context from scratch. With caching, the first request pays full price plus a 25% cache write premium, and every subsequent request reads the cached tokens at a 90% discount. For a 10,000-token system prompt sent across 100 requests, that means one cache write of 10,000 tokens plus 99 discounted reads, instead of 1,000,000 full-price input tokens, roughly an 89% saving. Cached content also reduces latency because the model does not need to reprocess it.
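To make the arithmetic concrete, here is a back-of-the-envelope calculation using the published multipliers (cache writes at 1.25x the base input rate, cache reads at 0.1x). The $3-per-million-token base price is illustrative only; check the current pricing page for your model.
# Rough cost comparison for a 10,000-token system prompt over 100 requests
# Assumption: $3.00 per million input tokens (illustrative base rate)
BASE_RATE = 3.00 / 1_000_000  # dollars per input token
PROMPT_TOKENS = 10_000
REQUESTS = 100

without_cache = REQUESTS * PROMPT_TOKENS * BASE_RATE
with_cache = (PROMPT_TOKENS * BASE_RATE * 1.25                      # one cache write
              + (REQUESTS - 1) * PROMPT_TOKENS * BASE_RATE * 0.10)  # 99 cache reads

print(f'Without caching: ${without_cache:.2f}')          # $3.00
print(f'With caching:    ${with_cache:.2f}')             # $0.33
print(f'Savings: {1 - with_cache / without_cache:.0%}')  # 89%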
How to use
- Add cache_control to content blocks you want cached:
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model='claude-sonnet-4-20250514',
    max_tokens=1024,
    system=[{
        'type': 'text',
        'text': 'You are an expert code reviewer... (long system prompt)',
        # Mark this block for caching; it must meet the minimum token threshold
        'cache_control': {'type': 'ephemeral'}
    }],
    messages=[{'role': 'user', 'content': 'Review this function.'}]
)
- Cache tool definitions (the cache covers the full request prefix up to the marked block, so placing cache_control on the last tool caches all tool definitions):
response = client.messages.create(
    model='claude-sonnet-4-20250514',
    max_tokens=1024,
    tools=[{
        'name': 'search_codebase',
        'description': 'Search the codebase for patterns',
        'input_schema': {'type': 'object', 'properties': {'query': {'type': 'string'}}},
        # The cached prefix runs up to this block, so marking the last tool
        # caches every tool definition before it as well
        'cache_control': {'type': 'ephemeral'}
    }],
    messages=[{'role': 'user', 'content': 'Find all TODO comments.'}]
)
- Check cache usage in the response:
print(response.usage.cache_creation_input_tokens) # Tokens cached on first call
print(response.usage.cache_read_input_tokens) # Tokens read from cache
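To confirm caching works end to end, send the same request twice and compare these usage fields. This is a minimal sketch, assuming both calls land within the cache TTL, the system prompt clears the minimum token threshold, and client is the Anthropic client from the snippets above.
# The first call should report cache_creation_input_tokens > 0,
# the second cache_read_input_tokens > 0
long_system = [{
    'type': 'text',
    'text': 'You are an expert code reviewer... (long system prompt)',
    'cache_control': {'type': 'ephemeral'}
}]
for label in ('first call', 'second call'):
    response = client.messages.create(
        model='claude-sonnet-4-20250514',
        max_tokens=1024,
        system=long_system,
        messages=[{'role': 'user', 'content': 'Review this function.'}]
    )
    print(label,
          '| cached:', response.usage.cache_creation_input_tokens,
          '| read:', response.usage.cache_read_input_tokens)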
Example
# Caching a long document for RAG
import anthropic

client = anthropic.Anthropic()

# Read the reference document once, closing the file handle promptly
with open('docs/api-reference.md') as f:
    long_document = f.read()

response = client.messages.create(
    model='claude-sonnet-4-20250514',
    max_tokens=2048,
    system=[{
        'type': 'text',
        'text': f'Reference document:\n\n{long_document}',
        'cache_control': {'type': 'ephemeral'}
    }],
    messages=[{'role': 'user', 'content': 'What authentication methods does the API support?'}]
)
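Since each cache hit refreshes the TTL, the same cached document can serve a whole series of questions. A sketch under that assumption, reusing client and long_document from the example above; note that the system block must be identical across requests for a cache hit:
# Ask several questions against the same cached document
questions = [
    'What authentication methods does the API support?',
    'What rate limits apply?',
    'How are errors reported?',
]
for question in questions:
    response = client.messages.create(
        model='claude-sonnet-4-20250514',
        max_tokens=2048,
        system=[{
            'type': 'text',
            'text': f'Reference document:\n\n{long_document}',  # identical every time
            'cache_control': {'type': 'ephemeral'}
        }],
        messages=[{'role': 'user', 'content': question}]
    )
    print(question, '->', response.usage.cache_read_input_tokens, 'tokens read from cache')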
Related on TokRepo
- AI Tools for API — Tools for working with AI APIs efficiently
- Prompt Library — Reusable prompts and templates
Common pitfalls
- Cache has a minimum token threshold (currently 1024 tokens for Claude Sonnet); content blocks smaller than this threshold will not be cached.
- Cached content expires after a TTL (time-to-live) period; for ephemeral caching, the cache lasts approximately 5 minutes of inactivity. Plan your request frequency accordingly.
- Cache write tokens cost 25% more than regular input tokens; caching only saves money when the same content is reused across multiple requests (see the break-even sketch below).
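The break-even point follows directly from those multipliers. Treating the base input price as 1.0 per token, N uncached requests cost N, while caching costs 1.25 for the write plus 0.1 per reuse, so caching wins as soon as the content is reused once:
# Break-even sketch: caching pays off from the second request onward
for n in (1, 2, 5, 100):
    without = n * 1.0
    cached = 1.25 + (n - 1) * 0.10
    print(f'N={n:>3}: uncached {without:6.2f} vs cached {cached:6.2f}')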
Frequently Asked Questions
How much does prompt caching save?
Cache read tokens cost approximately 90% less than regular input tokens. For applications that reuse the same system prompt or context across many requests, this translates to significant cost reduction. The exact savings depend on your cache hit rate and the size of cached content.
What content can be cached?
You can cache system prompts, tool definitions, and content within message blocks. Add a cache_control field with type 'ephemeral' to any content block you want cached. The content must meet the minimum token threshold.
How long does the cache last?
Ephemeral caches last approximately 5 minutes of inactivity. Each cache hit refreshes the TTL. If no requests use the cached content within the TTL window, it expires and the next request incurs a cache write fee.
Does caching affect response quality?
No. Caching only affects how input tokens are processed and billed. The model produces identical responses whether content is cached or not. Caching is a performance and cost optimization, not a quality tradeoff.
Which models support prompt caching?
Prompt caching is available on Claude Sonnet, Claude Opus, and Claude Haiku via the Anthropic API. Check the Anthropic documentation for the latest model support and minimum token thresholds.
Citations (3)
- Anthropic Prompt Caching Documentation — Anthropic prompt caching reduces input token costs by up to 90%
- Anthropic API Reference — Cache control is set via the cache_control field in content blocks
- Anthropic Pricing — Claude API models and pricing