# Anthropic Prompt Caching — Cut AI API Costs 90%

> Use Anthropic's prompt caching to cut the cost of repeated input tokens on the Claude API by up to 90%. Cache system prompts, tool definitions, and long documents so that later requests read them at a fraction of the normal input price.

## Quick Use

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": "You are an expert code reviewer... (long system prompt)",
        "cache_control": {"type": "ephemeral"},  # Cache this!
    }],
    messages=[{"role": "user", "content": "Review this code..."}],
)
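# First request: the cached portion is written to cache (25% above the base input price)
# Later requests within the TTL: the cached portion is read at about 10% of the base input price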
```

## What is Prompt Caching?

Prompt caching lets you cache frequently reused content (system prompts, tool definitions, large documents) across API requests. Instead of re-processing the same tokens on every request, Claude reads them from cache at one-tenth of the base input price. For applications dominated by a long system prompt or RAG context, this can reduce input costs by up to 90%; output tokens are billed as usual.

**Answer-Ready**: Anthropic's prompt caching reduces Claude API input costs by up to 90%. Cache system prompts, tool definitions, and documents across requests. Cache reads cost 1/10th of the base input price; the first cache write costs 25% more. 5-minute TTL, auto-extended on each hit. Essential for production AI applications with recurring context.

**Best for**: Teams running Claude API at scale with recurring system prompts. **Works with**: Claude Sonnet, Opus, Haiku via Anthropic API. **Setup time**: Add one field to existing code.

## How It Works

### Pricing Impact

| Token Type | Cost (Sonnet) | Savings |
|-----------|---------------|---------|
| Input (no cache) | $3/M tokens | Baseline |
| Cache write | $3.75/M tokens | 25% more than baseline (first request only) |
| Cache read | $0.30/M tokens | **90% savings** |
| Output | $15/M tokens | No change |

### Cache TTL
- Default: 5 minutes
- Auto-extended: Each cache hit resets the 5-minute timer
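- Minimum cacheable length: model-dependent. 1,024 tokens for Claude Sonnet 4 (the model used above); some models, such as Claude Haiku 3.5, require 2,048. Check the caching docs for your model.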
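
The sliding expiry described in the list above can be sketched with a small simulation. This is an illustrative model only, not Anthropic's implementation: `classify_requests` and its parameters are hypothetical names, every request is assumed to carry the same cacheable prefix, and a request arriving exactly at the expiry time is treated as a miss.

```python
TTL_SECONDS = 5 * 60  # default cache lifetime from the list above


def classify_requests(timestamps, ttl=TTL_SECONDS):
    """Label each request time (in seconds) as a cache 'write' or 'read'.

    Simplified model: a request is a read if it arrives before the
    current entry expires. Every request, hit or miss, restarts the timer.
    """
    labels = []
    expires_at = None  # no cache entry exists yet
    for t in sorted(timestamps):
        if expires_at is not None and t < expires_at:
            labels.append("read")
        else:
            labels.append("write")
        expires_at = t + ttl  # a write creates the entry, a read refreshes it
    return labels


print(classify_requests([0, 120, 400, 800]))
# ['write', 'read', 'read', 'write']
```

The request at 400 seconds still hits the cache even though more than 5 minutes have passed since the first write, because the hit at 120 seconds reset the timer. The request at 800 seconds arrives after the refreshed entry expired and pays the write price again.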

## Cacheable Content

### 1. System Prompts

```python
system=[{
    "type": "text",
    "text": "Your 2000-token system prompt here...",
    "cache_control": {"type": "ephemeral"},
}]
```

### 2. Tool Definitions

```python
tools = [
    {"name": "search", "description": "...", "input_schema": {...},
     "cache_control": {"type": "ephemeral"}},
]
```
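
Because caching works on prefixes, a single breakpoint on the last tool is generally enough to cover the whole `tools` array. The helper below is a minimal sketch of that idea; `mark_last_tool_cacheable` is a hypothetical name, and the two tool definitions are placeholders rather than real tools.

```python
import copy


def mark_last_tool_cacheable(tools):
    """Return a copy of `tools` with a cache breakpoint on the final entry."""
    if not tools:
        return []
    marked = copy.deepcopy(tools)  # leave the caller's list untouched
    marked[-1]["cache_control"] = {"type": "ephemeral"}
    return marked


# Placeholder tool definitions, for illustration only.
tools = [
    {"name": "search", "description": "Search the docs",
     "input_schema": {"type": "object", "properties": {}}},
    {"name": "fetch", "description": "Fetch a URL",
     "input_schema": {"type": "object", "properties": {}}},
]
cached_tools = mark_last_tool_cacheable(tools)
```

`cached_tools` can then be passed as the `tools` argument of `client.messages.create`. Keeping the tool order stable between requests matters, since any change to the prefix produces a new cache entry.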

### 3. Long Documents (RAG Context)

```python
messages = [
    {"role": "user", "content": [
        {"type": "text", "text": "Here is the full codebase:\n" + large_document,
         "cache_control": {"type": "ephemeral"}},
        {"type": "text", "text": "Now answer: what does the auth module do?"},
    ]},
]
```

### 4. Multi-Turn with Cached Prefix

```python
# Turn 1: Cache the document
msg1 = {"role": "user", "content": [
    {"type": "text", "text": document, "cache_control": {"type": "ephemeral"}},
    {"type": "text", "text": "Summarize this document"},
]}

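# Turn 2: resend msg1 and the assistant's reply ahead of msg2. The document
# prefix is then read from cache, and only the new tokens are billed at the
# regular input rate.
msg2 = {"role": "user", "content": "Now list the key findings"}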
```
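
The Messages API is stateless, so the second request has to carry the first turn again for the cached prefix to match. A minimal sketch of assembling that request follows; `build_turn` is a hypothetical helper, and the document text and assistant reply are placeholders.

```python
def build_turn(history, assistant_reply, next_question):
    """Return a new message list: history, previous reply, next question."""
    return history + [
        {"role": "assistant", "content": assistant_reply},
        {"role": "user", "content": next_question},
    ]


document = "..."  # placeholder for the long document text

turn1 = [{"role": "user", "content": [
    {"type": "text", "text": document, "cache_control": {"type": "ephemeral"}},
    {"type": "text", "text": "Summarize this document"},
]}]

# In real use, the reply text would come from the first response,
# for example response.content[0].text.
turn2 = build_turn(turn1, "(summary returned by turn 1)", "Now list the key findings")
```

Passing `turn2` as `messages` keeps the document block byte-for-byte identical to turn 1, which is what lets the second request read it from cache.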

## Cost Calculator

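| Scenario | Without Cache | With Cache | Savings |
|----------|--------------|------------|---------|
| 4K system prompt, 100 requests | $1.20 | $0.13 | 89% |
| 10K RAG context, 50 requests | $1.50 | $0.18 | 88% |
| 20K tools, 200 requests | $12.00 | $1.27 | 89% |

Figures cover only the cached prefix at the Sonnet prices above and assume one cache write followed by a cache read on every later request (no expiry between requests).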
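
To estimate other scenarios, the arithmetic can be sketched as a small function. This is a rough estimate under stated assumptions: it uses the Sonnet prices from the pricing table, counts only the repeated prefix, and assumes one cache write followed by cache reads on every later request. `prefix_cost` is a hypothetical helper name.

```python
# Sonnet prices in USD per million tokens, from the pricing table.
INPUT_PRICE = 3.00
CACHE_WRITE_PRICE = 3.75
CACHE_READ_PRICE = 0.30


def prefix_cost(prefix_tokens, requests, cached):
    """Estimated USD cost of sending the same prefix `requests` times."""
    if requests <= 0:
        return 0.0
    if not cached:
        return prefix_tokens * requests * INPUT_PRICE / 1_000_000
    # One write on the first request, then reads on all later requests.
    write = prefix_tokens * CACHE_WRITE_PRICE / 1_000_000
    reads = prefix_tokens * (requests - 1) * CACHE_READ_PRICE / 1_000_000
    return write + reads


baseline = prefix_cost(8_000, 20, cached=False)
with_cache = prefix_cost(8_000, 20, cached=True)
savings = 1 - with_cache / baseline
print(f"${baseline:.2f} -> ${with_cache:.4f} ({savings:.0%} saved)")
```

Under this model a single request costs 25% more with caching than without, and caching pays for itself from the second request onward (1.25 + 0.10 = 1.35 times the base price, against 2.00 uncached).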

## Best Practices

1. **Cache the longest, most stable content first**: system prompts and tool definitions rarely change
2. **Order matters**: cached content must form a prefix of the request, so put it before any dynamic content
3. **Monitor cache hits**: check `usage.cache_creation_input_tokens` and `usage.cache_read_input_tokens` in each response
4. **Respect the minimum size**: the cached prefix must reach the model's minimum length (1,024 tokens on Sonnet) or it is not cached
5. **Keep the cache warm**: an entry expires when more than 5 minutes pass without a hit, and the next request pays the cache-write price again
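
The monitoring step can be sketched as a helper that turns the usage counters into a hit ratio. It assumes, as the API documentation describes, that `input_tokens` counts only the tokens that were neither read from nor written to cache. `cache_hit_ratio` is a hypothetical name, and the `SimpleNamespace` object stands in for a real `response.usage` with made-up numbers.

```python
from types import SimpleNamespace


def cache_hit_ratio(usage):
    """Fraction of prompt tokens served from cache for one response."""
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    uncached = getattr(usage, "input_tokens", 0) or 0
    total = read + written + uncached
    return read / total if total else 0.0


# Stand-in for `response.usage`; the numbers are made up for illustration.
usage = SimpleNamespace(
    input_tokens=50,
    cache_creation_input_tokens=0,
    cache_read_input_tokens=1950,
)
print(f"cache hit ratio: {cache_hit_ratio(usage):.1%}")
```

A ratio that stays near zero after the first request usually means the prefix is changing between calls, is below the minimum cacheable length, or is expiring between requests.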

## FAQ

**Q: Does caching affect response quality?**
A: No. Caching is purely an optimization — the model sees identical input.

**Q: Can I cache across different conversations?**
A: Yes. If the cached prefix is identical (same system prompt and tools) and the entry has not expired, the request hits the cache regardless of the user message.

**Q: Does Claude Code use prompt caching?**
A: Yes, Claude Code automatically uses prompt caching for CLAUDE.md content, tool definitions, and conversation history.

## Source & Thanks

> - [Anthropic Prompt Caching Docs](https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching)
> - [Pricing page](https://anthropic.com/pricing)

<!-- ZH -->

---
Source: https://tokrepo.com/en/workflows/anthropic-prompt-caching-cut-ai-api-costs-90-ed25d3cb
Author: Anthropic