Quick Use
- Already have the Helicone proxy URL set in your LLM SDK?
- Add the header Helicone-Cache-Enabled: true
- Optional: set Cache-Control: max-age=3600 to control the TTL
Intro
Helicone Cache short-circuits identical LLM requests at the proxy layer — same prompt + same model = cached response, no upstream call, zero LLM cost. Set one header, get sub-millisecond responses on cache hits. Best for: production apps where the same prompt repeats (system instructions, common queries, batch evaluations). Works with: any LLM provider Helicone proxies. Setup time: 1 minute.
Enable cache
```python
import os

from openai import OpenAI

# Assumes your Helicone API key is exported as HELICONE_API_KEY
HELICONE_KEY = os.environ["HELICONE_API_KEY"]

client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={
        "Helicone-Auth": f"Bearer {HELICONE_KEY}",
        "Helicone-Cache-Enabled": "true",
        "Cache-Control": "max-age=3600",  # cache for 1 hour
    },
)

# First call hits the LLM
resp1 = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is 2+2?"}],
)

# Second identical call returns from cache — same content, $0
resp2 = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is 2+2?"}],
)
```
The response includes a Helicone-Cache: HIT header so you know which calls were free.
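To check this programmatically, here is a minimal sketch using the OpenAI SDK's raw-response interface (the Helicone-Cache header name comes from the docs above; the exact hit/miss values are an assumption):

```python
# Inspect response headers to see whether the call was served from cache.
raw = client.chat.completions.with_raw_response.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is 2+2?"}],
)
print(raw.headers.get("helicone-cache"))  # e.g. "HIT" on a cache hit (assumed value)
resp = raw.parse()  # the usual ChatCompletion object
```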
Per-request override
```python
# Override TTL for one call
extra_headers = {"Cache-Control": "max-age=86400"}  # 24h for this one
```
Bucket size for diversity
```python
# Allow 3 distinct cached responses per prompt (round-robin)
extra_headers = {"Helicone-Cache-Bucket-Max-Size": "3"}
```
Useful when you want some variety on common prompts (e.g. greeting messages) without paying for fresh inference each time.
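Applying these to a single call is straightforward, since the OpenAI SDK forwards extra_headers on a per-request basis; this sketch assumes the client from the Enable cache section:

```python
# Per-request headers override the client-level defaults for this call only.
resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Say hi!"}],
    extra_headers={
        "Cache-Control": "max-age=86400",       # 24h TTL for this request
        "Helicone-Cache-Bucket-Max-Size": "3",  # up to 3 cached variants
    },
)
```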
What gets cached
Cache key = method + URL + request body (model, messages, temperature, etc.). Changing any parameter, even a single character in a message, is a cache miss. Useful for:
- Stable system prompts (e.g. classification with fixed instructions)
- Batch evaluations over a fixed set of inputs
- Internal tooling (Slack bots, etc.) that asks repeated questions
Not useful for high-temperature creative generation where you actually want variety.
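To see the key sensitivity in practice, here is a minimal sketch (the prompt and parameter values are illustrative): the two calls below differ only in temperature, so they produce different cache keys and the second one goes upstream.

```python
# Identical prompt and model, but the request bodies differ,
# so the cache keys differ: the second call is a cache miss.
a = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Classify: 'great product!'"}],
    temperature=0.0,
)
b = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Classify: 'great product!'"}],
    temperature=0.1,  # one parameter changed, so this is a fresh LLM call
)
```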
FAQ
Q: Is Helicone Cache free? A: Yes — Cache is part of the Helicone free tier. Cached responses count toward your request quota but don't trigger upstream LLM costs. Free tier covers 10K cached requests/month.
Q: How does this differ from prompt caching (Anthropic / OpenAI)? A: Native prompt caching reuses the prefix of a prompt to cut input token costs. Helicone Cache short-circuits the entire call when prompts are identical, returning the previous full response. They're complementary — use both for max savings.
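For stacking the two, here is a hedged sketch with the Anthropic SDK (the anthropic.helicone.ai proxy URL and the model ID are assumptions, and Anthropic's native prompt caching only kicks in once the cached prefix meets its minimum token count):

```python
import os

import anthropic

client = anthropic.Anthropic(
    base_url="https://anthropic.helicone.ai",  # assumed Helicone proxy for Anthropic
    default_headers={
        "Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}",
        "Helicone-Cache-Enabled": "true",  # Helicone: short-circuit exact repeats
    },
)

resp = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=256,
    system=[
        {
            "type": "text",
            "text": "You are a strict classifier. <long fixed instructions>",
            "cache_control": {"type": "ephemeral"},  # Anthropic: reuse this prefix
        }
    ],
    messages=[{"role": "user", "content": "Classify: 'great product!'"}],
)
```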
Q: Can I see cache hit rate? A: Yes — Helicone dashboard shows cache hits/misses per project, model, and time. Use it to find prompts that should be cached (high repeat rate, high cost) or shouldn't be (low repeat, high temperature).
Source & Thanks
Built by Helicone. Licensed under Apache-2.0.
Helicone/helicone — ⭐ 4,000+