Knowledge · May 8, 2026 · 4 min read

Helicone Cache — Cut LLM Spend with Drop-In Response Caching

Helicone Cache short-circuits identical LLM requests at the proxy. Set the Helicone-Cache-Enabled header and exact-match requests come back in milliseconds at zero LLM cost.

Intro

Helicone Cache short-circuits identical LLM requests at the proxy layer — same prompt + same model = cached response, no upstream call, zero LLM cost. Set one header, get sub-millisecond responses on cache hits. Best for: production apps where the same prompt repeats (system instructions, common queries, batch evaluations). Works with: any LLM provider Helicone proxies. Setup time: 1 minute.


Enable cache

import os

from openai import OpenAI

# The SDK reads OPENAI_API_KEY from the environment;
# the Helicone key is assumed to live in HELICONE_API_KEY
HELICONE_KEY = os.environ["HELICONE_API_KEY"]

client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={
        "Helicone-Auth": f"Bearer {HELICONE_KEY}",
        "Helicone-Cache-Enabled": "true",
        "Cache-Control": "max-age=3600",  # cache responses for 1 hour
    },
)

# First call hits the LLM
resp1 = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is 2+2?"}],
)

# Second identical call returns from cache — same content, $0
resp2 = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is 2+2?"}],
)

The response includes a Helicone-Cache: HIT header so you know which calls were free.
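If you want to check that programmatically, the OpenAI Python SDK's with_raw_response wrapper exposes response headers. A minimal sketch, reusing the client from above and assuming the header carries HIT when a request is served from cache:

# Sketch: inspect the Helicone-Cache header via the SDK's raw-response wrapper
raw = client.chat.completions.with_raw_response.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is 2+2?"}],
)
print(raw.headers.get("Helicone-Cache"))  # "HIT" when served from cache
completion = raw.parse()  # the usual ChatCompletion object
print(completion.choices[0].message.content)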

Per-request override

# Override TTL for one call
extra_headers = {"Cache-Control": "max-age=86400"}  # 24h for this one
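The OpenAI Python SDK forwards extra_headers on individual calls, so the override applies to just one request while the client-level default stays at one hour. A minimal sketch, reusing the client from the first example:

# Sketch: per-call TTL override; other calls keep the client-level max-age=3600
resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    extra_headers={"Cache-Control": "max-age=86400"},  # 24h TTL for this request only
)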

Bucket size for diversity

# Allow 3 distinct cached responses per prompt (round-robin)
extra_headers = {"Helicone-Cache-Bucket-Max-Size": "3"}

Useful when you want some variety on common prompts (e.g. greeting messages) without paying for fresh inference each time.
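Here's a minimal sketch of how that looks in practice, again passing the header per call and reusing the client from above (the prompt is just an illustration):

# Sketch: Helicone keeps up to 3 distinct responses for this prompt
# and serves them from the bucket on subsequent hits
for _ in range(5):
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Write a one-line greeting."}],
        extra_headers={"Helicone-Cache-Bucket-Max-Size": "3"},
    )
    print(resp.choices[0].message.content)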

What gets cached

Cache key = method + URL + body (model, messages, temperature, etc.). Any change to any parameter is a cache miss; see the sketch after this list. Useful for:

  • Stable system prompts (e.g. classification with fixed instructions)
  • Batch evaluations on a fixed set of inputs
  • Internal tooling (Slack bots, etc.) that asks repeated questions

Not useful for high-temperature creative generation where you actually want variety.
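As a concrete sketch of the exact-match rule: two calls that differ only in temperature produce different cache keys, so the second one still goes upstream.

# Same prompt, different temperature -> different request body -> cache miss
a = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Classify: 'great product!'"}],
    temperature=0,
)
b = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Classify: 'great product!'"}],
    temperature=0.2,  # any change to the body, even this, creates a new cache key
)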


FAQ

Q: Is Helicone Cache free? A: Yes — Cache is part of the Helicone free tier. Cached responses count toward your request quota but don't trigger upstream LLM costs. Free tier covers 10K cached requests/month.

Q: How does this differ from prompt caching (Anthropic / OpenAI)? A: Native prompt caching reuses the prefix of a prompt to cut input token costs. Helicone Cache short-circuits the entire call when prompts are identical, returning the previous full response. They're complementary — use both for max savings.

Q: Can I see cache hit rate? A: Yes — Helicone dashboard shows cache hits/misses per project, model, and time. Use it to find prompts that should be cached (high repeat rate, high cost) or shouldn't be (low repeat, high temperature).


Quick Use

  1. Point your LLM SDK at the Helicone proxy base URL (e.g. https://oai.helicone.ai/v1)
  2. Add the header Helicone-Cache-Enabled: true
  3. Optional: add Cache-Control: max-age=3600 to set the TTL

Source & Thanks

Built by Helicone. Licensed under Apache-2.0.

Helicone/helicone — ⭐ 4,000+
