Knowledge · May 8, 2026 · 4 min read

Helicone Cache — Cut LLM Spend with Drop-In Response Caching

Helicone Cache short-circuits identical LLM requests at the proxy. Set the Helicone-Cache-Enabled header, and exact-match responses come back in milliseconds at zero cost.

Helicone · Community
Agent-ready

This asset can be read and installed directly by agents

TokRepo exposes the install CLI command, JSON metadata, install plan, and raw content to help agents judge fit, risk, and next actions.

Native · 96/100 · Policy: allow
Target
Claude Code, Codex, Gemini CLI
Type
Knowledge
Installation
Single
Trust
Trust: New
Entry point
Asset
Install CLI command
npx tokrepo install 5d1acc2e-f42d-4fce-aec7-771506f858ae --target codex
Introduction

Helicone Cache short-circuits identical LLM requests at the proxy layer — same prompt + same model = cached response, no upstream call, zero LLM cost. Set one header, get sub-millisecond responses on cache hits. Best for: production apps where the same prompt repeats (system instructions, common queries, batch evaluations). Works with: any LLM provider Helicone proxies. Setup time: 1 minute.


Enable cache

import os

from openai import OpenAI

HELICONE_KEY = os.environ["HELICONE_API_KEY"]  # your Helicone API key

client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={
        "Helicone-Auth": f"Bearer {HELICONE_KEY}",
        "Helicone-Cache-Enabled": "true",
        "Cache-Control": "max-age=3600",  # cache for 1 hour
    },
)

# First call hits the LLM
resp1 = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is 2+2?"}],
)

# Second identical call returns from cache — same content, $0
resp2 = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is 2+2?"}],
)

The response includes a Helicone-Cache: HIT header so you know which calls were free.
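To act on that header programmatically, you can read it off the raw HTTP response. A minimal sketch, assuming the OpenAI Python SDK (whose with_raw_response accessor exposes response headers); the header shapes shown are the HIT/MISS values described above:

```python
# Sketch: detect cache hits from response headers. With the OpenAI Python
# SDK, raw headers are available via
#   client.chat.completions.with_raw_response.create(...).headers
def was_cache_hit(headers: dict) -> bool:
    # Helicone reports cache status in the Helicone-Cache response header.
    return headers.get("Helicone-Cache", "").upper() == "HIT"

assert was_cache_hit({"Helicone-Cache": "HIT"})
assert not was_cache_hit({"Helicone-Cache": "MISS"})
assert not was_cache_hit({})  # header absent when caching is disabled
```

Routing only cache misses to logging or alerting is a cheap way to watch real upstream spend.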

Per-request override

# Override TTL for one call: pass the header via extra_headers on create()
resp = client.chat.completions.create(
    model="gpt-4o", messages=[{"role": "user", "content": "What is 2+2?"}],
    extra_headers={"Cache-Control": "max-age=86400"})  # 24h for this one

Bucket size for diversity

# Allow 3 distinct cached responses per prompt (round-robin)
extra_headers = {"Helicone-Cache-Bucket-Max-Size": "3"}

Useful when you want some variety on common prompts (e.g. greeting messages) without paying for fresh inference each time.

What gets cached

Cache key = method + URL + request body (model, messages, temperature, etc.). Any change to any parameter is a cache miss. Useful for:

  • Stable system prompts (e.g. classification with fixed instructions)
  • Batch evaluations on a fixed set of inputs
  • Internal tooling (Slack bots, etc.) that asks repeated questions

Not useful for high-temperature creative generation where you actually want variety.
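The exact-match behavior can be illustrated with a toy key function. This is a hypothetical reconstruction, not Helicone's actual keying code; it only shows why any change to any parameter produces a miss:

```python
import hashlib
import json

def cache_key(method: str, url: str, body: dict) -> str:
    # Hypothetical exact-match key: hash of method + URL + serialized body.
    payload = method + url + json.dumps(body, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

base = {"model": "gpt-4o",
        "messages": [{"role": "user", "content": "What is 2+2?"}]}
warm = cache_key("POST", "/v1/chat/completions", base)

# The identical request maps to the same key: a cache HIT.
assert cache_key("POST", "/v1/chat/completions", dict(base)) == warm

# Adding or changing any parameter (here temperature) gives a new key: MISS.
assert cache_key("POST", "/v1/chat/completions",
                 {**base, "temperature": 0.7}) != warm
```

This is also why even a trailing space in a message or a reordered parameter busts the cache: the serialized body changes, so the key changes.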


FAQ

Q: Is Helicone Cache free? A: Yes — Cache is part of the Helicone free tier. Cached responses count toward your request quota but don't trigger upstream LLM costs. Free tier covers 10K cached requests/month.

Q: How does this differ from prompt caching (Anthropic / OpenAI)? A: Native prompt caching reuses the prefix of a prompt to cut input token costs. Helicone Cache short-circuits the entire call when prompts are identical, returning the previous full response. They're complementary — use both for max savings.

Q: Can I see cache hit rate? A: Yes — Helicone dashboard shows cache hits/misses per project, model, and time. Use it to find prompts that should be cached (high repeat rate, high cost) or shouldn't be (low repeat, high temperature).
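Dashboard hit rates translate directly into expected savings. A back-of-envelope model with illustrative numbers (not Helicone or provider pricing):

```python
def expected_cost(calls: int, hit_rate: float, cost_per_call: float) -> float:
    # Cache hits cost $0 upstream; only misses reach the LLM provider.
    return calls * (1 - hit_rate) * cost_per_call

# Illustrative: 100k calls/month at $0.01 per call
baseline = expected_cost(100_000, 0.0, 0.01)  # no cache: ~$1,000
cached = expected_cost(100_000, 0.6, 0.01)    # 60% hit rate: ~$400
```

The same function also shows when caching is not worth it: at a low repeat rate (say 5%), savings are marginal and a short TTL may be more trouble than it saves.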


Quick Use

  1. Point your LLM SDK at your Helicone proxy URL (e.g. https://oai.helicone.ai/v1 for OpenAI)
  2. Add the header Helicone-Cache-Enabled: true
  3. Optional: set Cache-Control: max-age=3600 to control the TTL


Source & Thanks

Built by Helicone. Licensed under Apache-2.0.

Helicone/helicone — ⭐ 4,000+
