Knowledge · May 8, 2026 · 4 min read

Helicone Cache — Cut LLM Spend with Drop-In Response Caching

Helicone Cache short-circuits identical LLM requests at the proxy. Set the Helicone-Cache-Enabled header, and exact-match responses come back in milliseconds at zero cost.

Helicone · Community
Agent-ready

This asset can be read and installed directly by agents

TokRepo exposes the install CLI command, JSON metadata, install plan, and raw content to help agents judge fit, risk, and next actions.

Native · 96/100 · Policy: allow
Target
Claude Code, Codex, Gemini CLI
Type
Knowledge
Installation
Single
Trust
Trust: New
Entry point
Asset
Install CLI command
npx tokrepo install 5d1acc2e-f42d-4fce-aec7-771506f858ae --target codex
Introduction

Helicone Cache short-circuits identical LLM requests at the proxy layer — same prompt + same model = cached response, no upstream call, zero LLM cost. Set one header, get sub-millisecond responses on cache hits. Best for: production apps where the same prompt repeats (system instructions, common queries, batch evaluations). Works with: any LLM provider Helicone proxies. Setup time: 1 minute.


Enable cache

import os

from openai import OpenAI

HELICONE_KEY = os.environ["HELICONE_API_KEY"]  # your Helicone API key

client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={
        "Helicone-Auth": f"Bearer {HELICONE_KEY}",
        "Helicone-Cache-Enabled": "true",
        "Cache-Control": "max-age=3600",  # cache for 1 hour
    },
)

# First call hits the LLM
resp1 = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is 2+2?"}],
)

# Second identical call returns from cache — same content, $0
resp2 = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is 2+2?"}],
)

The response includes a Helicone-Cache: HIT header so you know which calls were free.
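To act on that header programmatically, you can read it off the raw HTTP response. A minimal sketch, assuming the OpenAI Python SDK (whose with_raw_response accessor exposes response headers); the header shapes shown are the HIT/MISS values described above:

```python
# Sketch: detect cache hits from response headers. With the OpenAI Python
# SDK, raw headers are available via
#   client.chat.completions.with_raw_response.create(...).headers
def was_cache_hit(headers: dict) -> bool:
    # Helicone reports cache status in the Helicone-Cache response header.
    return headers.get("Helicone-Cache", "").upper() == "HIT"

assert was_cache_hit({"Helicone-Cache": "HIT"})
assert not was_cache_hit({"Helicone-Cache": "MISS"})
assert not was_cache_hit({})  # header absent when caching is disabled
```

Routing only cache misses to logging or alerting is a cheap way to watch real upstream spend.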

Per-request override

# Override TTL for one call: pass the header via extra_headers on create()
resp = client.chat.completions.create(
    model="gpt-4o", messages=[{"role": "user", "content": "What is 2+2?"}],
    extra_headers={"Cache-Control": "max-age=86400"})  # 24h for this one

Bucket size for diversity

# Allow 3 distinct cached responses per prompt (round-robin)
extra_headers = {"Helicone-Cache-Bucket-Max-Size": "3"}

Useful when you want some variety on common prompts (e.g. greeting messages) without paying for fresh inference each time.

What gets cached

Cache key = method + URL + request body (model, messages, temperature, etc.). Any change to any parameter is a cache miss. Useful for:

  • Stable system prompts (e.g. classification with fixed instructions)
  • Batch evaluations on a fixed set of inputs
  • Internal tooling (Slack bots, etc.) that asks repeated questions

Not useful for high-temperature creative generation where you actually want variety.
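The exact-match behavior can be illustrated with a toy key function. This is a hypothetical reconstruction, not Helicone's actual keying code; it only shows why any change to any parameter produces a miss:

```python
import hashlib
import json

def cache_key(method: str, url: str, body: dict) -> str:
    # Hypothetical exact-match key: hash of method + URL + serialized body.
    payload = method + url + json.dumps(body, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

base = {"model": "gpt-4o",
        "messages": [{"role": "user", "content": "What is 2+2?"}]}
warm = cache_key("POST", "/v1/chat/completions", base)

# The identical request maps to the same key: a cache HIT.
assert cache_key("POST", "/v1/chat/completions", dict(base)) == warm

# Adding or changing any parameter (here temperature) gives a new key: MISS.
assert cache_key("POST", "/v1/chat/completions",
                 {**base, "temperature": 0.7}) != warm
```

This is also why even a trailing space in a message or a reordered parameter busts the cache: the serialized body changes, so the key changes.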


FAQ

Q: Is Helicone Cache free? A: Yes — Cache is part of the Helicone free tier. Cached responses count toward your request quota but don't trigger upstream LLM costs. Free tier covers 10K cached requests/month.

Q: How does this differ from prompt caching (Anthropic / OpenAI)? A: Native prompt caching reuses the prefix of a prompt to cut input token costs. Helicone Cache short-circuits the entire call when prompts are identical, returning the previous full response. They're complementary — use both for max savings.

Q: Can I see cache hit rate? A: Yes — Helicone dashboard shows cache hits/misses per project, model, and time. Use it to find prompts that should be cached (high repeat rate, high cost) or shouldn't be (low repeat, high temperature).
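Dashboard hit rates translate directly into expected savings. A back-of-envelope model with illustrative numbers (not Helicone or provider pricing):

```python
def expected_cost(calls: int, hit_rate: float, cost_per_call: float) -> float:
    # Cache hits cost $0 upstream; only misses reach the LLM provider.
    return calls * (1 - hit_rate) * cost_per_call

# Illustrative: 100k calls/month at $0.01 per call
baseline = expected_cost(100_000, 0.0, 0.01)  # no cache: ~$1,000
cached = expected_cost(100_000, 0.6, 0.01)    # 60% hit rate: ~$400
```

The same function also shows when caching is not worth it: at a low repeat rate (say 5%), savings are marginal and a short TTL may be more trouble than it saves.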


Quick Use

  1. Point your LLM SDK at your Helicone proxy URL (e.g. https://oai.helicone.ai/v1 for OpenAI)
  2. Add the header Helicone-Cache-Enabled: true
  3. Optional: set Cache-Control: max-age=3600 to control the TTL


Source & Thanks

Built by Helicone. Licensed under Apache-2.0.

Helicone/helicone — ⭐ 4,000+
