Cloudflare AI Gateway — Edge Proxy for LLM Traffic
Cloudflare AI Gateway is a free edge proxy that sits between your app and LLM providers — caching responses, rate-limiting abusive traffic, failing over across models, and emitting analytics, all without changing your SDK code.
Why Cloudflare AI Gateway
The cheapest "I need production-grade LLM infrastructure right now" answer. Cloudflare AI Gateway is free at the Workers free tier, deploys in minutes, and supports OpenAI, Anthropic, Gemini, Groq, Mistral, Workers AI, and a dozen other providers without SDK changes — just change the base URL.
The trade-off is opinionated simplicity. You get caching, rate-limiting, retry/fallback, and a dashboard with request logs and spend tracking. You don’t get Portkey’s prompt management, LiteLLM’s extensive routing rules, or Langfuse-depth traces. For a startup shipping its first LLM feature, that trade is almost always correct.
The Cloudflare edge network is the hidden benefit. Because the gateway runs at 300+ POPs, LLM requests hit a nearby Cloudflare edge first, then Cloudflare reaches out to the provider from a warm connection. On cache hits (a surprising fraction of real traffic) you return in milliseconds without hitting the provider at all.
Quick Start — Switch Base URL, Nothing Else
The only change is base_url. The gateway supports OpenAI, Anthropic, Gemini, Workers AI, Groq, Mistral, Perplexity, HuggingFace, Replicate, Cohere, Azure, AWS Bedrock, and Vertex AI — each under its own path segment. Caching, retries, and fallbacks are configured in the dashboard, not in code.
# 1. In Cloudflare dashboard: AI → AI Gateway → Create gateway
# → You get a base URL like https://gateway.ai.cloudflare.com/v1/<account>/<gateway>
#
# 2. Point your SDK at it. Everything else stays the same.
from openai import OpenAI

client = OpenAI(
    api_key="sk-...",
    base_url="https://gateway.ai.cloudflare.com/v1/<account>/<gateway>/openai",
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize the AI gateway category."}],
    # Cache identical requests for 1 hour
    extra_headers={"cf-aig-cache-ttl": "3600"},
)
print(resp.choices[0].message.content)
# Anthropic? Same gateway, different path segment:
# base_url="https://gateway.ai.cloudflare.com/v1/<account>/<gateway>/anthropic"
# Dashboard now shows logs, cache hits, per-provider spend, and failure rates.

Key Features
Drop-in base URL
No SDK change and no new client library: your existing OpenAI/Anthropic code keeps working after you change the base URL. A zero-risk migration.
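Each provider lives under its own path segment of the same gateway URL, so a tiny helper can derive the per-provider base URL to hand to any SDK. A minimal sketch; the helper name and the placeholder account/gateway IDs are illustrative:

```python
GATEWAY_TEMPLATE = "https://gateway.ai.cloudflare.com/v1/{account}/{gateway}/{provider}"

def gateway_base_url(account: str, gateway: str, provider: str) -> str:
    """Build the base_url for one provider behind a Cloudflare AI Gateway."""
    return GATEWAY_TEMPLATE.format(account=account, gateway=gateway, provider=provider)

# Hand the result to any SDK that accepts a custom base URL, e.g.:
#   OpenAI(base_url=gateway_base_url(acct, gw, "openai"))
#   Anthropic(base_url=gateway_base_url(acct, gw, "anthropic"))
print(gateway_base_url("acct123", "prod", "anthropic"))
```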
Semantic + exact cache
Identical requests are cached by default. Semantic cache (paid) matches near-duplicate prompts via embeddings — typical ~20-40% hit rate on real traffic.
Per-provider fallback
Configure automatic failover: try Anthropic first, fall back to OpenAI on timeout or 5xx. Reduces incident impact without client-side code.
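Fallback is configured in the dashboard, but Cloudflare also documents a Universal Endpoint: you POST an ordered JSON array of provider requests to the gateway root, and it falls through to the next entry on failure. A sketch of the payload shape as I understand it from the docs — field names (`provider`, `endpoint`, `headers`, `query`) should be verified against the current Cloudflare documentation:

```python
import json

# Ordered fallback: try Anthropic first, then OpenAI.
# POST this array to https://gateway.ai.cloudflare.com/v1/<account>/<gateway>
steps = [
    {
        "provider": "anthropic",
        "endpoint": "v1/messages",  # provider-native endpoint path (assumed shape)
        "headers": {"x-api-key": "sk-ant-...", "anthropic-version": "2023-06-01"},
        "query": {
            "model": "claude-3-5-haiku-latest",
            "max_tokens": 256,
            "messages": [{"role": "user", "content": "ping"}],
        },
    },
    {
        "provider": "openai",
        "endpoint": "chat/completions",
        "headers": {"Authorization": "Bearer sk-..."},
        "query": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": "ping"}],
        },
    },
]
payload = json.dumps(steps)
# e.g. requests.post("https://gateway.ai.cloudflare.com/v1/<account>/<gateway>", data=payload)
print(steps[0]["provider"], "->", steps[1]["provider"])
```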
Rate limits by user / route
Cap request volume per custom identifier (user ID, API key). Useful for free-tier products and abuse prevention. Configurable per gateway.
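When a request trips a gateway rate limit, the client sees an HTTP 429. A minimal retry-with-exponential-backoff sketch; `RateLimited` and `flaky` are stand-ins for your SDK's 429 error and your actual request call:

```python
import time

class RateLimited(Exception):
    """Stand-in for an HTTP 429 from the gateway."""

def with_backoff(send, max_attempts=4, base_delay=0.05):
    """Retry `send` on RateLimited, doubling the delay each attempt."""
    for attempt in range(max_attempts):
        try:
            return send()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# Demo: a fake request that is rate-limited twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimited()
    return "ok"

print(with_backoff(flaky))  # → ok
```

In production you would also honor a `Retry-After` header if the gateway returns one.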
Request logs & spend dashboard
Every request logged with prompt, response, latency, cost. Filter by model, status, custom tags. Adequate for ops; not a replacement for Langfuse-depth tracing.
Edge network performance
Gateway runs at 300+ POPs. Cache hits return in ~10ms regardless of provider region. Even misses benefit from Cloudflare’s warm upstream connections.
Comparison
| Gateway | Deployment | Cost | Prompt Mgmt | Observability Depth |
|---|---|---|---|---|
| Cloudflare AI Gateway (this product) | Managed edge | Free tier + pay-as-you-go | No | Basic (logs, spend) |
| Portkey | Managed + self-host | Paid plans | Yes (versioning + A/B) | Medium |
| LiteLLM Proxy | Self-host | Free (OSS) | Partial | Integrates with Langfuse |
| Kong AI Gateway | Self-host enterprise | Kong license | Via Kong plugins | Via Kong ecosystem |
Use Cases
01. Early-stage startups
First LLM feature shipped. Cloudflare gateway adds caching, failover, and cost visibility in an afternoon — before you need a dedicated observability stack.
02. High-traffic consumer apps
When a significant fraction of prompts are near-duplicates (chatbots, search suggestions), Cloudflare’s edge cache saves both latency and LLM spend.
03. Teams already on Cloudflare
Workers, Pages, D1, R2 users get native integration. AI Gateway fits into the existing Cloudflare account, bindings, and observability — no new vendor.
Pricing & License
Free tier: the first 100K logged requests per month are free. Unlogged requests (pure passthrough) have no hard cap but may be rate-limited under extreme load.
Paid tier: usage-based beyond the free tier. Semantic caching and extended log retention are paid add-ons. Current pricing on Cloudflare docs.
Hidden savings: the most impactful "cost" of this product is negative — cache hits reduce your LLM bill directly. A startup chat app paying $2K/month to OpenAI can cut that bill by 15-30% by enabling aggressive caching on repeated prompts.
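The back-of-envelope math above, with the cache hit rate as the free variable (all numbers illustrative):

```python
monthly_spend = 2000.0           # $/month on OpenAI, from the example above
for hit_rate in (0.15, 0.30):    # plausible cache hit rates
    saved = monthly_spend * hit_rate
    print(f"{hit_rate:.0%} hit rate -> ${saved:.0f}/month saved, "
          f"${monthly_spend - saved:.0f} remaining")
```

At a 30% hit rate the gateway pays for itself many times over, since the free tier costs nothing.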
Related Assets on TokRepo
Webstudio — Open Source Visual Website Builder
Webstudio is an open-source Webflow alternative with a visual drag-and-drop editor, full CSS support, headless CMS integration, and self-hosting on Cloudflare.
Wrangler MCP — Cloudflare Workers for AI Agents
MCP server for managing Cloudflare Workers, KV, R2, and D1 from AI agents. Deploy serverless functions, manage storage, and query databases through Claude Code tool calls.
Cloudflare Workers AI — Serverless AI Inference
Run AI models at the edge with Cloudflare Workers. Text generation, image generation, speech-to-text, translation, embeddings — all serverless with global distribution.
ClickHouse — Open Source Real-Time Analytics Database
ClickHouse is a lightning-fast, open-source column-oriented database for real-time analytics. Query billions of rows in milliseconds with SQL. Used by Cloudflare, Uber, eBay.
Frequently Asked Questions
Is Cloudflare AI Gateway really free?
The free tier covers 100K logged requests per month, which is enough for most small-to-mid apps. Beyond that, pricing is usage-based. Unlogged passthrough is uncapped but unmonitored — most teams log everything.
Does it support Anthropic Claude?
Yes. Supported providers in 2026 include OpenAI, Anthropic, Google Gemini, Groq, Mistral, Workers AI, Cohere, HuggingFace, Replicate, Perplexity, Azure OpenAI, AWS Bedrock, and Vertex AI. Each sits under its own path segment of the gateway URL.
How does semantic caching work?
Instead of exact-match caching, semantic cache embeds the incoming prompt and matches against embeddings of recent prompts. When a close enough match is found (configurable threshold), the cached response is returned. Typical hit rates: 20-40% on high-repetition workloads. Embedding cost is small relative to skipped LLM calls.
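The lookup described above can be illustrated with a toy in-memory version — this is not Cloudflare's implementation, just the embed-then-nearest-neighbour idea, with a deliberately crude bag-of-letters "embedding" so the demo runs standalone:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class SemanticCache:
    """Toy semantic cache: embed the prompt, match against stored entries."""
    def __init__(self, embed, threshold=0.95):
        self.embed = embed
        self.threshold = threshold  # configurable similarity cutoff
        self.entries = []           # list of (embedding, cached_response)

    def get(self, prompt):
        v = self.embed(prompt)
        best = max(self.entries, key=lambda e: cosine(v, e[0]), default=None)
        if best and cosine(v, best[0]) >= self.threshold:
            return best[1]          # close enough: serve the cached response
        return None                 # miss: caller falls through to the LLM

    def put(self, prompt, response):
        self.entries.append((self.embed(prompt), response))

# Crude stand-in embedding: letter frequencies (a real system uses a model).
def toy_embed(text):
    v = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            v[ord(ch) - 97] += 1.0
    return v

cache = SemanticCache(toy_embed)
cache.put("what is an ai gateway", "An AI gateway proxies LLM traffic.")
print(cache.get("what is an ai gateway?"))  # near-duplicate → cache hit
```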
Is this a full observability platform?
No — it’s a gateway with basic observability. For deeper tracing (tool calls, chains, spans), pair Cloudflare AI Gateway with Langfuse or Helicone. Cloudflare handles ingress and caching; Langfuse handles structured traces and evals.
Can I run a self-hosted version?
No. Cloudflare AI Gateway is a managed product. For self-hosted alternatives, look at LiteLLM Proxy or Kong AI Gateway. Many teams run both — Cloudflare at the edge for global caching, LiteLLM for internal routing policies.
Does it work with the Vercel AI SDK or LangChain?
Yes. Both libraries accept a custom baseURL for OpenAI-compatible providers. Point them at your Cloudflare gateway URL and the rest works unchanged.