Cloudflare AI Gateway — Edge Proxy for LLM Traffic
Cloudflare AI Gateway is a free edge proxy that sits between your app and LLM providers — caching responses, rate-limiting abusive traffic, failing over across models, and emitting analytics, all without changing your SDK code.
Why Cloudflare AI Gateway
The cheapest "I need production-grade LLM infrastructure right now" answer. Cloudflare AI Gateway is free at the Workers free tier, deploys in minutes, and supports OpenAI, Anthropic, Gemini, Groq, Mistral, Workers AI, and a dozen other providers without SDK changes — just change the base URL.
The trade-off is opinionated simplicity. You get caching, rate-limiting, retry/fallback, and a dashboard with request logs and spend tracking. You don’t get Portkey’s prompt management, LiteLLM’s extensive routing rules, or Langfuse-depth traces. For a startup shipping its first LLM feature, that trade is almost always correct.
The Cloudflare edge network is the hidden benefit. Because the gateway runs at 300+ POPs, LLM requests hit a nearby Cloudflare edge first, then Cloudflare reaches out to the provider from a warm connection. On cache hits (a surprising fraction of real traffic) you return in milliseconds without hitting the provider at all.
Quick Start — Switch Base URL, Nothing Else
The only change is base_url. The gateway supports OpenAI, Anthropic, Gemini, Workers AI, Groq, Mistral, Perplexity, HuggingFace, Replicate, Cohere, Azure, AWS Bedrock, and Vertex AI — each under its own path segment. Caching, retries, and fallbacks are configured in the dashboard, not in code.
# 1. In Cloudflare dashboard: AI → AI Gateway → Create gateway
# → You get a base URL like https://gateway.ai.cloudflare.com/v1/<account>/<gateway>
#
# 2. Point your SDK at it. Everything else stays the same.
from openai import OpenAI

client = OpenAI(
    api_key="sk-...",
    base_url="https://gateway.ai.cloudflare.com/v1/<account>/<gateway>/openai",
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize the AI gateway category."}],
    # Cache identical requests for 1 hour
    extra_headers={"cf-aig-cache-ttl": "3600"},
)
print(resp.choices[0].message.content)
# Anthropic? Same gateway, different path segment:
# base_url="https://gateway.ai.cloudflare.com/v1/<account>/<gateway>/anthropic"
# Dashboard now shows logs, cache hits, per-provider spend, and failure rates.

Key Features
Drop-in base URL
No SDK change and no new client library: your existing OpenAI/Anthropic code keeps working after you change the base URL. A zero-risk migration.
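Each provider lives under its own path segment of the same gateway URL, so a tiny helper can derive the per-provider base URL to hand to any SDK. A minimal sketch; the helper name and the placeholder account/gateway IDs are illustrative:

```python
GATEWAY_TEMPLATE = "https://gateway.ai.cloudflare.com/v1/{account}/{gateway}/{provider}"

def gateway_base_url(account: str, gateway: str, provider: str) -> str:
    """Build the base_url for one provider behind a Cloudflare AI Gateway."""
    return GATEWAY_TEMPLATE.format(account=account, gateway=gateway, provider=provider)

# Hand the result to any SDK that accepts a custom base URL, e.g.:
#   OpenAI(base_url=gateway_base_url(acct, gw, "openai"))
#   Anthropic(base_url=gateway_base_url(acct, gw, "anthropic"))
print(gateway_base_url("acct123", "prod", "anthropic"))
```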
Semantic + exact cache
Identical requests are cached by default. Semantic cache (paid) matches near-duplicate prompts via embeddings — typical ~20-40% hit rate on real traffic.
Per-provider fallback
Configure automatic failover: try Anthropic first, fall back to OpenAI on timeout or 5xx. Reduces incident impact without client-side code.
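Fallback is configured in the dashboard, but Cloudflare also documents a Universal Endpoint: you POST an ordered JSON array of provider requests to the gateway root, and it falls through to the next entry on failure. A sketch of the payload shape as I understand it from the docs — field names (`provider`, `endpoint`, `headers`, `query`) should be verified against the current Cloudflare documentation:

```python
import json

# Ordered fallback: try Anthropic first, then OpenAI.
# POST this array to https://gateway.ai.cloudflare.com/v1/<account>/<gateway>
steps = [
    {
        "provider": "anthropic",
        "endpoint": "v1/messages",  # provider-native endpoint path (assumed shape)
        "headers": {"x-api-key": "sk-ant-...", "anthropic-version": "2023-06-01"},
        "query": {
            "model": "claude-3-5-haiku-latest",
            "max_tokens": 256,
            "messages": [{"role": "user", "content": "ping"}],
        },
    },
    {
        "provider": "openai",
        "endpoint": "chat/completions",
        "headers": {"Authorization": "Bearer sk-..."},
        "query": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": "ping"}],
        },
    },
]
payload = json.dumps(steps)
# e.g. requests.post("https://gateway.ai.cloudflare.com/v1/<account>/<gateway>", data=payload)
print(steps[0]["provider"], "->", steps[1]["provider"])
```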
Rate limits by user / route
Cap request volume per custom identifier (user ID, API key). Useful for free-tier products and abuse prevention. Configurable per gateway.
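When a request trips a gateway rate limit, the client sees an HTTP 429. A minimal retry-with-exponential-backoff sketch; `RateLimited` and `flaky` are stand-ins for your SDK's 429 error and your actual request call:

```python
import time

class RateLimited(Exception):
    """Stand-in for an HTTP 429 from the gateway."""

def with_backoff(send, max_attempts=4, base_delay=0.05):
    """Retry `send` on RateLimited, doubling the delay each attempt."""
    for attempt in range(max_attempts):
        try:
            return send()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# Demo: a fake request that is rate-limited twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimited()
    return "ok"

print(with_backoff(flaky))  # → ok
```

In production you would also honor a `Retry-After` header if the gateway returns one.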
Request logs & spend dashboard
Every request logged with prompt, response, latency, cost. Filter by model, status, custom tags. Adequate for ops; not a replacement for Langfuse-depth tracing.
Edge network performance
Gateway runs at 300+ POPs. Cache hits return in ~10ms regardless of provider region. Even misses benefit from Cloudflare’s warm upstream connections.
Comparison
| Gateway | Deployment | Cost | Prompt Mgmt | Observability Depth |
|---|---|---|---|---|
| Cloudflare AI Gateway (this product) | Managed edge | Free tier + pay-as-you-go | No | Basic (logs, spend) |
| Portkey | Managed + self-host | Paid plans | Yes (versioning + A/B) | Medium |
| LiteLLM Proxy | Self-host | Free (OSS) | Partial | Integrates with Langfuse |
| Kong AI Gateway | Self-host enterprise | Kong license | Via Kong plugins | Via Kong ecosystem |
Use Cases
01. Early-stage startups
First LLM feature shipped. Cloudflare gateway adds caching, failover, and cost visibility in an afternoon — before you need a dedicated observability stack.
02. High-traffic consumer apps
When a significant fraction of prompts are near-duplicates (chatbots, search suggestions), Cloudflare’s edge cache saves both latency and LLM spend.
03. Teams already on Cloudflare
Workers, Pages, D1, R2 users get native integration. AI Gateway fits into the existing Cloudflare account, bindings, and observability — no new vendor.
Pricing & License
Free tier: the first 100K logged requests per month are free. Unlogged requests (pure passthrough) have no hard cap but may be rate-limited under extreme load.
Paid tier: usage-based beyond the free tier. Semantic caching and extended log retention are paid add-ons. Current pricing on Cloudflare docs.
Hidden savings: the most impactful "cost" of this product is negative — cache hits reduce your LLM bill directly. A startup chat app paying $2K/month to OpenAI can cut that bill by 15-30% by enabling aggressive caching on repeated prompts.
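The back-of-envelope math above, with the cache hit rate as the free variable (all numbers illustrative):

```python
monthly_spend = 2000.0           # $/month on OpenAI, from the example above
for hit_rate in (0.15, 0.30):    # plausible cache hit rates
    saved = monthly_spend * hit_rate
    print(f"{hit_rate:.0%} hit rate -> ${saved:.0f}/month saved, "
          f"${monthly_spend - saved:.0f} remaining")
```

At a 30% hit rate the gateway pays for itself many times over, since the free tier costs nothing.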
Related Assets on TokRepo
Webstudio — Open Source Visual Website Builder
Webstudio is an open-source Webflow alternative with a visual drag-and-drop editor, full CSS support, headless CMS integration, and self-hosting on Cloudflare.
Wrangler MCP — Cloudflare Workers for AI Agents
MCP server for managing Cloudflare Workers, KV, R2, and D1 from AI agents. Deploy serverless functions, manage storage, and query databases through Claude Code tool calls.
Cloudflare Workers AI — Serverless AI Inference
Run AI models at the edge with Cloudflare Workers. Text generation, image generation, speech-to-text, translation, embeddings — all serverless with global distribution.
ClickHouse — Open Source Real-Time Analytics Database
ClickHouse is a lightning-fast, open-source column-oriented database for real-time analytics. Query billions of rows in milliseconds with SQL. Used by Cloudflare, Uber, eBay.
Frequently Asked Questions
Is Cloudflare AI Gateway really free?
The free tier covers 100K logged requests per month, which is enough for most small-to-mid apps. Beyond that, pricing is usage-based. Unlogged passthrough is uncapped but unmonitored — most teams log everything.
Does it support Anthropic Claude?
Yes. Supported providers in 2026 include OpenAI, Anthropic, Google Gemini, Groq, Mistral, Workers AI, Cohere, HuggingFace, Replicate, Perplexity, Azure OpenAI, AWS Bedrock, and Vertex AI. Each sits under its own path segment of the gateway URL.
How does semantic caching work?
Instead of exact-match caching, semantic cache embeds the incoming prompt and matches against embeddings of recent prompts. When a close enough match is found (configurable threshold), the cached response is returned. Typical hit rates: 20-40% on high-repetition workloads. Embedding cost is small relative to skipped LLM calls.
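The lookup described above can be illustrated with a toy in-memory version — this is not Cloudflare's implementation, just the embed-then-nearest-neighbour idea, with a deliberately crude bag-of-letters "embedding" so the demo runs standalone:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class SemanticCache:
    """Toy semantic cache: embed the prompt, match against stored entries."""
    def __init__(self, embed, threshold=0.95):
        self.embed = embed
        self.threshold = threshold  # configurable similarity cutoff
        self.entries = []           # list of (embedding, cached_response)

    def get(self, prompt):
        v = self.embed(prompt)
        best = max(self.entries, key=lambda e: cosine(v, e[0]), default=None)
        if best and cosine(v, best[0]) >= self.threshold:
            return best[1]          # close enough: serve the cached response
        return None                 # miss: caller falls through to the LLM

    def put(self, prompt, response):
        self.entries.append((self.embed(prompt), response))

# Crude stand-in embedding: letter frequencies (a real system uses a model).
def toy_embed(text):
    v = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            v[ord(ch) - 97] += 1.0
    return v

cache = SemanticCache(toy_embed)
cache.put("what is an ai gateway", "An AI gateway proxies LLM traffic.")
print(cache.get("what is an ai gateway?"))  # near-duplicate → cache hit
```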
Is this a full observability platform?
No — it’s a gateway with basic observability. For deeper tracing (tool calls, chains, spans), pair Cloudflare AI Gateway with Langfuse or Helicone. Cloudflare handles ingress and caching; Langfuse handles structured traces and evals.
Can I run a self-hosted version?
No. Cloudflare AI Gateway is a managed product. For self-hosted alternatives, look at LiteLLM Proxy or Kong AI Gateway. Many teams run both — Cloudflare at the edge for global caching, LiteLLM for internal routing policies.
Does it work with the Vercel AI SDK or LangChain?
Yes. Both libraries accept a custom baseURL for OpenAI-compatible providers. Point them at your Cloudflare gateway URL and the rest works unchanged.