Skills · Apr 6, 2026 · 2 min read

Cloudflare AI Gateway — LLM Proxy, Cache & Analytics

Free proxy gateway for LLM API calls with caching, rate limiting, cost tracking, and fallback routing across providers. Reduce costs up to 95% with response caching. 7,000+ stars.

Cloudflare · Community

Agent-ready installation

This asset can be installed after you choose the runtime, review the plan, and run the corresponding command.

Native · 98/100 · Policy: allow

Agent surface: Any MCP/CLI agent
Type: Skill
Installation: Single
Trust: Community

Direct install command

npx -y tokrepo@latest install b1962c77-9ecf-4a84-87b1-e7d4b677dabe --target codex

Run this after confirming the plan with a dry run.

TL;DR

Free proxy gateway for LLM API calls that adds caching, rate limiting, cost tracking, and fallback routing across providers.

§01

What it is

Cloudflare AI Gateway is a free proxy gateway for LLM API calls. It sits between your application and AI providers (OpenAI, Anthropic, Google, etc.), adding response caching, rate limiting, cost tracking, fallback routing, and analytics. You change one URL in your code and get visibility and control over all your AI API usage.

It is designed for engineering teams running production AI applications who need cost control, observability, and resilience without building custom proxy infrastructure.

§02

How it saves time or tokens

The token estimate for this workflow is 2,600 tokens. Response caching can reduce costs by up to 95% for repeated queries. Rate limiting prevents budget overruns. Fallback routing automatically switches to a backup provider if your primary one goes down, improving reliability without code changes.
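To make the caching claim concrete, here is a rough back-of-the-envelope sketch. It assumes a cache hit never reaches the provider (so it costs nothing there) and that every request costs about the same. The function name and the sample figures are illustrative, not measurements.

```python
def estimated_provider_cost(requests: int, cost_per_request: float, cache_hit_rate: float) -> float:
    """Estimate provider spend when a fraction of requests is served from cache.

    Assumption: a cache hit is never forwarded to the provider, so it is not billed.
    """
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    billable_requests = requests * (1.0 - cache_hit_rate)
    return billable_requests * cost_per_request


# Illustrative numbers only: 10,000 requests at $0.01 each.
baseline = estimated_provider_cost(10_000, 0.01, cache_hit_rate=0.0)
with_cache = estimated_provider_cost(10_000, 0.01, cache_hit_rate=0.95)
savings = 1.0 - with_cache / baseline
print(f"baseline=${baseline:.2f} with_cache=${with_cache:.2f} savings={savings:.0%}")
```

Under these assumptions the savings equal the cache hit rate, so the 95% figure only applies when nearly all traffic consists of repeated, identical requests.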

§03

How to use

1. Sign up at dash.cloudflare.com (free tier available).
2. Navigate to AI > AI Gateway > Create Gateway.
3. Replace your API base URL with the Cloudflare gateway URL.
4. Keep your existing code and API keys; they continue to work unchanged.
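The URL swap in the steps above can be sketched as a small helper. It follows the URL pattern shown in this article's example for the `openai` provider; the helper name `gateway_base_url` and the sample account ID and gateway name are hypothetical, and path segments for other providers should be checked against the Cloudflare docs.

```python
from urllib.parse import quote

GATEWAY_HOST = "https://gateway.ai.cloudflare.com/v1"


def gateway_base_url(account_id: str, gateway_name: str, provider: str) -> str:
    """Build a provider-specific gateway base URL.

    Pattern taken from this article's example:
    https://gateway.ai.cloudflare.com/v1/{account_id}/{gateway_name}/{provider}
    """
    parts = [account_id, gateway_name, provider]
    if not all(parts):
        raise ValueError("account_id, gateway_name, and provider are all required")
    # Percent-encode each segment so stray characters cannot alter the path.
    return "/".join([GATEWAY_HOST, *(quote(part, safe="") for part in parts)])


# Hypothetical placeholder values; substitute your own from the dashboard.
print(gateway_base_url("abc123", "my-gateway", "openai"))
```

Keeping the URL construction in one place makes it easy to switch back to the provider's direct URL if the gateway ever needs to be bypassed.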

§04

Example

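import os

from openai import OpenAI

# Before: direct to OpenAI
# client = OpenAI(base_url='https://api.openai.com/v1')

# After: route through Cloudflare AI Gateway.
# Replace these placeholders with your Cloudflare account ID and gateway name.
account_id = 'YOUR_ACCOUNT_ID'
gateway_name = 'YOUR_GATEWAY_NAME'

client = OpenAI(
    api_key=os.environ['OPENAI_API_KEY'],  # the same OpenAI key you already use
    base_url=f'https://gateway.ai.cloudflare.com/v1/{account_id}/{gateway_name}/openai',
)
# Same API key, same code -- now with caching, logging, and analytics

response = client.chat.completions.create(
    model='gpt-4o',
    messages=[{'role': 'user', 'content': 'Explain caching'}],
)
print(response.choices[0].message.content)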

# The same request with curl, sent through the gateway's OpenAI provider endpoint
curl https://gateway.ai.cloudflare.com/v1/{account_id}/{gateway_name}/openai/chat/completions \
  -H 'Authorization: Bearer sk-...' \
  -H 'Content-Type: application/json' \
  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "Hello"}]}'
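Fallback routing is usually expressed as an ordered list of providers sent in a single request. As a hedged sketch, the snippet below builds such a payload in the shape Cloudflare's documentation describes for the gateway-level endpoint: a JSON array of steps with `provider`, `endpoint`, `headers`, and `query`. Treat the field names, provider slugs, and endpoint paths as assumptions to verify against the current docs. `fallback_step` is a hypothetical helper, the keys and the fallback model name are placeholders, and nothing is sent over the network.

```python
import json


def fallback_step(provider: str, endpoint: str, headers: dict, query: dict) -> dict:
    """One entry in the ordered fallback list (hypothetical helper)."""
    return {"provider": provider, "endpoint": endpoint, "headers": headers, "query": query}


messages = [{"role": "user", "content": "Hello"}]

# Steps are listed in priority order: the primary provider first, the backup second.
steps = [
    fallback_step(
        "openai",
        "chat/completions",
        {"Authorization": "Bearer OPENAI_KEY_PLACEHOLDER", "Content-Type": "application/json"},
        {"model": "gpt-4o", "messages": messages},
    ),
    fallback_step(
        "anthropic",
        "v1/messages",
        {
            "x-api-key": "ANTHROPIC_KEY_PLACEHOLDER",
            "anthropic-version": "2023-06-01",
            "Content-Type": "application/json",
        },
        {"model": "ANTHROPIC_MODEL_PLACEHOLDER", "max_tokens": 256, "messages": messages},
    ),
]

# Assumed target: POST this body to
# https://gateway.ai.cloudflare.com/v1/{account_id}/{gateway_name}
body = json.dumps(steps, indent=2)
print(body)
```

Each step carries its own credentials and request body because providers differ in authentication headers and payload shape.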

§05

Related on TokRepo

Cloudflare AI Gateway Deep Dive -- Detailed configuration guide
AI Gateway Providers Compared -- Compare all AI gateway options

§06

Common pitfalls

Caching only works for identical requests; small prompt variations generate cache misses
The gateway adds a small latency overhead (typically under 50ms) for the proxy hop
Fallback routing requires configuring multiple providers; it does not auto-discover alternatives
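The first pitfall can be illustrated locally. The sketch below is not Cloudflare's cache-key algorithm, which this page does not describe. It hashes a canonical JSON form of the request to show why a stray space yields a different key, and how normalizing prompts in your application before sending them can raise the hit rate. `normalize_prompt` and `request_fingerprint` are hypothetical names.

```python
import hashlib
import json


def normalize_prompt(text: str) -> str:
    """Collapse runs of whitespace and trim the ends so trivially different prompts match."""
    return " ".join(text.split())


def request_fingerprint(model: str, messages: list) -> str:
    """Hash a canonical JSON form of the request (a stand-in for an exact-match cache key)."""
    canonical = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


raw_a = "Explain caching"
raw_b = "Explain  caching "  # doubled space and trailing space

key_a = request_fingerprint("gpt-4o", [{"role": "user", "content": raw_a}])
key_b = request_fingerprint("gpt-4o", [{"role": "user", "content": raw_b}])
print(key_a == key_b)  # False: exact-match keys differ

norm_a = request_fingerprint("gpt-4o", [{"role": "user", "content": normalize_prompt(raw_a)}])
norm_b = request_fingerprint("gpt-4o", [{"role": "user", "content": normalize_prompt(raw_b)}])
print(norm_a == norm_b)  # True: normalized prompts produce the same key
```

Whitespace normalization is only safe for prompts where spacing carries no meaning; prompts that embed code or preformatted text should be left as they are.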

Frequently asked questions

Is Cloudflare AI Gateway free?

Yes. The core gateway features (caching, rate limiting, analytics) are available on the free tier. Advanced features and higher request volumes may require a paid Cloudflare plan.

Which AI providers does it support?

Cloudflare AI Gateway supports OpenAI, Anthropic, Google (Gemini), Azure OpenAI, Hugging Face, Replicate, and other major providers. The universal endpoint format works with any OpenAI-compatible API.

How does response caching work?

When a request matches a previous identical request (same model, messages, and parameters), the gateway returns the cached response without calling the provider. You configure cache TTL and can invalidate the cache manually.
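Per-request cache behavior is typically controlled with request headers. The sketch below assumes the header names `cf-aig-cache-ttl` (TTL in seconds) and `cf-aig-skip-cache`, based on Cloudflare's caching documentation; confirm both against the current docs before relying on them. The code only constructs a `urllib.request.Request` and does not send it. The function name, account ID, gateway name, and key are placeholders.

```python
import json
import urllib.request
from typing import Optional


def build_cached_request(
    url: str,
    api_key: str,
    payload: dict,
    ttl_seconds: Optional[int] = None,
    skip_cache: bool = False,
) -> urllib.request.Request:
    """Build (but do not send) a POST request with optional cache-control headers."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    if ttl_seconds is not None:
        headers["cf-aig-cache-ttl"] = str(ttl_seconds)  # assumed header name
    if skip_cache:
        headers["cf-aig-skip-cache"] = "true"  # assumed header name
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(url, data=data, headers=headers, method="POST")


request = build_cached_request(
    "https://gateway.ai.cloudflare.com/v1/ACCOUNT_ID/GATEWAY_NAME/openai/chat/completions",
    api_key="sk-placeholder",
    payload={"model": "gpt-4o", "messages": [{"role": "user", "content": "Explain caching"}]},
    ttl_seconds=3600,
)
# urllib stores header names in capitalized form, e.g. "Cf-aig-cache-ttl".
print(sorted(name for name, _ in request.header_items()))
```

Setting the skip flag on a single request is a lightweight way to force a fresh answer without changing the gateway-wide cache settings.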

Can I use it with streaming responses?

Yes. Cloudflare AI Gateway supports streaming (SSE) responses. Cached streaming responses are replayed as if they were live, maintaining the same chunked delivery format.
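For readers unfamiliar with SSE, the following sketch shows how a streamed chat completion is typically reassembled on the client. It assumes the OpenAI-style stream format (`data: {json}` lines, ending with `data: [DONE]`) and uses a hard-coded sample instead of a live connection. `collect_stream_text` is a hypothetical helper.

```python
import json
from typing import Iterable


def collect_stream_text(lines: Iterable[str]) -> str:
    """Join the content deltas from OpenAI-style SSE lines into one string."""
    pieces = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            pieces.append(delta.get("content") or "")
    return "".join(pieces)


sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "",
    "data: [DONE]",
]
print(collect_stream_text(sample))  # Hello
```

If cached responses are replayed in the same chunked format, as described above, the same client-side parsing should apply to both cache hits and live responses.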

Does it work with self-hosted models?

The gateway is designed for cloud AI providers. For self-hosted models, you would need to expose them behind an OpenAI-compatible API and configure a custom provider endpoint in the gateway.

Source and acknowledgments

Created by Cloudflare. Licensed under Apache 2.0.

ai-gateway — ⭐ 7,000+

Thanks to Cloudflare for making LLM cost control accessible to every developer.
