Apr 7, 2026 · 2 min read

Cloudflare AI Workers — Deploy AI Apps at the Edge

Run AI models on Cloudflare's global edge network. Workers AI provides serverless inference for LLMs, embeddings, image generation, and speech-to-text at low latency.

Quick Use

Use it first, then decide how deep to go

Scaffold a project, drop the handler below into src/index.js, and deploy:

npm create cloudflare@latest my-ai-app
cd my-ai-app

// src/index.js — requires an AI binding named "AI" in your wrangler config
export default {
  async fetch(request, env) {
    const response = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
      messages: [{ role: "user", content: "What is Cloudflare?" }],
    });
    return Response.json(response);
  },
};

npx wrangler deploy
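The handler above depends on an AI binding being configured. A minimal wrangler.toml sketch — the project name and compatibility date are placeholders to adjust for your project:

```toml
name = "my-ai-app"
main = "src/index.js"
compatibility_date = "2026-04-07"

# Exposes Workers AI inside the Worker as env.AI
[ai]
binding = "AI"
```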

What is Cloudflare Workers AI?

Workers AI lets you run AI models on Cloudflare's global edge network — 300+ cities worldwide. It provides serverless inference for LLMs, text embeddings, image generation, speech-to-text, and more with no GPU management, automatic scaling, and pay-per-request pricing.

Answer-Ready: Cloudflare Workers AI provides serverless AI inference on a global edge network (300+ cities). Run Llama, Mistral, Stable Diffusion, and Whisper models with no GPU management, auto-scaling, and pay-per-request pricing.

Best for: Developers building AI features who want low-latency, serverless deployment. Works with: Llama 3, Mistral, Stable Diffusion, Whisper, BAAI embeddings. Setup time: Under 5 minutes.

Core Features

1. Pre-Built Model Catalog

Category         | Models
Text Generation  | Llama 3.1 (8B/70B), Mistral 7B, Gemma
Embeddings       | BAAI bge-base, bge-large
Image Generation | Stable Diffusion XL, FLUX.1
Speech-to-Text   | Whisper
Translation      | Meta M2M-100
Classification   | BERT, DistilBERT
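In practice the catalog works as a lookup from task to model ID. A small sketch — the text-generation and embeddings IDs come from the snippets in this article; the Stable Diffusion and Whisper IDs are assumptions, so confirm them against the live catalog (e.g. in the Cloudflare dashboard) before relying on them:

```javascript
// Map each catalog category to a default model ID.
// The first two IDs appear elsewhere in this article; the last two are
// assumed — verify against the current Workers AI catalog.
const DEFAULT_MODELS = {
  "text-generation": "@cf/meta/llama-3.1-8b-instruct",
  "embeddings": "@cf/baai/bge-base-en-v1.5",
  "image-generation": "@cf/stabilityai/stable-diffusion-xl-base-1.0",
  "speech-to-text": "@cf/openai/whisper",
};

function pickModel(category) {
  const model = DEFAULT_MODELS[category];
  if (!model) throw new Error(`No default model for category: ${category}`);
  return model;
}

// In a Worker:
//   const out = await env.AI.run(pickModel("embeddings"), { text: ["hello"] });
```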

2. Vectorize (Built-In Vector DB)

// Vectorize index binding (configured in wrangler.toml, not created here)
const index = env.VECTORIZE_INDEX;

// Generate an embedding — data[0] is the vector for the first input text
const embedding = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
  text: ["document text here"],
});
await index.upsert([{ id: "doc1", values: embedding.data[0], metadata: { title: "..." } }]);

// Query the 5 nearest vectors
const results = await index.query(queryVector, { topK: 5 });
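Query results feed naturally into a retrieval-augmented prompt. A sketch of the glue step — the match shape (score plus the metadata stored at upsert time) follows the snippet above, while the score threshold and prompt wording are assumptions:

```javascript
// Turn Vectorize query matches into a context string for a RAG prompt.
// Each match carries the metadata stored at upsert time (here: title).
function buildContext(matches, minScore = 0.5) {
  return matches
    .filter((m) => m.score >= minScore)
    .map((m) => `- ${m.metadata.title} (score ${m.score.toFixed(2)})`)
    .join("\n");
}

// In a Worker, after `const results = await index.query(queryVector, { topK: 5 })`:
//   const context = buildContext(results.matches);
//   await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
//     messages: [
//       { role: "system", content: `Answer using only this context:\n${context}` },
//       { role: "user", content: question },
//     ],
//   });
```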

3. AI Gateway

Route, cache, and monitor AI API calls:

// Replace {account} with your Cloudflare account ID and my-gateway with your gateway name
const response = await fetch("https://gateway.ai.cloudflare.com/v1/{account}/my-gateway/openai/chat/completions", {
  method: "POST",
  headers: { "Authorization": "Bearer sk-...", "Content-Type": "application/json" },
  body: JSON.stringify({ model: "gpt-4o", messages: [...] }),
});

Features: caching, rate limiting, fallbacks, analytics, logging.
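The gateway URL follows a fixed shape: base, account ID, gateway name, provider, then the provider's own path. A tiny helper — a sketch derived from the URL in the snippet above, with placeholder arguments:

```javascript
// Compose an AI Gateway URL: base + account ID + gateway name + provider + path.
function gatewayUrl(accountId, gateway, provider, path) {
  return `https://gateway.ai.cloudflare.com/v1/${accountId}/${gateway}/${provider}/${path}`;
}

// Usage:
//   fetch(gatewayUrl(ACCOUNT_ID, "my-gateway", "openai", "chat/completions"), { ... })
```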

4. Edge Deployment

Models run on Cloudflare's GPU fleet across 300+ cities:

  • P50 latency: < 50ms for embeddings
  • Auto-scaling: 0 to millions of requests
  • No cold starts for popular models

5. Pay-Per-Request Pricing

Resource          | Free Tier  | Paid
Neurons (compute) | 10,000/day | $0.011 per 1,000
Vectorize queries | 30M/mo     | $0.01 per 1M
Storage           | 5M vectors | $0.05 per 1M
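With the rates above, estimating monthly compute cost is simple arithmetic. A sketch, assuming usage is spread evenly so the 10,000-neuron daily free allowance applies each day:

```javascript
// Estimate monthly neuron cost: daily usage minus the daily free allowance,
// billed at $0.011 per 1,000 neurons (rates from the pricing table above).
function monthlyNeuronCost(neuronsPerDay, days = 30) {
  const FREE_PER_DAY = 10_000;
  const PRICE_PER_1000 = 0.011;
  const billablePerDay = Math.max(0, neuronsPerDay - FREE_PER_DAY);
  return (billablePerDay / 1000) * PRICE_PER_1000 * days;
}

// e.g. 100,000 neurons/day → 90,000 billable/day → ≈ $29.70/month
```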

FAQ

Q: Can I use my own fine-tuned models? A: Yes, via LoRA adapters on supported base models.

Q: How does it compare to AWS Bedrock? A: Workers AI is edge-native (lower latency globally), simpler to use, and cheaper for small-to-medium workloads. Bedrock offers more enterprise models.

Q: Is there a free tier? A: Yes, 10,000 neurons/day free — enough for ~100-200 LLM requests.
