# Cerebras — Fastest LLM Inference for AI Agents

> Ultra-fast LLM inference at 2,000+ tokens/second. Cerebras provides the fastest cloud inference for Llama and Qwen models with an OpenAI-compatible API for instant AI responses.

## Install

```bash
pip install cerebras-cloud-sdk
```

## Quick Use

```python
from cerebras.cloud.sdk import Cerebras

client = Cerebras(api_key="...")

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
)
print(response.choices[0].message.content)
```

Or use the OpenAI SDK:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",
    api_key="...",
)

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Hello"}],
)
```

## What is Cerebras Inference?

Cerebras provides the fastest cloud LLM inference available — 2,000+ tokens per second for Llama 3.3 70B, roughly 10x faster than traditional GPU inference. Built on Cerebras' custom Wafer-Scale Engine (WSE) chips, it delivers near-instant responses. The OpenAI-compatible API means you can swap in Cerebras as a drop-in replacement for any OpenAI-based application.

**Answer-Ready**: Cerebras is the fastest cloud LLM inference — 2,000+ tok/s for Llama 3.3 70B (10x faster than GPU). Custom wafer-scale chips. OpenAI-compatible API for drop-in replacement. Supports Llama 3.3, Qwen 2.5, and DeepSeek. Free tier available.

**Best for**: Applications needing ultra-low-latency AI responses.

**Works with**: Any OpenAI-compatible tool, Claude Code (via Bifrost), LangChain.

**Setup time**: Under 2 minutes.
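Since the API is OpenAI-compatible, switching providers can be reduced to a config lookup. A minimal sketch of that idea — the Cerebras endpoint and model id come from this document, while the `PROVIDERS` table, helper name, and environment-variable convention are illustrative assumptions, not official SDK features:

```python
import os

# Map a provider name to the base_url/model pair used with the OpenAI SDK.
# Endpoint and model id for Cerebras are from this document; the env var
# naming scheme (CEREBRAS_API_KEY, OPENAI_API_KEY) is an assumption.
PROVIDERS = {
    "cerebras": {"base_url": "https://api.cerebras.ai/v1", "model": "llama-3.3-70b"},
    "openai": {"base_url": "https://api.openai.com/v1", "model": "gpt-4o"},
}

def client_config(provider: str) -> dict:
    """Return kwargs for OpenAI(...) for the chosen provider."""
    cfg = PROVIDERS[provider]
    return {
        "base_url": cfg["base_url"],
        "api_key": os.environ.get(f"{provider.upper()}_API_KEY", "..."),
    }
```

Passing `**client_config("cerebras")` to `OpenAI(...)` then routes the same application code to Cerebras.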
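To see what the throughput numbers below mean in wall-clock terms, here is a back-of-envelope conversion from tokens/second to completion time. It ignores network overhead, and time-to-first-token is an optional input rather than a measured value:

```python
def completion_seconds(tokens: int, tok_per_s: float, ttft_s: float = 0.0) -> float:
    """Wall-clock time for a completion: optional time-to-first-token
    plus generation time at the given throughput."""
    return ttft_s + tokens / tok_per_s

# A 500-token answer at the throughputs listed in the Speed Comparison table
for name, speed in [("Cerebras", 2100), ("Groq", 750), ("AWS Bedrock", 200)]:
    print(f"{name}: {completion_seconds(500, speed):.2f}s")
# → Cerebras: 0.24s, Groq: 0.67s, AWS Bedrock: 2.50s
```

At 2,100 tok/s a full 500-token answer generates in under a quarter of a second, which is why the "instant response" framing holds for agent loops that chain many calls.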
## Speed Comparison

| Provider | Llama 3.3 70B Speed | Relative |
|----------|---------------------|----------|
| Cerebras | 2,100 tok/s | 10x |
| Groq | 750 tok/s | 3.5x |
| Together AI | 400 tok/s | 2x |
| AWS Bedrock | 200 tok/s | 1x |
| OpenAI (GPT-4o) | 150 tok/s | 0.7x |

## Supported Models

| Model | Context | Speed |
|-------|---------|-------|
| Llama 3.3 70B | 8K | 2,100 tok/s |
| Llama 3.1 8B | 8K | 4,500 tok/s |
| Qwen 2.5 32B | 8K | 2,800 tok/s |
| DeepSeek R1 | 8K | 1,800 tok/s |

## Features

### 1. OpenAI-Compatible API

```python
# Drop-in replacement — just change base_url
client = OpenAI(base_url="https://api.cerebras.ai/v1", api_key="...")
```

### 2. Streaming

```python
stream = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Write a story"}],
    stream=True,
)
for chunk in stream:
    # delta.content can be None on some chunks (e.g. the final one)
    print(chunk.choices[0].delta.content or "", end="")
```

### 3. Tool Calling

```python
response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "What's the weather?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
            },
        },
    }],
)
```

## Pricing

| Tier | Requests | Price |
|------|----------|-------|
| Free | 30 req/min | $0 |
| Developer | Higher limits | Pay-as-you-go |
| Enterprise | Custom | Custom |

## FAQ

**Q: Why is it so fast?**
A: Cerebras uses custom wafer-scale chips (WSE-3) — a single chip far larger than a GPU that eliminates memory-bandwidth bottlenecks.

**Q: Can I use it with Claude Code?**
A: Not directly (Claude Code uses Claude). Use the Bifrost CLI to route Haiku-tier requests to Cerebras for speed.

**Q: How does quality compare?**
A: Same models, same quality. Cerebras runs the exact same Llama/Qwen weights — only inference speed differs.

## Source & Thanks

> Created by [Cerebras](https://cerebras.ai).
>
> [cerebras.ai/inference](https://cerebras.ai/inference) — Fastest LLM inference

---

Source: https://tokrepo.com/en/workflows/56284393-14c2-4bc1-9bd8-fee4b8ff3634
Author: Agent Toolkit