# Cerebras — Fastest LLM Inference for AI Agents

> Ultra-fast LLM inference at 2,000+ tokens/second. Cerebras provides the fastest cloud inference for Llama and Qwen models with an OpenAI-compatible API for instant AI responses.

## Install

```bash
pip install cerebras-cloud-sdk
```

## Quick Use

```python
from cerebras.cloud.sdk import Cerebras

client = Cerebras(api_key="...")

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
)
print(response.choices[0].message.content)
```

Or use the OpenAI SDK:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",
    api_key="...",
)

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Hello"}],
)
```

## What is Cerebras Inference?

Cerebras provides the fastest cloud LLM inference available — 2,000+ tokens per second for Llama 3.3 70B, roughly 10x faster than traditional GPU inference. Built on Cerebras' custom Wafer-Scale Engine (WSE) chips, it delivers near-instant responses. The OpenAI-compatible API means you can swap in Cerebras as a drop-in replacement for any OpenAI-based application.

**Answer-Ready**: Cerebras is the fastest cloud LLM inference — 2,000+ tok/s for Llama 3.3 70B (10x faster than GPU). Custom wafer-scale chips. OpenAI-compatible API for drop-in replacement. Supports Llama 3.3, Qwen 2.5, and DeepSeek. Free tier available.

**Best for**: Applications needing ultra-low-latency AI responses.

**Works with**: Any OpenAI-compatible tool, Claude Code (via Bifrost), LangChain.

**Setup time**: Under 2 minutes.
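Since the API is OpenAI-compatible, switching providers can be reduced to a config lookup. A minimal sketch of that idea — the Cerebras endpoint and model id come from this document, while the `PROVIDERS` table, helper name, and environment-variable convention are illustrative assumptions, not official SDK features:

```python
import os

# Map a provider name to the base_url/model pair used with the OpenAI SDK.
# Endpoint and model id for Cerebras are from this document; the env var
# naming scheme (CEREBRAS_API_KEY, OPENAI_API_KEY) is an assumption.
PROVIDERS = {
    "cerebras": {"base_url": "https://api.cerebras.ai/v1", "model": "llama-3.3-70b"},
    "openai": {"base_url": "https://api.openai.com/v1", "model": "gpt-4o"},
}

def client_config(provider: str) -> dict:
    """Return kwargs for OpenAI(...) for the chosen provider."""
    cfg = PROVIDERS[provider]
    return {
        "base_url": cfg["base_url"],
        "api_key": os.environ.get(f"{provider.upper()}_API_KEY", "..."),
    }
```

Passing `**client_config("cerebras")` to `OpenAI(...)` then routes the same application code to Cerebras.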
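To see what the throughput numbers below mean in wall-clock terms, here is a back-of-envelope conversion from tokens/second to completion time. It ignores network overhead, and time-to-first-token is an optional input rather than a measured value:

```python
def completion_seconds(tokens: int, tok_per_s: float, ttft_s: float = 0.0) -> float:
    """Wall-clock time for a completion: optional time-to-first-token
    plus generation time at the given throughput."""
    return ttft_s + tokens / tok_per_s

# A 500-token answer at the throughputs listed in the Speed Comparison table
for name, speed in [("Cerebras", 2100), ("Groq", 750), ("AWS Bedrock", 200)]:
    print(f"{name}: {completion_seconds(500, speed):.2f}s")
# → Cerebras: 0.24s, Groq: 0.67s, AWS Bedrock: 2.50s
```

At 2,100 tok/s a full 500-token answer generates in under a quarter of a second, which is why the "instant response" framing holds for agent loops that chain many calls.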
## Speed Comparison

| Provider | Llama 3.3 70B Speed | Relative |
|----------|---------------------|----------|
| Cerebras | 2,100 tok/s | 10x |
| Groq | 750 tok/s | 3.5x |
| Together AI | 400 tok/s | 2x |
| AWS Bedrock | 200 tok/s | 1x |
| OpenAI (GPT-4o) | 150 tok/s | 0.7x |

## Supported Models

| Model | Context | Speed |
|-------|---------|-------|
| Llama 3.3 70B | 8K | 2,100 tok/s |
| Llama 3.1 8B | 8K | 4,500 tok/s |
| Qwen 2.5 32B | 8K | 2,800 tok/s |
| DeepSeek R1 | 8K | 1,800 tok/s |

## Features

### 1. OpenAI-Compatible API

```python
# Drop-in replacement — just change base_url
client = OpenAI(base_url="https://api.cerebras.ai/v1", api_key="...")
```

### 2. Streaming

```python
stream = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Write a story"}],
    stream=True,
)
for chunk in stream:
    # delta.content can be None on some chunks (e.g. the final one)
    print(chunk.choices[0].delta.content or "", end="")
```

### 3. Tool Calling

```python
response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "What's the weather?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
            },
        },
    }],
)
```

## Pricing

| Tier | Requests | Price |
|------|----------|-------|
| Free | 30 req/min | $0 |
| Developer | Higher limits | Pay-as-you-go |
| Enterprise | Custom | Custom |

## FAQ

**Q: Why is it so fast?**
A: Cerebras uses custom wafer-scale chips (WSE-3) — a single chip far larger than a GPU that eliminates memory-bandwidth bottlenecks.

**Q: Can I use it with Claude Code?**
A: Not directly (Claude Code uses Claude). Use the Bifrost CLI to route Haiku-tier requests to Cerebras for speed.

**Q: How does quality compare?**
A: Same models, same quality. Cerebras runs the exact same Llama/Qwen weights — only inference speed differs.

## Source & Thanks

> Created by [Cerebras](https://cerebras.ai).
>
> [cerebras.ai/inference](https://cerebras.ai/inference) — Fastest LLM inference

---

Source: https://tokrepo.com/en/workflows/56284393-14c2-4bc1-9bd8-fee4b8ff3634
Author: Agent Toolkit