# GroqCloud Quickstart — 250 tokens/sec OpenAI-Compat API

> GroqCloud runs Llama 3.3 70B at 250+ tok/sec on LPU silicon. OpenAI-compatible API. Free tier, sub-second TTFT, ideal for streaming.

## Install

Install the OpenAI Python SDK, which the examples below use: `pip install openai`.

## Quick Use

1. Sign up at console.groq.com (free) and create an API key
2. Point the OpenAI client at Groq: `OpenAI(base_url='https://api.groq.com/openai/v1', api_key=GROQ_API_KEY)`
3. Set `model='llama-3.3-70b-versatile'` in your requests

---

## Intro

GroqCloud serves open-weight models (Llama 3.3 70B, Llama 3.1 8B/70B, Mixtral 8×7B, Gemma 2, Whisper) on Groq's custom LPU silicon, reaching 250+ tokens/sec on Llama 3.3 70B with a time-to-first-token under 200 ms. The API is OpenAI-compatible: point an existing OpenAI client at `https://api.groq.com/openai/v1`, supply a Groq API key, and pick a Groq model name.

- Best for: streaming chat agents where output speed matters, voice agents (Whisper STT under 200 ms), and real-time tools where slow inference hurts UX.
- Works with: openai-python, openai-node, LangChain, LlamaIndex, Vercel AI SDK.
- Setup time: about 2 minutes.
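To make "OpenAI-compatible" concrete, here is a minimal sketch of the request an OpenAI client would send, built with the standard library only. The helper name `build_chat_request` and the `User-Agent` value are illustrative assumptions, not part of any SDK; the route and bearer-token header follow the OpenAI chat completions convention.

```python
import json
import urllib.request

GROQ_BASE_URL = "https://api.groq.com/openai/v1"


def build_chat_request(prompt: str, api_key: str,
                       model: str = "llama-3.3-70b-versatile") -> urllib.request.Request:
    """Build (but do not send) a POST to the OpenAI-style chat completions route."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{GROQ_BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
            # Assumed value: an explicit user agent, since some gateways
            # reject urllib's default one.
            "User-Agent": "groq-quickstart-sketch/0.1",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the assistant's reply text (needs network access)."""
    with urllib.request.urlopen(request) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

With a real key, `send(build_chat_request("Say hello", api_key))` should return the reply text. In practice the SDK examples below are the more convenient route; this sketch only shows the shape of the request on the wire.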

---

### Streaming chat completion

```python
import os

from openai import OpenAI

# Point the standard OpenAI client at Groq's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
)

stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Explain how an LPU differs from a GPU for inference"}],
    stream=True,
)
# Print each text delta as it arrives; some chunks carry no content.
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```
```
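If you need the full reply as one string (for logging or post-processing) rather than printing deltas as they arrive, the chunks can be accumulated. The sketch below assumes chunks shaped like the OpenAI SDK's streaming objects; `collect_stream` and `fake_chunk` are hypothetical helpers, and the fake chunks stand in for a real stream so the example runs offline.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(stream: Iterable) -> str:
    """Concatenate the text deltas of a streamed chat completion."""
    parts = []
    for chunk in stream:
        if not chunk.choices:  # tolerate chunks that carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:  # skip chunks whose content is None or empty
            parts.append(delta)
    return "".join(parts)


def fake_chunk(text):
    """Hypothetical stand-in for one streamed chunk, for offline testing."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


demo = [fake_chunk("Hel"), fake_chunk(None), fake_chunk("lo"), SimpleNamespace(choices=[])]
print(collect_stream(demo))  # → Hello
```

Passing the `stream` object from the example above to `collect_stream` should work the same way, as long as the stream has not already been consumed.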

### Function calling

```python
# Reuses `client` from the streaming example above.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
)
# `tool_calls` is None when the model answers directly instead of calling a tool.
print(resp.choices[0].message.tool_calls)
```
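The example above only prints the tool calls the model requests. Completing the loop means running each tool locally and sending the results back as `tool` messages. The following is a sketch under assumptions: `get_weather` is a hypothetical local implementation, `TOOL_REGISTRY` and `run_tool_calls` are illustrative names, and the fake call object mimics the SDK's tool-call shape so the code runs offline.

```python
import json
from types import SimpleNamespace


def get_weather(city: str) -> str:
    """Hypothetical local tool; a real one would query a weather service."""
    return f"Sunny in {city}"


# Maps the tool names declared in `tools` to local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls) -> list:
    """Execute each requested tool and build the `tool` messages to send back."""
    messages = []
    for call in tool_calls or []:  # `tool_calls` may be None
        func = TOOL_REGISTRY[call.function.name]
        args = json.loads(call.function.arguments)  # arguments arrive as a JSON string
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": func(**args),
        })
    return messages


fake_call = SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(name="get_weather", arguments='{"city": "Tokyo"}'),
)
print(run_tool_calls([fake_call]))
```

To get a final natural-language answer, append the assistant message that contained the tool calls, followed by these `tool` messages, to the conversation and call `client.chat.completions.create` again.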

### Production model lineup

| Model | Speed | Context | Best for |
|---|---|---|---|
| `llama-3.3-70b-versatile` | ~280 tok/s | 131K | Default: great quality, fast |
| `llama-3.1-8b-instant` | ~750 tok/s | 131K | Cheap, ultra-fast classification |
| `mixtral-8x7b-32768` | ~500 tok/s | 32K | Multilingual, code-heavy tasks |
| `whisper-large-v3` | ~166× realtime | n/a | Audio transcription |
| `whisper-large-v3-turbo` | ~216× realtime | n/a | Faster transcription, slight accuracy tradeoff |
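The speed column translates directly into wall-clock estimates. This back-of-the-envelope sketch uses the approximate figures from the table; the function names are illustrative, and the results ignore network latency, queueing, and time-to-first-token.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Approximate decode time for a reply of `output_tokens` tokens."""
    return output_tokens / tokens_per_second


def transcription_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Approximate processing time for audio at a given realtime multiple."""
    return audio_seconds / realtime_factor


# A 500-token reply on the 70B model vs. the 8B model.
print(f"{generation_seconds(500, 280):.2f} s")  # → 1.79 s
print(f"{generation_seconds(500, 750):.2f} s")  # → 0.67 s

# One hour of audio on whisper-large-v3 at ~166× realtime.
print(f"{transcription_seconds(3600, 166):.1f} s")  # → 21.7 s
```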

### Pricing (May 2026)

- `llama-3.3-70b-versatile`: $0.59 input / $0.79 output per 1M tokens
- `llama-3.1-8b-instant`: $0.05 input / $0.08 output per 1M tokens
- `whisper-large-v3`: $0.111 per hour of audio
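For budgeting, the listed prices can be turned into a per-request estimate. This is a sketch using only the prices above; the table and function names are illustrative, and actual billing may differ, so check the console for current rates.

```python
# (input, output) USD per 1M tokens, taken from the price list above.
PRICES_PER_MILLION = {
    "llama-3.3-70b-versatile": (0.59, 0.79),
    "llama-3.1-8b-instant": (0.05, 0.08),
}
WHISPER_LARGE_V3_PER_HOUR = 0.111  # USD per hour of audio


def estimate_chat_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one chat request."""
    in_price, out_price = PRICES_PER_MILLION[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def estimate_transcription_cost(audio_seconds: float) -> float:
    """Estimated USD cost of transcribing audio with whisper-large-v3."""
    return audio_seconds / 3600 * WHISPER_LARGE_V3_PER_HOUR


# 2,000 prompt tokens plus a 500-token reply on the 70B model.
print(f"${estimate_chat_cost('llama-3.3-70b-versatile', 2000, 500):.6f}")  # → $0.001575
# A 30-minute recording.
print(f"${estimate_transcription_cost(1800):.4f}")  # → $0.0555
```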

---

### FAQ

**Q: Why is Groq so much faster than GPU inference?**
A: The LPU (Language Processing Unit) is silicon purpose-built for transformer inference. Sequential token decoding runs at a speed limited by memory bandwidth, without the batching tradeoffs of GPU serving. The result is 5-10× faster TTFT and steady-state throughput on the same models.

**Q: What are the free tier limits?**
A: Roughly 30 requests/minute and 14,400 requests/day per model, which is generous for development and testing. Production traffic should use the paid tier, which has much higher limits. Check console.groq.com for current numbers.
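One way to stay under a per-minute limit is to space requests out on the client side. The sketch below is a minimal, single-threaded throttle, assuming the approximate figure of 30 requests/minute; `MinIntervalThrottle` is a hypothetical helper, not part of any SDK, and it does not track the daily quota.

```python
import time


class MinIntervalThrottle:
    """Space calls at least `60 / per_minute` seconds apart (single-threaded)."""

    def __init__(self, per_minute: float, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / per_minute
        self._clock = clock  # injectable so the throttle can be tested without waiting
        self._sleep = sleep
        self._next_ok = None  # earliest time the next call may start

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_ok is not None and now < self._next_ok:
            delay = self._next_ok - now
            self._sleep(delay)
            now = self._clock()
        self._next_ok = now + self.interval
        return delay


throttle = MinIntervalThrottle(per_minute=30)
print(throttle.interval)  # → 2.0
```

Calling `throttle.wait()` immediately before each `client.chat.completions.create(...)` keeps requests at least two seconds apart. Pairing it with a retry on rate-limit errors is a reasonable next step, since the server's limits remain authoritative.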

**Q: Does Groq run my fine-tunes?**
A: Not currently. Groq serves only the models in its published catalog. If you need custom fine-tuned behavior, the options are: (1) approximate it with prompt engineering on Llama 3.3 70B; (2) deploy on Together AI or Fireworks, which support LoRA at similar speeds. Groq has hinted at fine-tune support but has given no public timeline.

---

## Source & Thanks

> Built by [Groq](https://github.com/groq). Docs at [console.groq.com/docs](https://console.groq.com/docs).
>
> [groq/groq-python](https://github.com/groq/groq-python) — official SDK

---
Source: https://tokrepo.com/en/workflows/groqcloud-quickstart-250-tokens-sec-openai-compat-api
Author: Groq