How do I install GroqCloud Quickstart — 250 tokens/sec OpenAI-Compat API?

Visit the asset page on TokRepo and click "Copy for agent" to get the installation instructions. Most assets can be installed with a single command.

GroqCloud Quickstart — 250 tokens/sec OpenAI-Compat API

Name: GroqCloud Quickstart — 250 tokens/sec OpenAI-Compat API
Author: Groq

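Introduction

GroqCloud runs open-weight models (Llama 3.3 70B, Llama 3.1 8B/70B, Mixtral 8×7B, Gemma 2, Whisper) on Groq's in-house LPU chips: 250+ tokens/sec on Llama 3.3 70B, with time to first token (TTFT) under 200 ms. The API is OpenAI-compatible: change the base URL to api.groq.com/openai/v1 and you are done. It suits streaming chat agents where typing speed matters, voice agents (Whisper STT under 200 ms), and real-time tools where slow inference ruins the UX. It works with openai-python, openai-node, LangChain, LlamaIndex, and the Vercel AI SDK. Setup takes about 2 minutes.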

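Streaming chat completion

import os

from openai import OpenAI

# Point the standard OpenAI client at Groq's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
)

stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Explain how LPU inference differs from GPU inference"}],
    stream=True,
)
# Print each text delta as it arrives; some chunks carry no content.
for chunk in stream:
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="", flush=True)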
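The TTFT and tokens/sec figures quoted for GroqCloud can be checked against your own traffic. Below is a minimal sketch, not an official benchmark: `measure_stream` is a hypothetical helper that times any iterable of text pieces. It counts streamed pieces rather than exact tokens, so treat the rate as an approximation.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(
    pieces: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[str, float, float]:
    """Consume text pieces; return (text, ttft_seconds, pieces_per_second)."""
    start = clock()
    first = None
    last = start
    count = 0
    parts = []
    for piece in pieces:
        if not piece:
            continue  # skip empty deltas (role-only or final chunks)
        now = clock()
        if first is None:
            first = now  # arrival time of the first visible text
        last = now
        count += 1
        parts.append(piece)
    if first is None:
        return "", float("nan"), 0.0
    decode_time = last - first
    # Rate over the steady-state phase, after the first piece has arrived.
    rate = (count - 1) / decode_time if decode_time > 0 else 0.0
    return "".join(parts), first - start, rate


# Usage with the client from the example above (assumption: wrap the call in
# a generator so the request is only sent once timing has started):
#
# def groq_pieces():
#     stream = client.chat.completions.create(
#         model="llama-3.3-70b-versatile",
#         messages=[{"role": "user", "content": "Hello"}],
#         stream=True,
#     )
#     for chunk in stream:
#         if chunk.choices:
#             yield chunk.choices[0].delta.content or ""
#
# text, ttft, rate = measure_stream(groq_pieces())
```

Injecting the clock keeps the helper testable without network access; in normal use the default `time.perf_counter` is enough.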

Function calling

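# Reuses the `client` created in the streaming example above.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
)
# The model answers with a list of requested tool calls (or None).
print(resp.choices[0].message.tool_calls)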
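The example above stops at printing the requested tool calls. A minimal sketch of the next step, running the requested function locally and building the `role: tool` messages to send back, could look like this. `get_weather`, `REGISTRY`, and `run_tool_calls` are hypothetical names, and the weather data is a hard-coded stand-in.

```python
import json
from typing import Any, Callable, Dict, Iterable, List, Optional


def get_weather(city: str) -> Dict[str, Any]:
    """Hypothetical stand-in for a real weather lookup."""
    return {"city": city, "temp_c": 21, "condition": "clear"}


# Maps tool names (as declared in `tools`) to local Python callables.
REGISTRY: Dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def run_tool_calls(
    tool_calls: Optional[Iterable[Any]],
    registry: Dict[str, Callable[..., Any]] = REGISTRY,
) -> List[Dict[str, str]]:
    """Execute each requested call and build the matching tool messages."""
    messages = []
    for call in tool_calls or []:
        fn = registry.get(call.function.name)
        if fn is None:
            result = {"error": f"unknown tool: {call.function.name}"}
        else:
            try:
                # Arguments arrive as a JSON string produced by the model.
                args = json.loads(call.function.arguments or "{}")
                result = fn(**args)
            except (json.JSONDecodeError, TypeError) as exc:
                result = {"error": str(exc)}
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result, ensure_ascii=False),
        })
    return messages
```

To finish the round trip, append `resp.choices[0].message` and the returned tool messages to the original `messages` list and call `client.chat.completions.create` again; the model then writes its final answer from the tool output. Errors are returned to the model as JSON rather than raised, so a malformed argument string does not crash the agent loop.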

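Production model lineup

Model	Speed (tok/sec)	Context	Best for
`llama-3.3-70b-versatile`	~280	131K	Default: good quality, fast
`llama-3.1-8b-instant`	~750	131K	Cheap, ultra-fast classification
`mixtral-8x7b-32768`	~500	32K	Multilingual, code-heavy tasks
`whisper-large-v3`	~166× real time	N/A	Audio transcription
`whisper-large-v3-turbo`	~216× real time	N/A	Faster transcription, slight accuracy trade-off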
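The lineup lists two Whisper models, but the page shows no transcription call. Here is a minimal sketch, assuming Groq's OpenAI-compatible endpoint also serves the SDK's `audio.transcriptions.create` route and that `client` is an OpenAI client pointed at api.groq.com/openai/v1. `transcribe` and `estimated_seconds` are hypothetical helper names.

```python
def transcribe(client, path: str, model: str = "whisper-large-v3-turbo") -> str:
    """Send one audio file to the transcription endpoint and return its text."""
    with open(path, "rb") as audio:
        result = client.audio.transcriptions.create(model=model, file=audio)
    return result.text


def estimated_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Rough processing time implied by a 'N x real time' speed figure."""
    return audio_seconds / realtime_factor


# Usage (assumption: `client` is built as in the streaming example and
# "meeting.wav" is a local file you supply):
# print(transcribe(client, "meeting.wav"))
```

Using the table's figures, one hour of audio at ~216× real time works out to roughly 17 seconds of processing (3600 / 216), and about 22 seconds at ~166×, excluding upload time.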

Pricing (per million tokens, May 2026)

llama-3.3-70b: input $0.59 / output $0.79
llama-3.1-8b: input $0.05 / output $0.08
whisper-large-v3: $0.111 per hour of audio
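The per-million-token prices above translate into per-request costs with simple arithmetic. The sketch below hard-codes the listed figures; `Price`, `PRICES`, `chat_cost`, and `whisper_cost` are hypothetical names, and the numbers should be refreshed from Groq's current price list before being relied on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_m: float   # USD per million input tokens
    output_per_m: float  # USD per million output tokens


# Figures copied from the price list above (May 2026).
PRICES = {
    "llama-3.3-70b": Price(0.59, 0.79),
    "llama-3.1-8b": Price(0.05, 0.08),
}
WHISPER_LARGE_V3_PER_HOUR = 0.111  # USD per hour of audio


def chat_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one chat request at the listed rates."""
    p = PRICES[model]
    return (input_tokens * p.input_per_m + output_tokens * p.output_per_m) / 1_000_000


def whisper_cost(audio_seconds: float) -> float:
    """USD cost of transcribing the given audio duration."""
    return audio_seconds / 3600 * WHISPER_LARGE_V3_PER_HOUR
```

For example, a request with 2,000 input tokens and 500 output tokens on llama-3.3-70b costs (2000 × 0.59 + 500 × 0.79) / 1,000,000, about $0.0016.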

FAQ

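Q: Why is Groq so much faster than GPU inference? A: The LPU (Language Processing Unit) chip is designed specifically for transformer inference: sequential token decoding runs at the memory-bandwidth limit, without the trade-offs of GPU batching. The result is that both TTFT and steady-state throughput are 5-10× faster on the same model.

Q: Are there free-tier limits? A: Yes, and they are enough for development and testing: roughly 30 requests per minute and roughly 14,400 requests per day, per model. Paid tiers for production traffic have much higher limits. Check console.groq.com for current numbers.

Q: Can Groq run my fine-tune? A: Not currently; it serves only the model catalog Groq publishes. For a custom fine-tune at Groq-like speed, the options are: (1) prompt engineering on Llama 3.3 70B; (2) deploying to Together AI / Fireworks (comparable speed, with LoRA support). Groq has hinted at fine-tuning support but has not published a timeline.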
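For the free-tier limits quoted in the FAQ (roughly 30 requests per minute per model), a client-side limiter avoids tripping the server-side cap during development. This is a minimal single-threaded sketch; `SlidingWindowLimiter` is a hypothetical helper, not part of any SDK, and the defaults mirror the approximate figures above.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` within any rolling `period` seconds."""

    def __init__(
        self,
        max_calls: int = 30,
        period: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._max_calls = max_calls
        self._period = period
        self._clock = clock
        self._sleep = sleep
        self._calls: Deque[float] = deque()  # timestamps of recent calls

    def acquire(self) -> float:
        """Block until a call is allowed; return the seconds spent waiting."""
        waited = 0.0
        while True:
            now = self._clock()
            # Forget calls that have left the rolling window.
            while self._calls and now - self._calls[0] >= self._period:
                self._calls.popleft()
            if len(self._calls) < self._max_calls:
                self._calls.append(now)
                return waited
            # Sleep until the oldest call in the window expires.
            delay = self._period - (now - self._calls[0])
            self._sleep(delay)
            waited += delay


# Usage: call limiter.acquire() immediately before each API request.
# limiter = SlidingWindowLimiter()
# limiter.acquire()
# resp = client.chat.completions.create(...)
```

The limiter only covers the per-minute cap. The daily figure of about 14,400 requests averages out to 10 per minute over 24 hours, so sustained traffic at 30 per minute would exhaust it in about 8 hours.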
