# GroqCloud Quickstart: 250 tokens/sec OpenAI-Compatible API

> GroqCloud runs Llama 3.3 70B at 250+ tok/sec on LPU silicon. OpenAI-compatible API. Free tier, sub-second TTFT, ideal for streaming.

## Quick Use

1. Sign up at console.groq.com (free) and create an API key.
2. Point the OpenAI client at Groq: `OpenAI(base_url='https://api.groq.com/openai/v1', api_key=GROQ_KEY)`
3. Use `model='llama-3.3-70b-versatile'`

---

## Intro

GroqCloud serves open-weight models (Llama 3.3 70B, Llama 3.1 8B/70B, Mixtral 8×7B, Gemma 2, Whisper) on Groq's custom LPU silicon, reaching 250+ tokens/sec on Llama 3.3 70B with sub-200ms time-to-first-token (TTFT). The API is OpenAI-compatible: change the base URL to `https://api.groq.com/openai/v1`, pass a Groq API key, and pick a Groq model ID.

- Best for: streaming chat agents where typing speed matters, voice agents (Whisper STT under 200ms), and real-time tools where slow inference kills UX.
- Works with: openai-python, openai-node, LangChain, LlamaIndex, Vercel AI SDK.
- Setup time: 2 minutes.

---

### Streaming chat completion

```python
import os

from openai import OpenAI

# Reuse the standard OpenAI client; only the base URL and key change.
client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
)

stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Explain how an LPU differs from a GPU for inference"}],
    stream=True,
)

# Print tokens as they arrive; delta.content can be None on some chunks.
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```

### Function calling

```python
# Reuses the `client` created in the streaming example above.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
)

# tool_calls is None when the model answers directly instead of calling a tool.
print(resp.choices[0].message.tool_calls)
```

### Production model lineup

| Model | Speed | Context | Best for |
|---|---|---|---|
| `llama-3.3-70b-versatile` | ~280 tok/s | 131K | Default: great quality, fast |
| `llama-3.1-8b-instant` | ~750 tok/s | 131K | Cheap, ultra-fast classifications |
| `mixtral-8x7b-32768` | ~500 tok/s | 32K | Multilingual, code-heavy tasks |
| `whisper-large-v3` | ~166× realtime | n/a | Audio transcription |
| `whisper-large-v3-turbo` | ~216× realtime | n/a | Faster transcription, slight accuracy tradeoff |

### Pricing (May 2026)

- llama-3.3-70b: $0.59 input / $0.79 output per 1M tokens
- llama-3.1-8b: $0.05 input / $0.08 output per 1M tokens
- whisper-large-v3: $0.111 per hour of audio

---

### FAQ

**Q: Why is Groq so much faster than GPU inference?**
A: LPU (Language Processing Unit) silicon is purpose-built for transformer inference. Sequential token decode runs at memory-bandwidth-limited speed without the batching tradeoffs GPUs make. Result: 5-10× faster TTFT and steady-state throughput on the same models.

**Q: Is there a free tier, and what are its limits?**
A: Yes, and it is generous for dev/testing: ~30 requests/minute and ~14,400 requests/day per model. Production traffic uses the paid tier, which has much higher limits. Check console.groq.com for current numbers.

**Q: Does Groq run my fine-tunes?**
A: Not currently. Groq serves only the model catalog it publishes. If you need a custom fine-tune at Groq-like speed, the options are: (1) use prompt engineering on Llama 3.3 70B; (2) deploy on Together AI or Fireworks, which support LoRA at comparable speeds. Groq has hinted at fine-tune support but has given no public timeline.

---

## Source & Thanks

> Built by [Groq](https://github.com/groq). Docs at [console.groq.com/docs](https://console.groq.com/docs).
>
> [groq/groq-python](https://github.com/groq/groq-python): official SDK

---

Source: https://tokrepo.com/en/workflows/groqcloud-quickstart-250-tokens-sec-openai-compat-api
Author: Groq
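
### Estimating request cost

The per-1M-token prices in the pricing list can be turned into a per-request estimate with simple arithmetic. The sketch below is illustrative only: `PRICES_PER_MILLION` and `estimate_cost` are hypothetical names, not part of any SDK, and mapping the pricing entries to the model IDs `llama-3.3-70b-versatile` and `llama-3.1-8b-instant` is an assumption based on the model lineup table. Prices change, so treat the numbers as a snapshot.

```python
# USD per 1M tokens as (input, output), copied from the pricing list (May 2026).
# Assumption: these prices apply to the model IDs shown in the lineup table.
PRICES_PER_MILLION = {
    "llama-3.3-70b-versatile": (0.59, 0.79),
    "llama-3.1-8b-instant": (0.05, 0.08),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one request for a known model."""
    input_price, output_price = PRICES_PER_MILLION[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    # 2,000 prompt tokens and 500 completion tokens on the 70B model.
    cost = estimate_cost("llama-3.3-70b-versatile", 2_000, 500)
    print(f"Estimated cost: ${cost:.6f}")
```

With a non-streaming response from the OpenAI client, the token counts would typically come from `resp.usage.prompt_tokens` and `resp.usage.completion_tokens`, so the estimate can be logged per call.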