Quick Use
- Sign up at console.groq.com (free)
- Point the stock OpenAI client at Groq: OpenAI(base_url='https://api.groq.com/openai/v1', api_key=GROQ_KEY)
- Use model='llama-3.3-70b-versatile'
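Putting those steps together, a minimal end-to-end sketch (assumes GROQ_API_KEY is exported in your environment):

import os
from openai import OpenAI

# Stock OpenAI client, pointed at Groq's OpenAI-compatible endpoint
client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
)
resp = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Say hello in five words."}],
)
print(resp.choices[0].message.content)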
Intro
GroqCloud serves open-weight models (Llama 3.3 70B, Llama 3.1 8B/70B, Mixtral 8×7B, Gemma 2, Whisper) on Groq's custom LPU silicon: 250+ tokens/sec on Llama 3.3 70B and sub-200 ms time-to-first-token. The API is OpenAI-compatible: change the base URL to api.groq.com/openai/v1 and you're done.
- Best for: streaming chat agents where typing speed matters, voice agents (Whisper STT under 200 ms), real-time tools where slow inference kills UX.
- Works with: openai-python, openai-node, LangChain, LlamaIndex, Vercel AI SDK.
- Setup time: 2 minutes.
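The latency claim is easy to verify yourself. A quick time-to-first-token measurement, as a self-contained sketch (same client setup as elsewhere on this page):

import os
import time
from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
)
start = time.perf_counter()
stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "ping"}],
    stream=True,
)
# Stop at the first chunk that actually carries text
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"TTFT: {time.perf_counter() - start:.3f}s")
        break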
Streaming chat completion
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
)

stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Explain how an LPU differs from a GPU for inference"}],
    stream=True,
)

# Print tokens as they arrive; delta.content is None on some chunks
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
Function calling
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {"type": "object", "properties": {"city": {"type": "string"}}},
    },
}]

resp = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
)

# The model returns structured tool calls instead of prose
print(resp.choices[0].message.tool_calls)
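The response above only contains the model's request to call get_weather; a real agent executes the function and sends the result back for a final answer. A sketch of that second leg, where lookup_weather is a hypothetical local stand-in (not part of any SDK):

import json

def lookup_weather(city: str) -> str:
    # Hypothetical implementation; replace with a real weather API call
    return f"22°C and clear in {city}"

call = resp.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)

followup = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "user", "content": "What's the weather in Tokyo?"},
        resp.choices[0].message,  # the assistant turn containing the tool call
        {"role": "tool", "tool_call_id": call.id, "content": lookup_weather(args["city"])},
    ],
    tools=tools,
)
print(followup.choices[0].message.content)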
Production model lineup
| Model | Speed | Context | Best for |
|---|---|---|---|
| llama-3.3-70b-versatile | ~280 tok/s | 131K | Default: great quality, fast |
| llama-3.1-8b-instant | ~750 tok/s | 131K | Cheap, ultra-fast classifications |
| mixtral-8x7b-32768 | ~500 tok/s | 32K | Multilingual, code-heavy tasks |
| whisper-large-v3 | ~166× realtime | n/a | Audio transcription |
| whisper-large-v3-turbo | ~216× realtime | n/a | Faster transcription, slight accuracy tradeoff |
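The Whisper models are served through the same OpenAI-compatible audio endpoint, so the client above works unchanged. A minimal transcription sketch (assumes a local file meeting.mp3):

with open("meeting.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3-turbo",
        file=f,
    )
print(transcript.text)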
Pricing (May 2026)
- llama-3.3-70b: $0.59 input / $0.79 output per 1M tokens
- llama-3.1-8b: $0.05 input / $0.08 output per 1M tokens
- whisper-large-v3: $0.111 per hour of audio
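For budgeting, a back-of-envelope helper with the rates above hard-coded (check console.groq.com for current numbers):

PRICES = {  # $ per 1M tokens: (input, output)
    "llama-3.3-70b-versatile": (0.59, 0.79),
    "llama-3.1-8b-instant": (0.05, 0.08),
}

def request_cost(model: str, in_tokens: int, out_tokens: int) -> float:
    p_in, p_out = PRICES[model]
    return in_tokens / 1e6 * p_in + out_tokens / 1e6 * p_out

# e.g. a 500-token prompt with a 300-token reply on the 70B model:
print(f"${request_cost('llama-3.3-70b-versatile', 500, 300):.5f}")  # ≈ $0.00053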
FAQ
Q: Why is Groq so much faster than GPU inference? A: LPU (Language Processing Unit) silicon is purpose-built for transformer inference — sequential token decode runs at memory-bandwidth-limited speed without GPU's batching tradeoffs. Result: 5-10× faster TTFT and steady-state throughput on the same models.
Q: Free tier limits? A: Yes — generous for dev/testing: ~30 requests/minute and ~14,400 requests/day per model. Production traffic uses paid tier with much higher limits. Check console.groq.com for current numbers.
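Under load the free tier returns 429s quickly, so wrap calls in a retry. A minimal exponential-backoff sketch using openai-python's RateLimitError:

import time
import openai

def with_backoff(fn, max_retries=5):
    for attempt in range(max_retries):
        try:
            return fn()
        except openai.RateLimitError:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError("rate limit: retries exhausted")

resp = with_backoff(lambda: client.chat.completions.create(
    model="llama-3.1-8b-instant",
    messages=[{"role": "user", "content": "classify: great product!"}],
))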
Q: Does Groq run my fine-tunes? A: Not currently — only the model catalog Groq publishes. If you need a custom fine-tune at Groq speed, options are: (1) use prompt engineering on Llama 3.3 70B; (2) deploy on Together AI / Fireworks which support LoRA on similar speeds. Groq has hinted at fine-tune support but no public timeline.
Source & Thanks
Built by Groq. Docs at console.groq.com/docs.
groq/groq-python — official SDK