Knowledge · May 8, 2026 · 4 min read

GroqCloud Quickstart — 250 tokens/sec OpenAI-Compat API

GroqCloud runs Llama 3.3 70B at 250+ tok/sec on LPU silicon. OpenAI-compatible API. Free tier, sub-200ms TTFT, ideal for streaming.

Agent-ready

This asset can be read and installed directly by agents. TokRepo exposes a universal CLI command, an install contract, JSON metadata, a per-adapter plan, and raw content so agents can evaluate compatibility, risk, and next steps.

Agent surface: any MCP/CLI agent
Type: Knowledge
Install: Stage only · 15/100
Trust: New
Input: Asset

Universal CLI command:
npx tokrepo install 8ac70a0d-0996-4fa9-a316-c9e586d54f86
Introduction

GroqCloud serves open-weight models (Llama 3.3 70B, Llama 3.1 8B/70B, Mixtral 8×7B, Gemma 2, Whisper) on Groq's LPU custom silicon — 250+ tokens/sec on Llama 3.3 70B and sub-200ms time-to-first-token. The API is OpenAI-compatible: change base URL to api.groq.com/openai/v1 and you're done. Best for: streaming chat agents where typing speed matters, voice agents (Whisper STT under 200ms), real-time tools where slow inference kills UX. Works with: openai-python, openai-node, LangChain, LlamaIndex, Vercel AI SDK. Setup time: 2 minutes.


Streaming chat completion

import os

from openai import OpenAI

# Point the standard OpenAI client at Groq's OpenAI-compatible endpoint
client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
)

stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Explain how an LPU differs from a GPU for inference"}],
    stream=True,
)
# Print tokens as they arrive; delta.content is None on role-only chunks
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
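
The sub-200ms TTFT claim is easy to check from the client side. Here is a minimal sketch, assuming all you care about is wall-clock time to the first streamed chunk (the timing code is illustrative, not part of any SDK):

import time

# Measure TTFT: delay between sending the request and the first chunk
start = time.perf_counter()
stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Say hello"}],
    stream=True,
)
ttft = None
for chunk in stream:
    if ttft is None:
        ttft = time.perf_counter() - start  # first chunk has arrived
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print(f"\nTTFT: {ttft * 1000:.0f} ms")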

Function calling

# JSON-schema description of a tool the model may choose to call
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {"type": "object", "properties": {"city": {"type": "string"}}},
    },
}]

# Reuses the client from the streaming example; the model decides whether
# to request the tool rather than answer directly
resp = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)  # requested call(s); nothing is executed yet
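
The response above only carries the requested call; your code executes the tool and sends the result back for a final answer. A minimal sketch of that second leg, continuing the example (the hard-coded weather dict is a stand-in for a real lookup you'd implement):

import json

msg = resp.choices[0].message
call = msg.tool_calls[0]
args = json.loads(call.function.arguments)  # e.g. {"city": "Tokyo"}

# Stand-in result; replace with a real weather API call
weather = {"city": args["city"], "temp_c": 21, "conditions": "clear"}

followup = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "user", "content": "What's the weather in Tokyo?"},
        msg,  # the assistant turn that requested the tool
        {"role": "tool", "tool_call_id": call.id, "content": json.dumps(weather)},
    ],
    tools=tools,
)
print(followup.choices[0].message.content)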

Production model lineup

Model                    Speed (tok/s)    Context  Best for
llama-3.3-70b-versatile  ~280             131K     Default — great quality, fast
llama-3.1-8b-instant     ~750             131K     Cheap, ultra-fast classifications
mixtral-8x7b-32768       ~500             32K      Multilingual, code-heavy tasks
whisper-large-v3         ~166× realtime   n/a      Audio transcription
whisper-large-v3-turbo   ~216× realtime   n/a      Faster transcription, slight accuracy tradeoff
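
Whisper runs behind the same OpenAI-compatible surface, so the standard audio endpoint applies. A minimal transcription sketch, reusing the client from above (meeting.mp3 is a placeholder filename):

# Transcribe a local audio file with Groq-hosted Whisper
with open("meeting.mp3", "rb") as audio:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio,
    )
print(transcript.text)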

Pricing (May 2026)

  • llama-3.3-70b: $0.59 input / $0.79 output per 1M tokens
  • llama-3.1-8b: $0.05 input / $0.08 output per 1M tokens
  • whisper-large-v3: $0.111 per hour of audio
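
At these rates, a cost check is one multiplication per direction. Here is a rough sketch using the usage block each response carries (PRICES and request_cost are illustrative names, the rates are hard-coded from the list above, and you should verify current numbers on console.groq.com):

# USD per 1M tokens (input, output), from the May 2026 list above
PRICES = {
    "llama-3.3-70b-versatile": (0.59, 0.79),
    "llama-3.1-8b-instant": (0.05, 0.08),
}

def request_cost(resp) -> float:
    """Rough USD cost of one completion, computed from its token counts."""
    inp, out = PRICES[resp.model]  # assumes resp.model matches a key above
    u = resp.usage
    return (u.prompt_tokens * inp + u.completion_tokens * out) / 1_000_000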

FAQ

Q: Why is Groq so much faster than GPU inference?
A: LPU (Language Processing Unit) silicon is purpose-built for transformer inference — sequential token decode runs at memory-bandwidth-limited speed without a GPU's batching tradeoffs. The result: 5-10× faster TTFT and steady-state throughput on the same models.

Q: Are there free-tier limits?
A: Yes — the free tier is generous for dev/testing: ~30 requests/minute and ~14,400 requests/day per model. Production traffic uses the paid tier, with much higher limits; check console.groq.com for current numbers. A minimal backoff sketch for handling free-tier 429s follows this FAQ.

Q: Does Groq run my fine-tunes?
A: Not currently — Groq serves only the model catalog it publishes. If you need a custom fine-tune at Groq speed, the options are: (1) prompt engineering on Llama 3.3 70B; (2) deploying on Together AI or Fireworks, which support LoRA at similar speeds. Groq has hinted at fine-tune support but has given no public timeline.
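
As mentioned in the free-tier answer above, under load the free tier starts returning HTTP 429s, and a small retry wrapper keeps dev scripts alive. A minimal sketch with jittered exponential backoff (with_backoff and its retry parameters are arbitrary choices here; the openai SDK can also retry for you via its max_retries option):

import random
import time

import openai

def with_backoff(fn, max_attempts=5):
    """Retry fn on rate limits with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except openai.RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error
            time.sleep(2 ** attempt + random.random())

resp = with_backoff(lambda: client.chat.completions.create(
    model="llama-3.1-8b-instant",
    messages=[{"role": "user", "content": "ping"}],
))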


Quick Use

  1. Sign up at console.groq.com (free)
  2. OpenAI(base_url='https://api.groq.com/openai/v1', api_key=GROQ_KEY)
  3. Use model='llama-3.3-70b-versatile'

Source & Thanks

Built by Groq. Docs at console.groq.com/docs.

groq/groq-python — official SDK

🙏

