Knowledge · May 8, 2026 · 4 min read

GroqCloud Quickstart — 250 tokens/sec OpenAI-Compat API

GroqCloud runs Llama 3.3 70B at 250+ tok/sec on LPU silicon. OpenAI-compatible API. Free tier, sub-second TTFT, ideal for streaming.

Groq
Groq · Community
Agent-ready

This asset can be read and installed directly by agents

TokRepo exposes a universal CLI command, an install contract, the JSON metadata, an adapter-specific plan, and the raw content to help agents judge fit, risk, and next actions.

Stage only · 15/100
Agent surface
Any MCP/CLI agent
Type
Knowledge
Installation
Stage only
Trust
Trust: New
Entry point
Asset
Universal CLI command
npx tokrepo install 8ac70a0d-0996-4fa9-a316-c9e586d54f86
Introduction

GroqCloud serves open-weight models (Llama 3.3 70B, Llama 3.1 8B/70B, Mixtral 8×7B, Gemma 2, Whisper) on Groq's custom LPU silicon — 250+ tokens/sec on Llama 3.3 70B and sub-200ms time-to-first-token. The API is OpenAI-compatible: point the base URL at api.groq.com/openai/v1 and you're done. Best for: streaming chat agents where typing speed matters, voice agents (Whisper STT under 200ms), and real-time tools where slow inference kills UX. Works with: openai-python, openai-node, LangChain, LlamaIndex, Vercel AI SDK. Setup time: 2 minutes.


Streaming chat completion

import os

from openai import OpenAI

# Same OpenAI SDK — only the base URL and API key change
client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
)

stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Explain how an LPU differs from a GPU for inference"}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)

Function calling

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {"type": "object", "properties": {"city": {"type": "string"}}},
    },
}]

resp = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)
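
The tool_calls array follows the OpenAI schema, so the usual round trip applies: parse the arguments, run your function, and send the result back as a tool message. A minimal sketch (the get_weather stub here is hypothetical, not part of the original example):

import json

def get_weather(city: str) -> str:
    # Hypothetical stub; swap in a real weather lookup
    return f"22°C and clear in {city}"

call = resp.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)

followup = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "user", "content": "What's the weather in Tokyo?"},
        resp.choices[0].message,  # the assistant turn that requested the tool call
        {"role": "tool", "tool_call_id": call.id, "content": get_weather(**args)},
    ],
)
print(followup.choices[0].message.content)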

Production model lineup

Model | Speed (tok/s) | Context | Best for
llama-3.3-70b-versatile | ~280 | 131K | Default — great quality, fast
llama-3.1-8b-instant | ~750 | 131K | Cheap, ultra-fast classifications
mixtral-8x7b-32768 | ~500 | 32K | Multilingual, code-heavy tasks
whisper-large-v3 | ~166× realtime | n/a | Audio transcription
whisper-large-v3-turbo | ~216× realtime | n/a | Faster transcription, slight accuracy tradeoff
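
The Whisper rows use the same client: Groq exposes the OpenAI-compatible audio transcription endpoint, so transcription is one call. A minimal sketch (meeting.mp3 is a placeholder path):

# Transcribe a local audio file on Groq's Whisper endpoint
with open("meeting.mp3", "rb") as audio:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3-turbo",
        file=audio,
    )
print(transcript.text)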

Pricing (May 2026)

  • llama-3.3-70b: $0.59 input / $0.79 output per 1M tokens
  • llama-3.1-8b: $0.05 input / $0.08 output per 1M tokens
  • whisper-large-v3: $0.111 per hour of audio
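
For budgeting, a quick back-of-the-envelope estimate using the rates above (the numbers are the May 2026 figures from this page; check console.groq.com for current pricing):

# Rough cost estimate for llama-3.3-70b at the listed rates
PRICE_IN, PRICE_OUT = 0.59, 0.79  # USD per 1M tokens (May 2026, from this page)

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens / 1e6 * PRICE_IN + output_tokens / 1e6 * PRICE_OUT

# e.g. 10,000 requests averaging 1,500 input and 400 output tokens each
print(f"${estimate_cost(10_000 * 1_500, 10_000 * 400):.2f}")  # ≈ $12.01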

FAQ

Q: Why is Groq so much faster than GPU inference? A: LPU (Language Processing Unit) silicon is purpose-built for transformer inference — sequential token decode runs at memory-bandwidth-limited speed without GPU's batching tradeoffs. Result: 5-10× faster TTFT and steady-state throughput on the same models.

Q: Free tier limits? A: Yes — generous for dev/testing: ~30 requests/minute and ~14,400 requests/day per model. Production traffic uses paid tier with much higher limits. Check console.groq.com for current numbers.
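
On the free tier those per-minute caps show up as HTTP 429 responses; a small retry-with-backoff wrapper is usually enough for dev scripts. A sketch using the openai-python exception type:

import time

from openai import RateLimitError

def chat_with_retry(messages, retries=5):
    # Back off exponentially on 429s from the free-tier rate limiter
    for attempt in range(retries):
        try:
            return client.chat.completions.create(
                model="llama-3.3-70b-versatile",
                messages=messages,
            )
        except RateLimitError:
            time.sleep(2 ** attempt)
    raise RuntimeError("rate limit retries exhausted")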

Q: Does Groq run my fine-tunes? A: Not currently — only the model catalog Groq publishes. If you need a custom fine-tune at Groq speed, the options are: (1) prompt engineering on Llama 3.3 70B; (2) deploying on Together AI or Fireworks, which serve LoRA fine-tunes at comparable speeds. Groq has hinted at fine-tune support but has no public timeline.


Quick Use

  1. Sign up at console.groq.com (free)
  2. OpenAI(base_url='https://api.groq.com/openai/v1', api_key=GROQ_KEY)
  3. Use model='llama-3.3-70b-versatile'
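
Putting the three steps together (assumes GROQ_API_KEY is set in the environment):

import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
)

resp = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Say hello in five words."}],
)
print(resp.choices[0].message.content)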

Source & Thanks

Built by Groq. Docs at console.groq.com/docs.

groq/groq-python — official SDK

🙏
