Knowledge · May 8, 2026 · 4 min read

GroqCloud Quickstart — 250 tokens/sec OpenAI-Compat API

GroqCloud runs Llama 3.3 70B at 250+ tok/sec on LPU silicon. OpenAI-compatible API. Free tier, sub-200ms TTFT, ideal for streaming.

Agent-ready

This asset can be read and installed directly by agents. TokRepo exposes a universal CLI command, an install contract, JSON metadata, a per-adapter plan, and raw content so agents can evaluate compatibility, risk, and next steps.

Agent surface: any MCP/CLI agent
Type: Knowledge
Install: Stage only · 15/100
Trust: New
Input: Asset

Universal CLI command:
npx tokrepo install 8ac70a0d-0996-4fa9-a316-c9e586d54f86
Introduction

GroqCloud serves open-weight models (Llama 3.3 70B, Llama 3.1 8B/70B, Mixtral 8×7B, Gemma 2, Whisper) on Groq's LPU custom silicon — 250+ tokens/sec on Llama 3.3 70B and sub-200ms time-to-first-token. The API is OpenAI-compatible: change base URL to api.groq.com/openai/v1 and you're done. Best for: streaming chat agents where typing speed matters, voice agents (Whisper STT under 200ms), real-time tools where slow inference kills UX. Works with: openai-python, openai-node, LangChain, LlamaIndex, Vercel AI SDK. Setup time: 2 minutes.


Streaming chat completion

import os

from openai import OpenAI

# Point the standard OpenAI client at Groq's OpenAI-compatible endpoint
client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
)

stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Explain how an LPU differs from a GPU for inference"}],
    stream=True,
)
# Print tokens as they arrive; delta.content is None on role-only chunks
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
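
The sub-200ms TTFT claim is easy to check from the client side. Here is a minimal sketch, assuming all you care about is wall-clock time to the first streamed chunk (the timing code is illustrative, not part of any SDK):

import time

# Measure TTFT: delay between sending the request and the first chunk
start = time.perf_counter()
stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Say hello"}],
    stream=True,
)
ttft = None
for chunk in stream:
    if ttft is None:
        ttft = time.perf_counter() - start  # first chunk has arrived
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print(f"\nTTFT: {ttft * 1000:.0f} ms")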

Function calling

# JSON-schema description of a tool the model may choose to call
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {"type": "object", "properties": {"city": {"type": "string"}}},
    },
}]

# Reuses the client from the streaming example; the model decides whether
# to request the tool rather than answer directly
resp = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)  # requested call(s); nothing is executed yet
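
The response above only carries the requested call; your code executes the tool and sends the result back for a final answer. A minimal sketch of that second leg, continuing the example (the hard-coded weather dict is a stand-in for a real lookup you'd implement):

import json

msg = resp.choices[0].message
call = msg.tool_calls[0]
args = json.loads(call.function.arguments)  # e.g. {"city": "Tokyo"}

# Stand-in result; replace with a real weather API call
weather = {"city": args["city"], "temp_c": 21, "conditions": "clear"}

followup = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "user", "content": "What's the weather in Tokyo?"},
        msg,  # the assistant turn that requested the tool
        {"role": "tool", "tool_call_id": call.id, "content": json.dumps(weather)},
    ],
    tools=tools,
)
print(followup.choices[0].message.content)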

Production model lineup

Model                    Speed (tok/s)    Context  Best for
llama-3.3-70b-versatile  ~280             131K     Default — great quality, fast
llama-3.1-8b-instant     ~750             131K     Cheap, ultra-fast classifications
mixtral-8x7b-32768       ~500             32K      Multilingual, code-heavy tasks
whisper-large-v3         ~166× realtime   n/a      Audio transcription
whisper-large-v3-turbo   ~216× realtime   n/a      Faster transcription, slight accuracy tradeoff
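
Whisper runs behind the same OpenAI-compatible surface, so the standard audio endpoint applies. A minimal transcription sketch, reusing the client from above (meeting.mp3 is a placeholder filename):

# Transcribe a local audio file with Groq-hosted Whisper
with open("meeting.mp3", "rb") as audio:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio,
    )
print(transcript.text)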

Pricing (May 2026)

  • llama-3.3-70b: $0.59 input / $0.79 output per 1M tokens
  • llama-3.1-8b: $0.05 input / $0.08 output per 1M tokens
  • whisper-large-v3: $0.111 per hour of audio
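
At these rates, a cost check is one multiplication per direction. Here is a rough sketch using the usage block each response carries (PRICES and request_cost are illustrative names, the rates are hard-coded from the list above, and you should verify current numbers on console.groq.com):

# USD per 1M tokens (input, output), from the May 2026 list above
PRICES = {
    "llama-3.3-70b-versatile": (0.59, 0.79),
    "llama-3.1-8b-instant": (0.05, 0.08),
}

def request_cost(resp) -> float:
    """Rough USD cost of one completion, computed from its token counts."""
    inp, out = PRICES[resp.model]  # assumes resp.model matches a key above
    u = resp.usage
    return (u.prompt_tokens * inp + u.completion_tokens * out) / 1_000_000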

FAQ

Q: Why is Groq so much faster than GPU inference?
A: LPU (Language Processing Unit) silicon is purpose-built for transformer inference — sequential token decode runs at memory-bandwidth-limited speed without a GPU's batching tradeoffs. The result: 5-10× faster TTFT and steady-state throughput on the same models.

Q: Are there free-tier limits?
A: Yes — the free tier is generous for dev/testing: ~30 requests/minute and ~14,400 requests/day per model. Production traffic uses the paid tier, with much higher limits; check console.groq.com for current numbers. A minimal backoff sketch for handling free-tier 429s follows this FAQ.

Q: Does Groq run my fine-tunes?
A: Not currently — Groq serves only the model catalog it publishes. If you need a custom fine-tune at Groq speed, the options are: (1) prompt engineering on Llama 3.3 70B; (2) deploying on Together AI or Fireworks, which support LoRA at similar speeds. Groq has hinted at fine-tune support but has given no public timeline.
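
As mentioned in the free-tier answer above, under load the free tier starts returning HTTP 429s, and a small retry wrapper keeps dev scripts alive. A minimal sketch with jittered exponential backoff (with_backoff and its retry parameters are arbitrary choices here; the openai SDK can also retry for you via its max_retries option):

import random
import time

import openai

def with_backoff(fn, max_attempts=5):
    """Retry fn on rate limits with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except openai.RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error
            time.sleep(2 ** attempt + random.random())

resp = with_backoff(lambda: client.chat.completions.create(
    model="llama-3.1-8b-instant",
    messages=[{"role": "user", "content": "ping"}],
))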


Quick Use

  1. Sign up at console.groq.com (free)
  2. OpenAI(base_url='https://api.groq.com/openai/v1', api_key=GROQ_KEY)
  3. Use model='llama-3.3-70b-versatile'

Source & Thanks

Built by Groq. Docs at console.groq.com/docs.

groq/groq-python — official SDK

🙏

