Knowledge · May 8, 2026 · 5 min read

Fireworks Inference — 100+ Open Models on OpenAI-Compat API

Fireworks runs Llama, Mixtral, DeepSeek, Qwen, Phi via OpenAI-compat API. Sub-second TTFT, speculative decoding on flagship models.

Agent-ready

This asset can be read and installed directly by agents.

TokRepo exposes a universal CLI command, an install contract, JSON metadata, a per-adapter plan, and raw content so agents can evaluate compatibility, risk, and next steps.

Stage only · 15/100
Agent surface
Any MCP/CLI agent
Type
Knowledge
Install
Stage only
Trust
Trust: New
Input
Asset
Universal CLI command
npx tokrepo install 63bacf7e-2334-4483-a208-b2b40b09383c
Introduction

Fireworks AI is a serverless inference platform for 100+ open-weight models — Llama 3.3, Mixtral, DeepSeek-V3, Qwen 2.5, Phi 4, plus image (Flux, SDXL) and audio (Whisper) models. The API is OpenAI-compatible at api.fireworks.ai/inference/v1. Sub-second time-to-first-token, speculative decoding on flagship models for 2-4× throughput. Best for: production apps that need fast OSS inference, teams switching off OpenAI for cost or compliance, anyone running DeepSeek-V3 or Llama 3.3 at scale. Works with: openai-python, openai-node, LangChain, LlamaIndex. Setup time: 3 minutes.


OpenAI-compatible client

import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key=os.environ["FIREWORKS_API_KEY"],
)

resp = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p3-70b-instruct",
    messages=[{"role": "user", "content": "Write a haiku about open-source AI"}],
)
print(resp.choices[0].message.content)

Production model paths

Model            Fireworks model ID
Llama 3.3 70B    accounts/fireworks/models/llama-v3p3-70b-instruct
DeepSeek-V3      accounts/fireworks/models/deepseek-v3
DeepSeek-R1      accounts/fireworks/models/deepseek-r1
Qwen 2.5 72B     accounts/fireworks/models/qwen2p5-72b-instruct
Mixtral 8×22B    accounts/fireworks/models/mixtral-8x22b-instruct
Whisper v3       accounts/fireworks/models/whisper-v3
Flux dev         accounts/fireworks/models/flux-1-dev-fp8
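The account-scoped IDs above are long and easy to mistype. One option is a small lookup table in your own code (the friendly names and the `model_id` helper below are ours, not part of the Fireworks API; the IDs are copied from the table):

```python
# Friendly names -> Fireworks model IDs (copied from the table above).
FIREWORKS_MODELS = {
    "llama-3.3-70b": "accounts/fireworks/models/llama-v3p3-70b-instruct",
    "deepseek-v3": "accounts/fireworks/models/deepseek-v3",
    "deepseek-r1": "accounts/fireworks/models/deepseek-r1",
    "qwen-2.5-72b": "accounts/fireworks/models/qwen2p5-72b-instruct",
    "mixtral-8x22b": "accounts/fireworks/models/mixtral-8x22b-instruct",
}

def model_id(name: str) -> str:
    """Resolve a friendly name to the full Fireworks model path."""
    try:
        return FIREWORKS_MODELS[name]
    except KeyError:
        raise ValueError(f"Unknown model {name!r}; known: {sorted(FIREWORKS_MODELS)}")

print(model_id("deepseek-v3"))
```

A typo then fails loudly with the list of known names instead of surfacing as an opaque 404 from the API.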

Speculative decoding (flagship throughput boost)

Llama 3.3 70B and DeepSeek-V3 ship with speculative decoding enabled by default — a small draft model proposes tokens, the big model verifies in parallel. Net 2-4× throughput on long generations vs naive decoding. No code change needed.

Image generation

import os

import requests

r = requests.post(
    "https://api.fireworks.ai/inference/v1/image_generation/accounts/fireworks/models/flux-1-dev-fp8",
    headers={"Authorization": f"Bearer {os.environ['FIREWORKS_API_KEY']}", "Accept": "image/jpeg"},
    json={"prompt": "Cyberpunk Tokyo skyline at dusk, photorealistic", "aspect_ratio": "16:9", "steps": 30},
)
r.raise_for_status()  # fail fast on auth or quota errors instead of writing an error body to disk
with open("output.jpg", "wb") as f:
    f.write(r.content)

Pricing snapshot (per 1M tokens, May 2026)

  • Llama 3.3 70B: $0.90 input / $0.90 output
  • DeepSeek-V3: $0.90 / $0.90
  • DeepSeek-R1: $3.00 / $8.00
  • Qwen 2.5 72B: $0.90 / $0.90
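The chat completions response includes a `usage` object with token counts, so per-request cost can be estimated directly from the snapshot above (the `estimate_cost` helper and its price table are ours; prices are the May 2026 figures listed and will drift):

```python
# Per-1M-token (input, output) prices in USD, from the snapshot above.
PRICES = {
    "llama-v3p3-70b-instruct": (0.90, 0.90),
    "deepseek-v3": (0.90, 0.90),
    "deepseek-r1": (3.00, 8.00),
    "qwen2p5-72b-instruct": (0.90, 0.90),
}

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate request cost in USD from token counts (resp.usage in the API)."""
    inp, out = PRICES[model]
    return (prompt_tokens * inp + completion_tokens * out) / 1_000_000

# e.g. a DeepSeek-R1 call with 2,000 prompt and 10,000 reasoning+output tokens:
print(f"${estimate_cost('deepseek-r1', 2_000, 10_000):.4f}")  # -> $0.0860
```

Note how R1's asymmetric output pricing dominates: reasoning models emit long chains of thought, so completion tokens, not prompt tokens, drive the bill.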

FAQ

Q: Fireworks vs Together AI vs Groq? A: Groq is fastest for chat (LPU silicon, ~280 tok/s on Llama 3.3) but limited model catalog. Fireworks and Together both offer 100+ models with similar pricing; Fireworks edges ahead on throughput and image-gen, Together on long-context Llama variants.

Q: Does Fireworks support fine-tunes? A: Yes — serverless LoRA fine-tuning on Llama, Qwen, Mixtral. Train via the Firectl CLI, deploy on the same OpenAI-compatible endpoint with your fine-tune model ID. Pricing is per training token + flat hosting fee.

Q: How do I monitor cost and latency? A: The Fireworks dashboard at fireworks.ai shows token usage, cost, and p50/p95 latency per model. For trace-level observability, instrument with Phoenix or Langfuse — OpenInference instrumentation works since the API is OpenAI-compatible.
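If you log request latencies yourself rather than relying on the dashboard, the p50/p95 figures it reports can be reproduced from raw samples with a nearest-rank percentile (a generic sketch, not a Fireworks API):

```python
import math

def percentile(samples: list[float], q: float) -> float:
    """Nearest-rank percentile of raw latency samples (q in [0, 100])."""
    if not samples:
        raise ValueError("no samples")
    s = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(s)))  # 1-indexed nearest rank
    return s[rank - 1]

latencies_ms = [120, 135, 150, 148, 900, 132, 140, 138, 145, 128]
print(percentile(latencies_ms, 50))  # -> 138
print(percentile(latencies_ms, 95))  # -> 900, one slow outlier owns the tail
```

This is why p95 is the number to alert on: the median barely moves when one request in ten stalls.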


Quick Use

  1. Get FIREWORKS_API_KEY at fireworks.ai
  2. OpenAI(base_url='https://api.fireworks.ai/inference/v1', api_key=KEY)
  3. Use model='accounts/fireworks/models/llama-v3p3-70b-instruct'



Source & Thanks

Built by Fireworks AI. Docs at docs.fireworks.ai.

fw-ai/forge — open SDKs and tooling

🙏

