Knowledge · May 8, 2026 · 5 min read

Fireworks Inference — 100+ Open Models on OpenAI-Compat API

Fireworks runs Llama, Mixtral, DeepSeek, Qwen, Phi via OpenAI-compat API. Sub-second TTFT, speculative decoding on flagship models.

Agent-ready

This asset can be read and installed directly by agents.

TokRepo exposes a universal CLI command, an install contract, JSON metadata, a per-adapter plan, and the raw content to help agents judge fit, risk, and next actions.

Stage only · 15/100
Agent surface: Any MCP/CLI agent
Type: Knowledge
Installation: Stage only
Trust: New
Entry point: Asset
Universal CLI command
npx tokrepo install 63bacf7e-2334-4483-a208-b2b40b09383c
Introduction

Fireworks AI is a serverless inference platform for 100+ open-weight models — Llama 3.3, Mixtral, DeepSeek-V3, Qwen 2.5, Phi 4, plus image (Flux, SDXL) and audio (Whisper) models. The API is OpenAI-compatible at api.fireworks.ai/inference/v1. Sub-second time-to-first-token, speculative decoding on flagship models for 2-4× throughput. Best for: production apps that need fast OSS inference, teams switching off OpenAI for cost or compliance, anyone running DeepSeek-V3 or Llama 3.3 at scale. Works with: openai-python, openai-node, LangChain, LlamaIndex. Setup time: 3 minutes.


OpenAI-compatible client

import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key=os.environ["FIREWORKS_API_KEY"],
)

resp = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p3-70b-instruct",
    messages=[{"role": "user", "content": "Write a haiku about open-source AI"}],
)
print(resp.choices[0].message.content)

Production model paths

Model          | Fireworks model ID
Llama 3.3 70B  | accounts/fireworks/models/llama-v3p3-70b-instruct
DeepSeek-V3    | accounts/fireworks/models/deepseek-v3
DeepSeek-R1    | accounts/fireworks/models/deepseek-r1
Qwen 2.5 72B   | accounts/fireworks/models/qwen2p5-72b-instruct
Mixtral 8×22B  | accounts/fireworks/models/mixtral-8x22b-instruct
Whisper v3     | accounts/fireworks/models/whisper-v3
Flux dev       | accounts/fireworks/models/flux-1-dev-fp8
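The fully qualified paths above are easy to mistype. A small lookup helper keeps them in one place (a sketch; the short aliases are our own, not Fireworks names):

```python
# Map short aliases (our own) to the fully qualified Fireworks model paths
# from the table above.
MODEL_IDS = {
    "llama-3.3-70b": "accounts/fireworks/models/llama-v3p3-70b-instruct",
    "deepseek-v3": "accounts/fireworks/models/deepseek-v3",
    "deepseek-r1": "accounts/fireworks/models/deepseek-r1",
    "qwen-2.5-72b": "accounts/fireworks/models/qwen2p5-72b-instruct",
    "mixtral-8x22b": "accounts/fireworks/models/mixtral-8x22b-instruct",
    "whisper-v3": "accounts/fireworks/models/whisper-v3",
    "flux-dev": "accounts/fireworks/models/flux-1-dev-fp8",
}

def model_id(name: str) -> str:
    """Resolve a short alias to the full Fireworks model path."""
    try:
        return MODEL_IDS[name]
    except KeyError:
        raise ValueError(f"unknown model alias: {name!r}") from None
```

Failing loudly on an unknown alias beats silently sending a bad `model` string and debugging a 404 from the API.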

Speculative decoding (flagship throughput boost)

Llama 3.3 70B and DeepSeek-V3 ship with speculative decoding enabled by default — a small draft model proposes tokens, the big model verifies in parallel. Net 2-4× throughput on long generations vs naive decoding. No code change needed.

Image generation

import os

import requests

r = requests.post(
    "https://api.fireworks.ai/inference/v1/image_generation/accounts/fireworks/models/flux-1-dev-fp8",
    headers={"Authorization": f"Bearer {os.environ['FIREWORKS_API_KEY']}", "Accept": "image/jpeg"},
    json={"prompt": "Cyberpunk Tokyo skyline at dusk, photorealistic", "aspect_ratio": "16:9", "steps": 30},
)
r.raise_for_status()  # don't write an error body into the .jpg
with open("output.jpg", "wb") as f:
    f.write(r.content)

Pricing snapshot (per 1M tokens, May 2026)

  • Llama 3.3 70B: $0.90 input / $0.90 output
  • DeepSeek-V3: $0.90 / $0.90
  • DeepSeek-R1: $3.00 / $8.00
  • Qwen 2.5 72B: $0.90 / $0.90
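At these rates, per-request cost is simple arithmetic. A quick estimator, with prices copied from the snapshot above (they will drift; check the pricing page before relying on them):

```python
# Per-1M-token prices (input, output) in USD, from the May 2026 snapshot above.
PRICES = {
    "llama-3.3-70b": (0.90, 0.90),
    "deepseek-v3": (0.90, 0.90),
    "deepseek-r1": (3.00, 8.00),
    "qwen-2.5-72b": (0.90, 0.90),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Rough USD cost for one request at the listed per-1M-token rates."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000
```

Note the asymmetry on DeepSeek-R1: reasoning models emit long chains of thought as output tokens, so the $8.00 output rate usually dominates.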

FAQ

Q: Fireworks vs Together AI vs Groq? A: Groq is fastest for chat (LPU silicon, ~280 tok/s on Llama 3.3) but limited model catalog. Fireworks and Together both offer 100+ models with similar pricing; Fireworks edges ahead on throughput and image-gen, Together on long-context Llama variants.

Q: Does Fireworks support fine-tunes? A: Yes — serverless LoRA fine-tuning on Llama, Qwen, Mixtral. Train via the Firectl CLI, deploy on the same OpenAI-compatible endpoint with your fine-tune model ID. Pricing is per training token + flat hosting fee.

Q: How do I monitor cost and latency? A: Fireworks dashboard at fireworks.ai shows token usage, cost, p50/p95 latency per model. For trace-level observability, instrument with Phoenix or Langfuse — OpenInference works since Fireworks is OpenAI-compat.


Quick Use

  1. Get FIREWORKS_API_KEY at fireworks.ai
  2. OpenAI(base_url='https://api.fireworks.ai/inference/v1', api_key=KEY)
  3. Use model='accounts/fireworks/models/llama-v3p3-70b-instruct'


Source & Thanks

Built by Fireworks AI. Docs at docs.fireworks.ai.

fw-ai/forge — open SDKs and tooling

