KnowledgeMay 8, 2026·5 min read

Fireworks Inference — 100+ Open Models on OpenAI-Compat API

Fireworks runs Llama, Mixtral, DeepSeek, Qwen, Phi via OpenAI-compat API. Sub-second TTFT, speculative decoding on flagship models.

Agent ready

Safe staging for this asset

This asset is staged first. The copied prompt tells the agent to inspect the staged files and ask before activating scripts, MCP config, or global config.

Stage only · 27/100Policy: stage
Agent surface
Any MCP/CLI agent
Kind
Knowledge
Install
Stage only
Trust
Trust: Community
Entrypoint
Asset
Safe staging command
npx -y tokrepo@latest install 63bacf7e-2334-4483-a208-b2b40b09383c --target codex

Stages files first; activation requires review of the staged README and plan.

Intro

Fireworks AI is a serverless inference platform for 100+ open-weight models — Llama 3.3, Mixtral, DeepSeek-V3, Qwen 2.5, Phi 4, plus image (Flux, SDXL) and audio (Whisper) models. The API is OpenAI-compatible at api.fireworks.ai/inference/v1. Sub-second time-to-first-token, speculative decoding on flagship models for 2-4× throughput. Best for: production apps that need fast OSS inference, teams switching off OpenAI for cost or compliance, anyone running DeepSeek-V3 or Llama 3.3 at scale. Works with: openai-python, openai-node, LangChain, LlamaIndex. Setup time: 3 minutes.


OpenAI-compatible client

from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key=os.environ["FIREWORKS_API_KEY"],
)

resp = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p3-70b-instruct",
    messages=[{"role": "user", "content": "Write a haiku about open-source AI"}],
)
print(resp.choices[0].message.content)

Production model paths

Model Fireworks model ID
Llama 3.3 70B accounts/fireworks/models/llama-v3p3-70b-instruct
DeepSeek-V3 accounts/fireworks/models/deepseek-v3
DeepSeek-R1 accounts/fireworks/models/deepseek-r1
Qwen 2.5 72B accounts/fireworks/models/qwen2p5-72b-instruct
Mixtral 8×22B accounts/fireworks/models/mixtral-8x22b-instruct
Whisper v3 accounts/fireworks/models/whisper-v3
Flux dev accounts/fireworks/models/flux-1-dev-fp8

Speculative decoding (flagship throughput boost)

Llama 3.3 70B and DeepSeek-V3 ship with speculative decoding enabled by default — a small draft model proposes tokens, the big model verifies in parallel. Net 2-4× throughput on long generations vs naive decoding. No code change needed.

Image generation

import requests

r = requests.post(
    "https://api.fireworks.ai/inference/v1/image_generation/accounts/fireworks/models/flux-1-dev-fp8",
    headers={"Authorization": f"Bearer {os.environ['FIREWORKS_API_KEY']}", "Accept": "image/jpeg"},
    json={"prompt": "Cyberpunk Tokyo skyline at dusk, photorealistic", "aspect_ratio": "16:9", "steps": 30},
)
open("output.jpg", "wb").write(r.content)

Pricing snapshot (per 1M tokens, May 2026)

  • Llama 3.3 70B: $0.90 input / $0.90 output
  • DeepSeek-V3: $0.90 / $0.90
  • DeepSeek-R1: $3.00 / $8.00
  • Qwen 2.5 72B: $0.90 / $0.90

FAQ

Q: Fireworks vs Together AI vs Groq? A: Groq is fastest for chat (LPU silicon, ~280 tok/s on Llama 3.3) but limited model catalog. Fireworks and Together both offer 100+ models with similar pricing; Fireworks edges ahead on throughput and image-gen, Together on long-context Llama variants.

Q: Does Fireworks support fine-tunes? A: Yes — serverless LoRA fine-tuning on Llama, Qwen, Mixtral. Train via the Firectl CLI, deploy on the same OpenAI-compatible endpoint with your fine-tune model ID. Pricing is per training token + flat hosting fee.

Q: How do I monitor cost and latency? A: Fireworks dashboard at fireworks.ai shows token usage, cost, p50/p95 latency per model. For trace-level observability, instrument with Phoenix or Langfuse — OpenInference works since Fireworks is OpenAI-compat.


Quick Use

  1. Get FIREWORKS_API_KEY at fireworks.ai
  2. OpenAI(base_url='https://api.fireworks.ai/inference/v1', api_key=KEY)
  3. Use model='accounts/fireworks/models/llama-v3p3-70b-instruct'

Intro

Fireworks AI is a serverless inference platform for 100+ open-weight models — Llama 3.3, Mixtral, DeepSeek-V3, Qwen 2.5, Phi 4, plus image (Flux, SDXL) and audio (Whisper) models. The API is OpenAI-compatible at api.fireworks.ai/inference/v1. Sub-second time-to-first-token, speculative decoding on flagship models for 2-4× throughput. Best for: production apps that need fast OSS inference, teams switching off OpenAI for cost or compliance, anyone running DeepSeek-V3 or Llama 3.3 at scale. Works with: openai-python, openai-node, LangChain, LlamaIndex. Setup time: 3 minutes.


OpenAI-compatible client

from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key=os.environ["FIREWORKS_API_KEY"],
)

resp = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p3-70b-instruct",
    messages=[{"role": "user", "content": "Write a haiku about open-source AI"}],
)
print(resp.choices[0].message.content)

Production model paths

Model Fireworks model ID
Llama 3.3 70B accounts/fireworks/models/llama-v3p3-70b-instruct
DeepSeek-V3 accounts/fireworks/models/deepseek-v3
DeepSeek-R1 accounts/fireworks/models/deepseek-r1
Qwen 2.5 72B accounts/fireworks/models/qwen2p5-72b-instruct
Mixtral 8×22B accounts/fireworks/models/mixtral-8x22b-instruct
Whisper v3 accounts/fireworks/models/whisper-v3
Flux dev accounts/fireworks/models/flux-1-dev-fp8

Speculative decoding (flagship throughput boost)

Llama 3.3 70B and DeepSeek-V3 ship with speculative decoding enabled by default — a small draft model proposes tokens, the big model verifies in parallel. Net 2-4× throughput on long generations vs naive decoding. No code change needed.

Image generation

import requests

r = requests.post(
    "https://api.fireworks.ai/inference/v1/image_generation/accounts/fireworks/models/flux-1-dev-fp8",
    headers={"Authorization": f"Bearer {os.environ['FIREWORKS_API_KEY']}", "Accept": "image/jpeg"},
    json={"prompt": "Cyberpunk Tokyo skyline at dusk, photorealistic", "aspect_ratio": "16:9", "steps": 30},
)
open("output.jpg", "wb").write(r.content)

Pricing snapshot (per 1M tokens, May 2026)

  • Llama 3.3 70B: $0.90 input / $0.90 output
  • DeepSeek-V3: $0.90 / $0.90
  • DeepSeek-R1: $3.00 / $8.00
  • Qwen 2.5 72B: $0.90 / $0.90

FAQ

Q: Fireworks vs Together AI vs Groq? A: Groq is fastest for chat (LPU silicon, ~280 tok/s on Llama 3.3) but limited model catalog. Fireworks and Together both offer 100+ models with similar pricing; Fireworks edges ahead on throughput and image-gen, Together on long-context Llama variants.

Q: Does Fireworks support fine-tunes? A: Yes — serverless LoRA fine-tuning on Llama, Qwen, Mixtral. Train via the Firectl CLI, deploy on the same OpenAI-compatible endpoint with your fine-tune model ID. Pricing is per training token + flat hosting fee.

Q: How do I monitor cost and latency? A: Fireworks dashboard at fireworks.ai shows token usage, cost, p50/p95 latency per model. For trace-level observability, instrument with Phoenix or Langfuse — OpenInference works since Fireworks is OpenAI-compat.


Source & Thanks

Built by Fireworks AI. Docs at docs.fireworks.ai.

fw-ai/forge — open SDKs and tooling

🙏

Source & Thanks

Built by Fireworks AI. Docs at docs.fireworks.ai.

fw-ai/forge — open SDKs and tooling

Discussion

Sign in to join the discussion.
No comments yet. Be the first to share your thoughts.

Related Assets