Quick Use
- Get a FIREWORKS_API_KEY at fireworks.ai
- Use OpenAI(base_url="https://api.fireworks.ai/inference/v1", api_key=KEY)
- Set model="accounts/fireworks/models/llama-v3p3-70b-instruct"
Intro
Fireworks AI is a serverless inference platform for 100+ open-weight models — Llama 3.3, Mixtral, DeepSeek-V3, Qwen 2.5, Phi 4, plus image (Flux, SDXL) and audio (Whisper) models. The API is OpenAI-compatible at api.fireworks.ai/inference/v1. Sub-second time-to-first-token, speculative decoding on flagship models for 2-4× throughput. Best for: production apps that need fast OSS inference, teams switching off OpenAI for cost or compliance, anyone running DeepSeek-V3 or Llama 3.3 at scale. Works with: openai-python, openai-node, LangChain, LlamaIndex. Setup time: 3 minutes.
OpenAI-compatible client
```python
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key=os.environ["FIREWORKS_API_KEY"],
)
resp = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p3-70b-instruct",
    messages=[{"role": "user", "content": "Write a haiku about open-source AI"}],
)
print(resp.choices[0].message.content)
```
Production model paths
| Model | Fireworks model ID |
|---|---|
| Llama 3.3 70B | accounts/fireworks/models/llama-v3p3-70b-instruct |
| DeepSeek-V3 | accounts/fireworks/models/deepseek-v3 |
| DeepSeek-R1 | accounts/fireworks/models/deepseek-r1 |
| Qwen 2.5 72B | accounts/fireworks/models/qwen2p5-72b-instruct |
| Mixtral 8×22B | accounts/fireworks/models/mixtral-8x22b-instruct |
| Whisper v3 | accounts/fireworks/models/whisper-v3 |
| Flux dev | accounts/fireworks/models/flux-1-dev-fp8 |
Speculative decoding (flagship throughput boost)
Llama 3.3 70B and DeepSeek-V3 ship with speculative decoding enabled by default — a small draft model proposes tokens, the big model verifies in parallel. Net 2-4× throughput on long generations vs naive decoding. No code change needed.
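To see where the 2-4× figure can come from, here is a back-of-envelope model of speculative decoding throughput. The draft length, acceptance rate, and relative draft cost below are illustrative assumptions, not Fireworks-measured values:

```python
# Toy model of speculative decoding throughput vs naive decoding.
# All parameter values are illustrative assumptions.

def speedup(k: int, accept: float, draft_cost: float) -> float:
    """Expected speedup over naive one-token-per-step decoding.

    k: tokens the draft model proposes per verification step
    accept: probability each proposed token is accepted by the big model
    draft_cost: draft-model forward cost relative to one big-model step
    """
    # Expected accepted tokens per step (truncated geometric series),
    # plus the one token the big model emits on the verification pass.
    expected_tokens = sum(accept**i for i in range(1, k + 1)) + 1
    # Cost per step: k cheap draft forwards + 1 big-model verification.
    cost = k * draft_cost + 1
    return expected_tokens / cost

# A decent draft model (80% acceptance, 5% of big-model cost) lands
# squarely in the advertised range:
print(round(speedup(k=4, accept=0.8, draft_cost=0.05), 2))  # → 2.8
```

The intuition: the big model's verification pass scores all draft tokens in parallel, so each expensive forward pass yields several tokens instead of one, as long as the draft model's acceptance rate stays high.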
Image generation
```python
import os

import requests

r = requests.post(
    "https://api.fireworks.ai/inference/v1/image_generation/accounts/fireworks/models/flux-1-dev-fp8",
    headers={
        "Authorization": f"Bearer {os.environ['FIREWORKS_API_KEY']}",
        "Accept": "image/jpeg",
    },
    json={"prompt": "Cyberpunk Tokyo skyline at dusk, photorealistic", "aspect_ratio": "16:9", "steps": 30},
)
r.raise_for_status()
with open("output.jpg", "wb") as f:
    f.write(r.content)
```
Pricing snapshot (per 1M tokens, May 2026)
- Llama 3.3 70B: $0.90 input / $0.90 output
- DeepSeek-V3: $0.90 / $0.90
- DeepSeek-R1: $3.00 / $8.00
- Qwen 2.5 72B: $0.90 / $0.90
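A quick way to turn these per-1M-token rates into a monthly bill. The traffic volumes below are hypothetical; the prices are the May 2026 snapshot above:

```python
# Rough monthly cost estimate from per-1M-token prices.
# Prices from the May 2026 snapshot; traffic numbers are hypothetical.

PRICE_PER_1M = {  # model -> (input USD, output USD) per 1M tokens
    "llama-v3p3-70b-instruct": (0.90, 0.90),
    "deepseek-v3": (0.90, 0.90),
    "deepseek-r1": (3.00, 8.00),
}

def monthly_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost for a month's token traffic on a given model."""
    p_in, p_out = PRICE_PER_1M[model]
    return (input_tokens * p_in + output_tokens * p_out) / 1_000_000

# e.g. 500M input + 100M output tokens/month on Llama 3.3 70B
print(round(monthly_cost("llama-v3p3-70b-instruct", 500_000_000, 100_000_000), 2))  # → 540.0
```

Note how DeepSeek-R1's asymmetric output pricing dominates for reasoning workloads, where output (chain-of-thought) tokens usually outnumber input tokens.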
FAQ
Q: Fireworks vs Together AI vs Groq? A: Groq is fastest for chat (LPU silicon, ~280 tok/s on Llama 3.3) but limited model catalog. Fireworks and Together both offer 100+ models with similar pricing; Fireworks edges ahead on throughput and image-gen, Together on long-context Llama variants.
Q: Does Fireworks support fine-tunes? A: Yes — serverless LoRA fine-tuning on Llama, Qwen, Mixtral. Train via the Firectl CLI, deploy on the same OpenAI-compatible endpoint with your fine-tune model ID. Pricing is per training token + flat hosting fee.
Q: How do I monitor cost and latency? A: The Fireworks dashboard at fireworks.ai shows token usage, cost, and p50/p95 latency per model. For trace-level observability, instrument with Phoenix or Langfuse — OpenInference instrumentation works because Fireworks is OpenAI-compatible.
Source & Thanks
Built by Fireworks AI. Docs at docs.fireworks.ai.
fw-ai/forge — open SDKs and tooling