# Fireworks Inference — 100+ Open Models on OpenAI-Compat API

> Fireworks runs Llama, Mixtral, DeepSeek, Qwen, and Phi via an OpenAI-compatible API. Sub-second TTFT, speculative decoding on flagship models.

## Install

Copy the content below into your project:

## Quick Use

1. Get a FIREWORKS_API_KEY at fireworks.ai
2. `OpenAI(base_url='https://api.fireworks.ai/inference/v1', api_key=KEY)`
3. Use `model='accounts/fireworks/models/llama-v3p3-70b-instruct'`

---

## Intro

Fireworks AI is a serverless inference platform for 100+ open-weight models — Llama 3.3, Mixtral, DeepSeek-V3, Qwen 2.5, and Phi 4, plus image (Flux, SDXL) and audio (Whisper) models. The API is OpenAI-compatible at api.fireworks.ai/inference/v1. It delivers sub-second time-to-first-token, with speculative decoding on flagship models for 2-4× throughput. Best for: production apps that need fast OSS inference, teams switching off OpenAI for cost or compliance, and anyone running DeepSeek-V3 or Llama 3.3 at scale. Works with: openai-python, openai-node, LangChain, LlamaIndex. Setup time: 3 minutes.
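The speculative-decoding claim above (a draft model proposes, the big model verifies) can be illustrated with a toy sketch. This is purely conceptual and not Fireworks' server-side implementation: real systems accept or reject draft tokens against the target model's probability distribution, and on Fireworks it all happens server-side with no client code. Here the "models" are plain deterministic functions, and the function name is made up for illustration.

```python
def speculative_generate(target, draft, prompt, n_new, k=4):
    """Toy speculative decoding over deterministic 'models'.

    target/draft are callables mapping a token list to the next token.
    The cheap draft model proposes k tokens; the expensive target model
    verifies them and keeps the longest agreeing prefix, so several
    tokens can be accepted per target-model step.
    """
    out = list(prompt)
    while len(out) < len(prompt) + n_new:
        # 1. Draft model proposes k tokens autoregressively (cheap).
        ctx = list(out)
        proposed = []
        for _ in range(k):
            t = draft(ctx)
            proposed.append(t)
            ctx.append(t)

        # 2. Target model verifies; keep the longest agreeing prefix.
        ctx = list(out)
        accepted = []
        for t in proposed:
            if target(ctx) != t:
                break
            accepted.append(t)
            ctx.append(t)

        # 3. If the draft diverged immediately, fall back to a single
        #    target token so progress is guaranteed (worst case: 1x speed).
        out.extend(accepted or [target(out)])
    return out[len(prompt):][:n_new]


# Tiny demo: both "models" count upward mod 10, so the draft is always
# right and every round accepts all k proposals in one verify pass.
count_up = lambda seq: (seq[-1] + 1) % 10
print(speculative_generate(count_up, count_up, [0], 6))  # [1, 2, 3, 4, 5, 6]
```

When the draft model agrees with the target most of the time, each verify pass accepts several tokens at once, which is where the 2-4× throughput on long generations comes from.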
---

### OpenAI-compatible client

```python
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key=os.environ["FIREWORKS_API_KEY"],
)

resp = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p3-70b-instruct",
    messages=[{"role": "user", "content": "Write a haiku about open-source AI"}],
)
print(resp.choices[0].message.content)
```

### Production model paths

| Model | Fireworks model ID |
|---|---|
| Llama 3.3 70B | `accounts/fireworks/models/llama-v3p3-70b-instruct` |
| DeepSeek-V3 | `accounts/fireworks/models/deepseek-v3` |
| DeepSeek-R1 | `accounts/fireworks/models/deepseek-r1` |
| Qwen 2.5 72B | `accounts/fireworks/models/qwen2p5-72b-instruct` |
| Mixtral 8×22B | `accounts/fireworks/models/mixtral-8x22b-instruct` |
| Whisper v3 | `accounts/fireworks/models/whisper-v3` |
| Flux dev | `accounts/fireworks/models/flux-1-dev-fp8` |

### Speculative decoding (flagship throughput boost)

Llama 3.3 70B and DeepSeek-V3 ship with speculative decoding enabled by default — a small draft model proposes tokens, and the big model verifies them in parallel. Net 2-4× throughput on long generations versus naive decoding. No code change is needed.

### Image generation

```python
import os

import requests

r = requests.post(
    "https://api.fireworks.ai/inference/v1/image_generation/accounts/fireworks/models/flux-1-dev-fp8",
    headers={
        "Authorization": f"Bearer {os.environ['FIREWORKS_API_KEY']}",
        "Accept": "image/jpeg",
    },
    json={
        "prompt": "Cyberpunk Tokyo skyline at dusk, photorealistic",
        "aspect_ratio": "16:9",
        "steps": 30,
    },
)
with open("output.jpg", "wb") as f:
    f.write(r.content)
```

### Pricing snapshot (per 1M tokens, May 2026)

- Llama 3.3 70B: $0.90 input / $0.90 output
- DeepSeek-V3: $0.90 / $0.90
- DeepSeek-R1: $3.00 / $8.00
- Qwen 2.5 72B: $0.90 / $0.90

---

### FAQ

**Q: Fireworks vs Together AI vs Groq?**
A: Groq is fastest for chat (LPU silicon, ~280 tok/s on Llama 3.3) but has a limited model catalog.
Fireworks and Together both offer 100+ models at similar pricing; Fireworks edges ahead on throughput and image generation, Together on long-context Llama variants.

**Q: Does Fireworks support fine-tunes?**
A: Yes — serverless LoRA fine-tuning on Llama, Qwen, and Mixtral. Train via the Firectl CLI, then deploy on the same OpenAI-compatible endpoint using your fine-tune's model ID. Pricing is per training token plus a flat hosting fee.

**Q: How do I monitor cost and latency?**
A: The Fireworks dashboard at fireworks.ai shows token usage, cost, and p50/p95 latency per model. For trace-level observability, instrument with Phoenix or Langfuse — OpenInference works since Fireworks is OpenAI-compatible.

---

## Source & Thanks

> Built by [Fireworks AI](https://github.com/fw-ai). Docs at [docs.fireworks.ai](https://docs.fireworks.ai).
>
> [fw-ai/forge](https://github.com/fw-ai) — open SDKs and tooling

---

Source: https://tokrepo.com/en/workflows/fireworks-inference-100-open-models-on-openai-compat-api
Author: Fireworks AI