Knowledge · May 8, 2026 · 5 min read

Fireworks Inference — 100+ Open Models on OpenAI-Compat API

Fireworks runs Llama, Mixtral, DeepSeek, Qwen, Phi via OpenAI-compat API. Sub-second TTFT, speculative decoding on flagship models.

Agent-ready

This asset can be read and installed directly by agents.

TokRepo exposes a universal CLI command, an install contract, JSON metadata, a per-adapter plan, and the raw content to help agents judge fit, risk, and next actions.

Stage only · 15/100
Agent surface: Any MCP/CLI agent
Type: Knowledge
Installation: Stage only
Trust: New
Entry point: Asset
Universal CLI command
npx tokrepo install 63bacf7e-2334-4483-a208-b2b40b09383c
Introduction

Fireworks AI is a serverless inference platform for 100+ open-weight models — Llama 3.3, Mixtral, DeepSeek-V3, Qwen 2.5, Phi 4, plus image (Flux, SDXL) and audio (Whisper) models. The API is OpenAI-compatible at api.fireworks.ai/inference/v1. Sub-second time-to-first-token, speculative decoding on flagship models for 2-4× throughput. Best for: production apps that need fast OSS inference, teams switching off OpenAI for cost or compliance, anyone running DeepSeek-V3 or Llama 3.3 at scale. Works with: openai-python, openai-node, LangChain, LlamaIndex. Setup time: 3 minutes.


OpenAI-compatible client

import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key=os.environ["FIREWORKS_API_KEY"],
)

resp = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p3-70b-instruct",
    messages=[{"role": "user", "content": "Write a haiku about open-source AI"}],
)
print(resp.choices[0].message.content)

Production model paths

Model          | Fireworks model ID
Llama 3.3 70B  | accounts/fireworks/models/llama-v3p3-70b-instruct
DeepSeek-V3    | accounts/fireworks/models/deepseek-v3
DeepSeek-R1    | accounts/fireworks/models/deepseek-r1
Qwen 2.5 72B   | accounts/fireworks/models/qwen2p5-72b-instruct
Mixtral 8×22B  | accounts/fireworks/models/mixtral-8x22b-instruct
Whisper v3     | accounts/fireworks/models/whisper-v3
Flux dev       | accounts/fireworks/models/flux-1-dev-fp8
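The fully qualified paths above are easy to mistype. A small lookup helper keeps them in one place (a sketch; the short aliases are our own, not Fireworks names):

```python
# Map short aliases (our own) to the fully qualified Fireworks model paths
# from the table above.
MODEL_IDS = {
    "llama-3.3-70b": "accounts/fireworks/models/llama-v3p3-70b-instruct",
    "deepseek-v3": "accounts/fireworks/models/deepseek-v3",
    "deepseek-r1": "accounts/fireworks/models/deepseek-r1",
    "qwen-2.5-72b": "accounts/fireworks/models/qwen2p5-72b-instruct",
    "mixtral-8x22b": "accounts/fireworks/models/mixtral-8x22b-instruct",
    "whisper-v3": "accounts/fireworks/models/whisper-v3",
    "flux-dev": "accounts/fireworks/models/flux-1-dev-fp8",
}

def model_id(name: str) -> str:
    """Resolve a short alias to the full Fireworks model path."""
    try:
        return MODEL_IDS[name]
    except KeyError:
        raise ValueError(f"unknown model alias: {name!r}") from None
```

Failing loudly on an unknown alias beats silently sending a bad `model` string and debugging a 404 from the API.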

Speculative decoding (flagship throughput boost)

Llama 3.3 70B and DeepSeek-V3 ship with speculative decoding enabled by default — a small draft model proposes tokens, the big model verifies in parallel. Net 2-4× throughput on long generations vs naive decoding. No code change needed.

Image generation

import os

import requests

r = requests.post(
    "https://api.fireworks.ai/inference/v1/image_generation/accounts/fireworks/models/flux-1-dev-fp8",
    headers={"Authorization": f"Bearer {os.environ['FIREWORKS_API_KEY']}", "Accept": "image/jpeg"},
    json={"prompt": "Cyberpunk Tokyo skyline at dusk, photorealistic", "aspect_ratio": "16:9", "steps": 30},
)
r.raise_for_status()  # don't write an error body into the .jpg
with open("output.jpg", "wb") as f:
    f.write(r.content)

Pricing snapshot (per 1M tokens, May 2026)

  • Llama 3.3 70B: $0.90 input / $0.90 output
  • DeepSeek-V3: $0.90 / $0.90
  • DeepSeek-R1: $3.00 / $8.00
  • Qwen 2.5 72B: $0.90 / $0.90
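At these rates, per-request cost is simple arithmetic. A quick estimator, with prices copied from the snapshot above (they will drift; check the pricing page before relying on them):

```python
# Per-1M-token prices (input, output) in USD, from the May 2026 snapshot above.
PRICES = {
    "llama-3.3-70b": (0.90, 0.90),
    "deepseek-v3": (0.90, 0.90),
    "deepseek-r1": (3.00, 8.00),
    "qwen-2.5-72b": (0.90, 0.90),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Rough USD cost for one request at the listed per-1M-token rates."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000
```

Note the asymmetry on DeepSeek-R1: reasoning models emit long chains of thought as output tokens, so the $8.00 output rate usually dominates.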

FAQ

Q: Fireworks vs Together AI vs Groq? A: Groq is fastest for chat (LPU silicon, ~280 tok/s on Llama 3.3) but limited model catalog. Fireworks and Together both offer 100+ models with similar pricing; Fireworks edges ahead on throughput and image-gen, Together on long-context Llama variants.

Q: Does Fireworks support fine-tunes? A: Yes — serverless LoRA fine-tuning on Llama, Qwen, Mixtral. Train via the Firectl CLI, deploy on the same OpenAI-compatible endpoint with your fine-tune model ID. Pricing is per training token + flat hosting fee.

Q: How do I monitor cost and latency? A: Fireworks dashboard at fireworks.ai shows token usage, cost, p50/p95 latency per model. For trace-level observability, instrument with Phoenix or Langfuse — OpenInference works since Fireworks is OpenAI-compat.


Quick Use

  1. Get FIREWORKS_API_KEY at fireworks.ai
  2. OpenAI(base_url='https://api.fireworks.ai/inference/v1', api_key=KEY)
  3. Use model='accounts/fireworks/models/llama-v3p3-70b-instruct'


Source & Thanks

Built by Fireworks AI. Docs at docs.fireworks.ai.

fw-ai/forge — open SDKs and tooling

