How do I install Fireworks Inference — 100+ Open Models on OpenAI-Compat API?

Visit the asset page on TokRepo and click "Copy for agent" to get the installation instructions. Most assets can be installed with a single command.


Knowledge · May 8, 2026 · 5 min read

Fireworks Inference — 100+ Open Models on OpenAI-Compat API

Name: Fireworks Inference — 100+ Open Models on OpenAI-Compat API
Author: Fireworks AI

Fireworks runs Llama, Mixtral, DeepSeek, Qwen, and Phi behind an OpenAI-compatible API, with sub-second TTFT and speculative decoding on flagship models.

Fireworks AI · Community

Agent-ready

This asset can be read and installed directly by agents.

TokRepo exposes a universal CLI command, an install contract, JSON metadata, a per-adapter plan, and raw content so that agents can evaluate compatibility, risk, and next steps.

Stage only · 15/100

Agent surface: Any MCP/CLI agent
Type: Knowledge
Installation: Stage only
Trust: New
Entry: Asset

Universal CLI command

npx tokrepo install 63bacf7e-2334-4483-a208-b2b40b09383c


Introduction

Fireworks AI is a serverless inference platform for 100+ open-weight models, including Llama 3.3, Mixtral, DeepSeek-V3, Qwen 2.5, and Phi 4, plus image (Flux, SDXL) and audio (Whisper) models. The API is OpenAI-compatible and served at https://api.fireworks.ai/inference/v1. Time-to-first-token is sub-second, and speculative decoding on flagship models gives 2-4× throughput.

Best for: production apps that need fast OSS inference, teams switching off OpenAI for cost or compliance, and anyone running DeepSeek-V3 or Llama 3.3 at scale.
Works with: openai-python, openai-node, LangChain, LlamaIndex.
Setup time: 3 minutes.

OpenAI-compatible client

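import os

from openai import OpenAI

# Point the standard OpenAI client at the Fireworks endpoint.
client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key=os.environ["FIREWORKS_API_KEY"],  # raises KeyError if the variable is unset
)

# Model IDs use the full accounts/fireworks/models/... path.
resp = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p3-70b-instruct",
    messages=[{"role": "user", "content": "Write a haiku about open-source AI"}],
)
print(resp.choices[0].message.content)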
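Since the introduction highlights time-to-first-token, it can be useful to measure it from the client side. The following is a minimal sketch of one way to do that, not an official Fireworks utility. `measure_ttft` and `fake_chunk` are hypothetical helper names, and the demo runs offline against stand-in chunks. It assumes the stream yields objects shaped like openai-python streaming chunks (each exposing `choices[0].delta.content`), which is what `client.chat.completions.create(..., stream=True)` returns.

```python
import time
from types import SimpleNamespace
from typing import Callable, Iterable, Optional, Tuple


def measure_ttft(make_stream: Callable[[], Iterable]) -> Tuple[Optional[float], str]:
    """Return (seconds until the first non-empty token, full generated text).

    `make_stream` is a zero-argument callable that issues the request, so the
    timer starts before any network work happens.
    """
    start = time.perf_counter()
    ttft = None
    parts = []
    for chunk in make_stream():
        if not chunk.choices:
            continue  # some chunks carry no choices (for example, usage-only chunks)
        piece = chunk.choices[0].delta.content
        if piece:
            if ttft is None:
                ttft = time.perf_counter() - start
            parts.append(piece)
    return ttft, "".join(parts)


def fake_chunk(text):
    """Hypothetical stand-in for a streaming chunk, used only for the offline demo."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


# Offline demo: the first chunk has no content, as role-only chunks often do.
demo_ttft, demo_text = measure_ttft(
    lambda: [fake_chunk(None), fake_chunk("Open "), fake_chunk("weights")]
)
print(demo_text)
```

With a real client, the same helper would be called as `measure_ttft(lambda: client.chat.completions.create(model=..., messages=..., stream=True))`. Passing a callable rather than an already-open stream keeps the request latency inside the measurement.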

Production model paths

Model	Fireworks model ID
Llama 3.3 70B	`accounts/fireworks/models/llama-v3p3-70b-instruct`
DeepSeek-V3	`accounts/fireworks/models/deepseek-v3`
DeepSeek-R1	`accounts/fireworks/models/deepseek-r1`
Qwen 2.5 72B	`accounts/fireworks/models/qwen2p5-72b-instruct`
Mixtral 8×22B	`accounts/fireworks/models/mixtral-8x22b-instruct`
Whisper v3	`accounts/fireworks/models/whisper-v3`
Flux dev	`accounts/fireworks/models/flux-1-dev-fp8`
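Because every ID in the table shares the `accounts/fireworks/models/` prefix, application code may prefer to refer to models by a short alias. The lookup below is a small sketch of that idea. The alias keys and the `resolve_model` helper are hypothetical names chosen for illustration; only the model IDs themselves come from the table.

```python
# Short aliases (hypothetical) mapped to the Fireworks model IDs from the table.
MODEL_IDS = {
    "llama-3.3-70b": "accounts/fireworks/models/llama-v3p3-70b-instruct",
    "deepseek-v3": "accounts/fireworks/models/deepseek-v3",
    "deepseek-r1": "accounts/fireworks/models/deepseek-r1",
    "qwen-2.5-72b": "accounts/fireworks/models/qwen2p5-72b-instruct",
    "mixtral-8x22b": "accounts/fireworks/models/mixtral-8x22b-instruct",
    "whisper-v3": "accounts/fireworks/models/whisper-v3",
    "flux-dev": "accounts/fireworks/models/flux-1-dev-fp8",
}


def resolve_model(alias: str) -> str:
    """Translate a short alias into a full model ID, failing with a clear message."""
    try:
        return MODEL_IDS[alias]
    except KeyError:
        known = ", ".join(sorted(MODEL_IDS))
        raise ValueError(f"Unknown model alias {alias!r}; known aliases: {known}") from None


print(resolve_model("deepseek-v3"))
```

Keeping the mapping in one place means a model swap is a one-line change rather than a search across call sites.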

Speculative decoding (flagship throughput boost)

Llama 3.3 70B and DeepSeek-V3 ship with speculative decoding enabled by default: a small draft model proposes several tokens ahead, and the large model verifies them in parallel, keeping the ones it agrees with. The net effect is 2-4× throughput on long generations compared with naive token-by-token decoding. No code change is needed.
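To make the propose-then-verify idea concrete, here is a toy sketch of greedy speculative decoding. It is an illustration under simplifying assumptions, not Fireworks' implementation: the "models" are plain deterministic functions (`target_next` and `draft_next` are hypothetical), and the verification loop stands in for what a real engine does in a single batched forward pass, which the sketch models by counting one target pass per round.

```python
from typing import Callable, List, Sequence, Tuple

NextToken = Callable[[Sequence[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Generate n_new tokens; return (tokens, number of target verification rounds)."""
    out = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(out) < goal:
        # 1. The draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target checks every proposed position (one batched pass in a real engine).
        target_passes += 1
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if tok != expected:
                # 3. First mismatch: keep the agreed prefix plus the target's own token.
                out.extend(proposal[:i])
                out.append(expected)
                break
        else:
            # Every draft token was accepted; the same pass also yields one bonus token.
            out.extend(proposal)
            if len(out) < goal:
                out.append(target(out))
    return out, target_passes


def target_next(seq: Sequence[int]) -> int:
    """Toy 'large model': a fixed arithmetic rule over the last token."""
    return (3 * seq[-1] + 1) % 11


def draft_next(seq: Sequence[int]) -> int:
    """Toy 'draft model': agrees with the target except at every third position."""
    guess = target_next(seq)
    return (guess + 1) % 11 if len(seq) % 3 == 0 else guess


baseline = greedy_decode(target_next, [1], 12)
tokens, passes = speculative_decode(target_next, draft_next, [1], 12, k=4)
print(tokens == baseline, passes)
```

The output always matches plain greedy decoding because a draft token is kept only when the target would have produced it anyway. The saving comes from needing fewer target rounds than generated tokens.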

Image generation

import os

import requests

r = requests.post(
    "https://api.fireworks.ai/inference/v1/image_generation/accounts/fireworks/models/flux-1-dev-fp8",
    headers={"Authorization": f"Bearer {os.environ['FIREWORKS_API_KEY']}", "Accept": "image/jpeg"},
    json={"prompt": "Cyberpunk Tokyo skyline at dusk, photorealistic", "aspect_ratio": "16:9", "steps": 30},
    timeout=120,  # assumed generous limit; image generation can take a while
)
r.raise_for_status()  # fail loudly instead of saving an error payload as a JPEG
with open("output.jpg", "wb") as f:
    f.write(r.content)

Pricing snapshot (per 1M tokens, May 2026)

Llama 3.3 70B: $0.90 input / $0.90 output
DeepSeek-V3: $0.90 input / $0.90 output
DeepSeek-R1: $3.00 input / $8.00 output
Qwen 2.5 72B: $0.90 input / $0.90 output
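The per-million rates above translate into a request or monthly budget with simple arithmetic. The sketch below hard-codes the snapshot prices for illustration. `estimate_cost` is a hypothetical helper, and the figures should be refreshed from the current Fireworks pricing page before being relied on.

```python
# USD per 1M tokens as (input, output), copied from the May 2026 snapshot above.
PRICES = {
    "Llama 3.3 70B": (0.90, 0.90),
    "DeepSeek-V3": (0.90, 0.90),
    "DeepSeek-R1": (3.00, 8.00),
    "Qwen 2.5 72B": (0.90, 0.90),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated spend in USD for the given token counts."""
    input_rate, output_rate = PRICES[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Example workload: 2M input tokens and 0.5M output tokens on DeepSeek-R1.
r1_cost = estimate_cost("DeepSeek-R1", 2_000_000, 500_000)
print(f"${r1_cost:.2f}")
```

For the example workload, this works out to 2 × $3.00 + 0.5 × $8.00 = $10.00. The same token counts on DeepSeek-V3 would cost 2.5 × $0.90 = $2.25.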

FAQ

Q: Fireworks vs Together AI vs Groq? A: Groq is fastest for chat (LPU silicon, ~280 tok/s on Llama 3.3) but has a limited model catalog. Fireworks and Together both offer 100+ models at similar pricing; Fireworks edges ahead on throughput and image generation, Together on long-context Llama variants.

Q: Does Fireworks support fine-tunes? A: Yes. It offers serverless LoRA fine-tuning on Llama, Qwen, and Mixtral. Train via the firectl CLI, then deploy on the same OpenAI-compatible endpoint using your fine-tune's model ID. Pricing is per training token plus a flat hosting fee.

Q: How do I monitor cost and latency? A: The Fireworks dashboard at fireworks.ai shows token usage, cost, and p50/p95 latency per model. For trace-level observability, instrument with Phoenix or Langfuse; OpenInference works because Fireworks is OpenAI-compatible.

Quick Use

1. Get a FIREWORKS_API_KEY at fireworks.ai.
2. Create the client: OpenAI(base_url='https://api.fireworks.ai/inference/v1', api_key=KEY)
3. Use model='accounts/fireworks/models/llama-v3p3-70b-instruct'


Source & Thanks

Built by Fireworks AI. Docs at docs.fireworks.ai.

fw-ai/forge — open SDKs and tooling

