Knowledge · May 8, 2026 · 5 min read

Fireworks Fine-Tuning — Serverless LoRA on Llama in 30 min

Fireworks runs serverless LoRA fine-tuning on Llama, Qwen, Mixtral. Upload JSONL, get a deployed fine-tune in 30 min on the same endpoint.

Universal CLI command
npx tokrepo install 2f07f6a8-78ac-480a-b7a4-00282133dd4d
Introduction

Fireworks Fine-Tuning runs serverless LoRA on Llama 3.x, Qwen 2.5, and Mixtral: upload a JSONL training file via the Firectl CLI, wait 30-60 minutes, and your fine-tune is deployed at the same OpenAI-compatible endpoint under a new model ID. No GPU rental, no idle hosting fee. Best for: classification heads on top of Llama 8B, instruction-following adapters, domain-tone tuning, distilling GPT-4o behavior into a cheap base model. Works with: any client that hits Fireworks. Setup time: 30 minutes from JSONL to live model.


Prepare training data (JSONL)

{"messages":[{"role":"system","content":"Classify support tickets as urgent / billing / general."},{"role":"user","content":"My card was charged twice"},{"role":"assistant","content":"billing"}]}
{"messages":[{"role":"system","content":"Classify support tickets as urgent / billing / general."},{"role":"user","content":"Site down for an hour"},{"role":"assistant","content":"urgent"}]}
{"messages":[{"role":"system","content":"Classify support tickets as urgent / billing / general."},{"role":"user","content":"How do I export data?"},{"role":"assistant","content":"general"}]}

200-2,000 examples is the sweet spot for LoRA. Below 100 you risk underfitting; above 5,000 the returns diminish for most domain-tone tasks.
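Before uploading, it is worth checking that every line parses and ends with an assistant turn, since that last message is the label the LoRA learns. A minimal validator sketch (`validate_jsonl` is a hypothetical helper, not part of the Fireworks SDK):

```python
import json

def validate_jsonl(path):
    """Return a list of error strings; empty means the file looks ready to upload."""
    errors = []
    with open(path) as f:
        for i, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                row = json.loads(line)
            except json.JSONDecodeError as e:
                errors.append(f"line {i}: invalid JSON ({e})")
                continue
            msgs = row.get("messages")
            if not isinstance(msgs, list) or not msgs:
                errors.append(f"line {i}: missing non-empty 'messages' list")
            elif not isinstance(msgs[-1], dict) or msgs[-1].get("role") != "assistant":
                errors.append(f"line {i}: last message must be the assistant turn")
    return errors
```

Run it on train.jsonl before `firectl create dataset`; a single malformed line can fail the whole upload.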

Submit job (Firectl CLI)

# Install + log in
pip install fireworks-ai
firectl signin

# Upload dataset
firectl create dataset support-triage --file train.jsonl

# Launch fine-tune
firectl create fine-tuning-job \
  --base-model accounts/fireworks/models/llama-v3p1-8b-instruct \
  --dataset support-triage \
  --output-model my-support-triage-v1 \
  --epochs 3 \
  --learning-rate 0.0001

Use the fine-tune

import os
from openai import OpenAI  # Fireworks serves an OpenAI-compatible API

client = OpenAI(base_url="https://api.fireworks.ai/inference/v1",
                api_key=os.environ["FIREWORKS_API_KEY"])
resp = client.chat.completions.create(
    model="accounts/<your_account>/models/my-support-triage-v1",
    messages=[{"role": "user", "content": "Refund didn't go through"}],
)
print(resp.choices[0].message.content)  # → "billing"

Cost characteristics (May 2026)

Training → ~$0.50 per 1M training tokens
Hosted inference (deployed LoRA) → same as the base model rate
Idle hosting fee → $0
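At ~$0.50 per 1M training tokens, a back-of-envelope budget is dataset tokens times epochs. The sketch below approximates tokens as characters / 4, a rough English-text heuristic, not the tokenizer's exact count:

```python
def estimate_training_cost(n_examples, avg_chars_per_example, epochs=3,
                           price_per_m_tokens=0.50):
    """Rough training cost: chars/4 ≈ tokens, and every epoch re-reads the dataset."""
    tokens_per_epoch = n_examples * avg_chars_per_example / 4
    total_tokens = tokens_per_epoch * epochs
    return total_tokens / 1_000_000 * price_per_m_tokens

# 1,000 examples of ~400 chars, 3 epochs ≈ 300K training tokens → ~$0.15
print(estimate_training_cost(1000, 400))
```

Even a generous 5,000-example run stays in cents territory; the cost lever here is epochs, not dataset size.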

When to fine-tune vs prompt-engineer

Symptom → Use
Model gets the right answer with a 4-shot prompt → Prompt
Need to match a specific output format perfectly → Fine-tune
Domain jargon and tone consistency → Fine-tune
Latency budget can't fit few-shot examples in context → Fine-tune
Training data <50 examples → Prompt
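The first row of the table, where a handful of worked examples already gets the right answer, can be sketched as a few-shot message list. `build_prompt` is a hypothetical helper; pass its result as `messages=` to the same chat call shown above:

```python
# The same tickets from the JSONL above, used as in-context examples
# instead of training data: no job, no new model ID.
FEW_SHOT = [
    {"role": "system", "content": "Classify support tickets as urgent / billing / general."},
    {"role": "user", "content": "My card was charged twice"},
    {"role": "assistant", "content": "billing"},
    {"role": "user", "content": "Site down for an hour"},
    {"role": "assistant", "content": "urgent"},
]

def build_prompt(ticket):
    """Append the new ticket after the worked examples."""
    return FEW_SHOT + [{"role": "user", "content": ticket}]
```

If the few-shot version already classifies correctly, skip the fine-tune; the trade-off is the extra prompt tokens on every request, which is the latency row of the table.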

FAQ

Q: How long does training take? A: 30-60 minutes for a typical 1K-example LoRA on Llama 8B. Larger datasets or a 70B base model can run 2-4 hours. You can check live progress from the Firectl CLI or the dashboard.

Q: Can I download my fine-tune weights? A: Yes, for LoRA adapters: Firectl exports the safetensors. The base model isn't redistributable, but the adapter you trained is yours. Useful if you want to host the same LoRA on a self-managed GPU later.

Q: Does it support full fine-tuning (not LoRA)? A: Currently LoRA-only on the serverless plan. Full fine-tuning is available on Fireworks dedicated deployments where you rent GPUs hourly. For most domain-tuning tasks LoRA is the right tradeoff.


Quick Use

  1. pip install fireworks-ai && firectl signin
  2. Prepare JSONL with {messages: [...]} per line
  3. firectl create fine-tuning-job --base-model llama-v3p1-8b-instruct --dataset NAME


Source & Thanks

Built by Fireworks AI. Fine-tuning docs at docs.fireworks.ai/fine-tuning.

The Firectl CLI is MIT-licensed.

🙏

