Knowledge · May 8, 2026 · 5 min read

Fireworks Fine-Tuning — Serverless LoRA on Llama in 30 min

Fireworks runs serverless LoRA fine-tuning on Llama, Qwen, Mixtral. Upload JSONL, get a deployed fine-tune in 30 min on the same endpoint.

Agent-ready

This asset can be read and installed directly by agents.

TokRepo exposes a universal CLI command, an install contract, JSON metadata, a per-adapter plan, and the raw content to help agents judge fit, risk, and next actions.

Native · 96/100 · Policy: allow
Agent surface: any MCP/CLI agent
Type: Knowledge
Installation: Single
Trust: New
Entry point: Asset

Universal CLI command:
npx tokrepo install 2f07f6a8-78ac-480a-b7a4-00282133dd4d
Introduction

Fireworks Fine-Tuning runs serverless LoRA on Llama 3.x, Qwen 2.5, and Mixtral: upload a JSONL training file via the firectl CLI, wait 30-60 minutes, and your fine-tune is deployed at the same OpenAI-compatible endpoint under a new model ID. No GPU rental, no idle hosting fee. Best for: classification heads on top of Llama 8B, instruction-following adapters, domain-tone tuning, and distilling GPT-4o behavior into a cheap base model. Works with: any client that hits Fireworks. Setup time: 30 minutes from JSONL to live model.


Prepare training data (JSONL)

{"messages":[{"role":"system","content":"Classify support tickets as urgent / billing / general."},{"role":"user","content":"My card was charged twice"},{"role":"assistant","content":"billing"}]}
{"messages":[{"role":"system","content":"Classify support tickets as urgent / billing / general."},{"role":"user","content":"Site down for an hour"},{"role":"assistant","content":"urgent"}]}
{"messages":[{"role":"system","content":"Classify support tickets as urgent / billing / general."},{"role":"user","content":"How do I export data?"},{"role":"assistant","content":"general"}]}

200-2,000 examples is the sweet spot for LoRA: below ~100 the adapter tends to underfit, and above ~5,000 returns diminish for most domain-tone tasks.
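Before uploading, it's worth sanity-checking the JSONL locally. A minimal sketch; `validate_jsonl` is an illustrative helper, not part of any Fireworks tooling, checking only the `{"messages": [...]}` shape shown above:

```python
import json

def validate_jsonl(path, min_examples=100):
    """Sanity-check a chat-format JSONL file before uploading.

    Returns (example_count, list_of_error_strings).
    """
    errors, n = [], 0
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            if not line.strip():
                continue
            n += 1
            try:
                row = json.loads(line)
            except json.JSONDecodeError as e:
                errors.append(f"line {i}: invalid JSON ({e})")
                continue
            msgs = row.get("messages")
            if not isinstance(msgs, list) or not msgs:
                errors.append(f"line {i}: missing 'messages' list")
                continue
            if not all(m.get("role") in {"system", "user", "assistant"}
                       and isinstance(m.get("content"), str) for m in msgs):
                errors.append(f"line {i}: every message needs a valid role and string content")
            elif msgs[-1]["role"] != "assistant":
                errors.append(f"line {i}: last message should be the assistant target")
    if n < min_examples:
        errors.append(f"only {n} examples; fewer than ~{min_examples} tends to underfit")
    return n, errors
```

Run it over `train.jsonl` before `firectl create dataset`; a malformed line caught locally is cheaper than a failed job.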

Submit job (Firectl CLI)

# Install + log in
pip install fireworks-ai
firectl signin

# Upload dataset
firectl create dataset support-triage --file train.jsonl

# Launch fine-tune
firectl create fine-tuning-job \
  --base-model accounts/fireworks/models/llama-v3p1-8b-instruct \
  --dataset support-triage \
  --output-model my-support-triage-v1 \
  --epochs 3 \
  --learning-rate 0.0001

Use the fine-tune

import os
from openai import OpenAI  # Fireworks serves an OpenAI-compatible API

client = OpenAI(base_url="https://api.fireworks.ai/inference/v1",
                api_key=os.environ["FIREWORKS_API_KEY"])

resp = client.chat.completions.create(
    model="accounts/<your_account>/models/my-support-triage-v1",
    messages=[{"role": "user", "content": "Refund didn't go through"}],
)
print(resp.choices[0].message.content)  # → "billing"
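Once the model is live, a quick held-out evaluation tells you whether the adapter actually beats prompting. A sketch of the scoring side; `load_holdout` and `accuracy` are hypothetical helpers meant to be paired with the chat-completion call shown above, run over a `holdout.jsonl` file in the same format as the training data:

```python
import json

def load_holdout(path):
    """Read held-out chat examples; returns (user_text, gold_label) pairs."""
    pairs = []
    for line in open(path, encoding="utf-8"):
        if not line.strip():
            continue
        msgs = json.loads(line)["messages"]
        user = next(m["content"] for m in msgs if m["role"] == "user")
        pairs.append((user, msgs[-1]["content"]))
    return pairs

def accuracy(preds, golds):
    """Fraction of exact (case-insensitive, whitespace-stripped) label matches."""
    return sum(p.strip().lower() == g.strip().lower()
               for p, g in zip(preds, golds)) / len(golds)
```

Send each user text through the fine-tune, collect the replies as `preds`, and compare against the gold labels; the same loop run against the base model with a few-shot prompt gives you a fair baseline.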

Cost characteristics (May 2026)

Item | Cost
Training | ~$0.50 per 1M training tokens
Hosted inference (deployed LoRA) | same as the base model rate
Idle hosting fee | $0
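You can back-of-envelope the training bill from the ~$0.50/1M rate in the table. A rough sketch, assuming the common characters/4 token approximation (use a real tokenizer for anything precise); `estimate_training_cost` is an illustrative helper, not Fireworks tooling:

```python
import json

def estimate_training_cost(path, epochs=3, usd_per_million=0.50):
    """Rough (total_training_tokens, usd_cost) for a chat-JSONL dataset.

    Token count is approximated as total message characters / 4,
    then multiplied by the number of epochs.
    """
    chars = 0
    for line in open(path, encoding="utf-8"):
        if not line.strip():
            continue
        for m in json.loads(line)["messages"]:
            chars += len(m["content"])
    total_tokens = chars / 4 * epochs
    return total_tokens, total_tokens / 1_000_000 * usd_per_million
```

For a typical 1K-example classification set this lands well under a dollar, which is why experimenting with several adapter versions is cheap.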

When to fine-tune vs prompt-engineer

Symptom | Use
Model gets the right answer with a 4-shot prompt | Prompt
Need to match a specific output format perfectly | Fine-tune
Domain jargon and tone consistency | Fine-tune
Latency budget can't fit few-shot examples in context | Fine-tune
Training data <50 examples | Prompt

FAQ

Q: How long does training take? A: 30-60 minutes for a typical 1K-example LoRA on Llama 8B. Larger datasets or a 70B base model can run 2-4 hours. firectl shows live progress, and status is also visible in the dashboard.

Q: Can I download my fine-tune weights? A: Yes, for LoRA adapters: firectl exports the safetensors. The base model isn't redistributable, but the adapter you trained is yours. Useful if you want to host the same LoRA on a self-managed GPU later.

Q: Does it support full fine-tuning (not LoRA)? A: Currently LoRA-only on the serverless plan. Full fine-tuning is available on Fireworks dedicated deployments, where you rent GPUs hourly. For most domain-tuning tasks, LoRA is the right tradeoff.


Quick Use

  1. pip install fireworks-ai && firectl signin
  2. Prepare JSONL with {messages: [...]} per line
  3. firectl create fine-tuning-job --base-model accounts/fireworks/models/llama-v3p1-8b-instruct --dataset NAME



Source & Thanks

Built by Fireworks AI. Fine-tuning docs at docs.fireworks.ai/fine-tuning.

The firectl CLI is MIT-licensed.

🙏

