Quick Use
- `pip install fireworks-ai && firectl signin`
- Prepare JSONL with `{"messages": [...]}` per line
- `firectl create fine-tuning-job --base-model llama-v3p1-8b-instruct --dataset NAME`
Intro
Fireworks Fine-Tuning runs serverless LoRA on Llama 3.x, Qwen 2.5, and Mixtral: upload a JSONL training file via the Firectl CLI, wait 30-60 minutes, and your fine-tune is deployed at the same OpenAI-compatible endpoint under a new model ID. No GPU rental, no idle hosting fee. Best for: classification heads on top of Llama 8B, instruction-following adapters, domain-tone tuning, and distilling GPT-4o behavior into a cheap base model. Works with: any client that hits Fireworks. Setup time: 30 minutes from JSONL to live model.
Prepare training data (JSONL)
{"messages":[{"role":"system","content":"Classify support tickets as urgent / billing / general."},{"role":"user","content":"My card was charged twice"},{"role":"assistant","content":"billing"}]}
{"messages":[{"role":"system","content":"Classify support tickets as urgent / billing / general."},{"role":"user","content":"Site down for an hour"},{"role":"assistant","content":"urgent"}]}
{"messages":[{"role":"system","content":"Classify support tickets as urgent / billing / general."},{"role":"user","content":"How do I export data?"},{"role":"assistant","content":"general"}]}200-2,000 examples is the sweet spot for LoRA. Below 100 → underfit, above 5,000 → diminishing returns for most domain-tone tasks.
Submit job (Firectl CLI)
# Install + log in
pip install fireworks-ai
firectl signin
# Upload dataset
firectl create dataset support-triage --file train.jsonl
# Launch fine-tune
firectl create fine-tuning-job \
--base-model accounts/fireworks/models/llama-v3p1-8b-instruct \
--dataset support-triage \
--output-model my-support-triage-v1 \
--epochs 3 \
  --learning-rate 0.0001

Use the fine-tune
import os
from openai import OpenAI

# OpenAI-compatible client pointed at Fireworks (key assumed in FIREWORKS_API_KEY)
client = OpenAI(base_url="https://api.fireworks.ai/inference/v1",
                api_key=os.environ["FIREWORKS_API_KEY"])

resp = client.chat.completions.create(
    model="accounts/<your_account>/models/my-support-triage-v1",
    messages=[{"role": "user", "content": "Refund didn't go through"}],
)
print(resp.choices[0].message.content)  # → "billing"

Cost characteristics (May 2026)
| Item | Cost |
|---|---|
| Training | ~$0.50 per 1M training tokens |
| Hosted inference (deployed LoRA) | Same as base model rate |
| Idle hosting fee | $0 |
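The training rate above turns into a quick budget estimate. A sketch where the per-example token count and epoch count are illustrative assumptions, not measured values:

```python
def training_cost_usd(examples, avg_tokens_per_example, epochs, rate_per_m_tokens=0.50):
    """Estimate fine-tune cost at ~$0.50 per 1M training tokens.

    Every epoch re-processes the full dataset, so total training tokens
    scale linearly with epochs.
    """
    total_tokens = examples * avg_tokens_per_example * epochs
    return total_tokens / 1_000_000 * rate_per_m_tokens

# Hypothetical run: 1,000 examples averaging 200 tokens, 3 epochs
# = 600,000 training tokens -> ~$0.30
print(training_cost_usd(1_000, 200, 3))
```

At these rates, training cost is usually negligible next to inference spend; the LoRA sweet spot (200-2,000 examples) stays well under a dollar.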
When to fine-tune vs prompt-engineer
| Symptom | Use |
|---|---|
| Model gets the right answer with a 4-shot prompt | Prompt |
| Need to match a specific output format perfectly | Fine-tune |
| Domain jargon and tone consistency | Fine-tune |
| Latency budget can't fit few-shot examples in context | Fine-tune |
| Training data <50 examples | Prompt |
FAQ
Q: How long does training take? A: 30-60 minutes for a typical 1K-example LoRA on Llama 8B. Larger datasets or a 70B base model can run 2-4 hours. Firectl shows live progress; you can also check status in the dashboard.
Q: Can I download my fine-tune weights? A: Yes, for LoRA adapters: Firectl exports the safetensors. The base model isn't redistributable, but the adapter you trained is yours. Useful if you want to host the same LoRA on a self-managed GPU later.
Q: Does it support full fine-tuning (not LoRA)? A: Currently LoRA-only on the serverless plan. Full fine-tuning is available on Fireworks dedicated deployments where you rent GPUs hourly. For most domain-tuning tasks LoRA is the right tradeoff.
Source & Thanks
Built by Fireworks AI. Fine-tuning docs at docs.fireworks.ai/fine-tuning.
The Firectl CLI is MIT-licensed.