Knowledge · May 8, 2026 · 5 min read

Fireworks Fine-Tuning — Serverless LoRA on Llama in 30 min

Fireworks runs serverless LoRA fine-tuning on Llama, Qwen, Mixtral. Upload JSONL, get a deployed fine-tune in 30 min on the same endpoint.

Agent ready

This asset can be read and installed directly by agents

TokRepo exposes a universal CLI command, install contract, metadata JSON, adapter-aware plan, and raw content links so agents can judge fit, risk, and next actions.

Native · 96/100 · Policy: allow
Agent surface: Any MCP/CLI agent
Kind: Knowledge
Install: Single
Trust: New
Entrypoint: Asset

Universal CLI install command:
npx tokrepo install 2f07f6a8-78ac-480a-b7a4-00282133dd4d
Intro

Fireworks Fine-Tuning runs serverless LoRA on Llama 3.x, Qwen 2.5, and Mixtral — upload a JSONL training file via the Firectl CLI, wait 30-60 minutes, and your fine-tune is deployed at the same OpenAI-compatible endpoint under a new model ID. No GPU rental, no idle hosting fee. Best for: classification heads on top of Llama 8B, instruction-following adapters, domain-tone tuning, and distilling GPT-4o behavior into a cheap base model. Works with: any client that hits Fireworks. Setup time: 30 minutes from JSONL to live model.


Prepare training data (JSONL)

{"messages":[{"role":"system","content":"Classify support tickets as urgent / billing / general."},{"role":"user","content":"My card was charged twice"},{"role":"assistant","content":"billing"}]}
{"messages":[{"role":"system","content":"Classify support tickets as urgent / billing / general."},{"role":"user","content":"Site down for an hour"},{"role":"assistant","content":"urgent"}]}
{"messages":[{"role":"system","content":"Classify support tickets as urgent / billing / general."},{"role":"user","content":"How do I export data?"},{"role":"assistant","content":"general"}]}

200-2,000 examples is the sweet spot for LoRA. Below 100 → underfit, above 5,000 → diminishing returns for most domain-tone tasks.
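Before uploading, it's worth sanity-checking that every line parses and follows the chat schema. A minimal sketch — the validation rules here are my own reasonable checks, not a documented Fireworks requirement:

```python
import json

VALID_ROLES = {"system", "user", "assistant"}

def validate_jsonl(path):
    """Return a list of problems found in a chat-format JSONL training file."""
    errors = []
    with open(path) as f:
        for n, line in enumerate(f, 1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as e:
                errors.append(f"line {n}: invalid JSON ({e})")
                continue
            messages = record.get("messages")
            if not isinstance(messages, list) or not messages:
                errors.append(f"line {n}: missing 'messages' list")
                continue
            for m in messages:
                if m.get("role") not in VALID_ROLES or not m.get("content"):
                    errors.append(f"line {n}: malformed message {m!r}")
            # The final assistant turn is what the model learns to produce
            if messages[-1].get("role") != "assistant":
                errors.append(f"line {n}: last message should be the assistant target")
    return errors
```

Run `validate_jsonl("train.jsonl")` before `firectl create dataset`; an empty list means the file is at least structurally sound.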

Submit job (Firectl CLI)

# Install + log in
pip install fireworks-ai
firectl signin

# Upload dataset
firectl create dataset support-triage --file train.jsonl

# Launch fine-tune
firectl create fine-tuning-job \
  --base-model accounts/fireworks/models/llama-v3p1-8b-instruct \
  --dataset support-triage \
  --output-model my-support-triage-v1 \
  --epochs 3 \
  --learning-rate 0.0001

Use the fine-tune

import os
from openai import OpenAI

# Fireworks serves fine-tunes on its OpenAI-compatible endpoint
client = OpenAI(base_url="https://api.fireworks.ai/inference/v1",
                api_key=os.environ["FIREWORKS_API_KEY"])
resp = client.chat.completions.create(
    model="accounts/<your_account>/models/my-support-triage-v1",
    messages=[{"role": "user", "content": "Refund didn't go through"}],
)
print(resp.choices[0].message.content)  # → "billing"

Cost characteristics (May 2026)

Item                              | Cost
Training                          | ~$0.50 per 1M training tokens
Hosted inference (deployed LoRA)  | Same as base-model rate
Idle hosting fee                  | $0
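At that rate, a back-of-the-envelope estimate for the triage example above — the 100-tokens-per-example figure is an assumption for short chats; measure yours with a tokenizer:

```python
# Rough training-cost estimate at ~$0.50 per 1M training tokens.
examples = 1_000
tokens_per_example = 100    # assumed average for short triage chats
epochs = 3                  # matches the --epochs 3 flag above
price_per_million = 0.50    # USD

total_tokens = examples * tokens_per_example * epochs
cost = total_tokens / 1_000_000 * price_per_million
print(f"~${cost:.2f} for {total_tokens:,} training tokens")  # → ~$0.15 for 300,000 training tokens
```

Even a 5,000-example run at three epochs stays under a dollar, which is why the "when to fine-tune" decision below is rarely about cost.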

When to fine-tune vs prompt-engineer

Symptom                                                 | Use
Model gets the right answer with a 4-shot prompt        | Prompt
Need to match a specific output format perfectly        | Fine-tune
Domain jargon and tone consistency                      | Fine-tune
Latency budget can't fit few-shot examples in context   | Fine-tune
Training data <50 examples                              | Prompt

FAQ

Q: How long does training take? A: 30-60 minutes for a typical 1K-example LoRA on Llama 8B. Larger datasets or a 70B base model can take 2-4 hours. Firectl shows live progress, and you can also check status from the dashboard.

Q: Can I download my fine-tune weights? A: Yes for LoRA adapters — Firectl exports the safetensors. The base model isn't redistributable but the adapter you trained is yours. Useful if you want to host the same LoRA on a self-managed GPU later.

Q: Does it support full fine-tuning (not LoRA)? A: Currently LoRA-only on the serverless plan. Full fine-tuning is available on Fireworks dedicated deployments where you rent GPUs hourly. For most domain-tuning tasks LoRA is the right tradeoff.


Quick Use

  1. pip install fireworks-ai && firectl signin
  2. Prepare JSONL with {messages: [...]} per line
  3. firectl create fine-tuning-job --base-model llama-v3p1-8b-instruct --dataset NAME


Source & Thanks

Built by Fireworks AI. Fine-tuning docs at docs.fireworks.ai/fine-tuning.

Firectl CLI MIT-licensed.

🙏

