Knowledge · May 8, 2026 · 5 min read

Fireworks Fine-Tuning — Serverless LoRA on Llama in 30 min

Fireworks runs serverless LoRA fine-tuning on Llama, Qwen, Mixtral. Upload JSONL, get a deployed fine-tune in 30 min on the same endpoint.

Agent ready

This asset can be read and installed directly by agents

TokRepo exposes a universal CLI command, install contract, metadata JSON, adapter-aware plan, and raw content links so agents can judge fit, risk, and next actions.

Native · 96/100 · Policy: allow
Agent surface: Any MCP/CLI agent
Kind: Knowledge
Install: Single
Trust: New
Entrypoint: Asset

Universal CLI install command:
npx tokrepo install 2f07f6a8-78ac-480a-b7a4-00282133dd4d
Intro

Fireworks Fine-Tuning runs serverless LoRA on Llama 3.x, Qwen 2.5, and Mixtral — upload a JSONL training file via the Firectl CLI, wait 30-60 minutes, and your fine-tune is deployed at the same OpenAI-compatible endpoint under a new model ID. No GPU rental, no idle hosting fee. Best for: classification heads on top of Llama 8B, instruction-following adapters, domain-tone tuning, and distilling GPT-4o behavior into a cheap base model. Works with: any client that hits Fireworks. Setup time: 30 minutes from JSONL to live model.


Prepare training data (JSONL)

{"messages":[{"role":"system","content":"Classify support tickets as urgent / billing / general."},{"role":"user","content":"My card was charged twice"},{"role":"assistant","content":"billing"}]}
{"messages":[{"role":"system","content":"Classify support tickets as urgent / billing / general."},{"role":"user","content":"Site down for an hour"},{"role":"assistant","content":"urgent"}]}
{"messages":[{"role":"system","content":"Classify support tickets as urgent / billing / general."},{"role":"user","content":"How do I export data?"},{"role":"assistant","content":"general"}]}

200-2,000 examples is the sweet spot for LoRA. Below 100 → underfit, above 5,000 → diminishing returns for most domain-tone tasks.
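Before uploading, it's worth sanity-checking that every line parses and follows the chat schema. A minimal sketch — the validation rules here are my own reasonable checks, not a documented Fireworks requirement:

```python
import json

VALID_ROLES = {"system", "user", "assistant"}

def validate_jsonl(path):
    """Return a list of problems found in a chat-format JSONL training file."""
    errors = []
    with open(path) as f:
        for n, line in enumerate(f, 1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as e:
                errors.append(f"line {n}: invalid JSON ({e})")
                continue
            messages = record.get("messages")
            if not isinstance(messages, list) or not messages:
                errors.append(f"line {n}: missing 'messages' list")
                continue
            for m in messages:
                if m.get("role") not in VALID_ROLES or not m.get("content"):
                    errors.append(f"line {n}: malformed message {m!r}")
            # The final assistant turn is what the model learns to produce
            if messages[-1].get("role") != "assistant":
                errors.append(f"line {n}: last message should be the assistant target")
    return errors
```

Run `validate_jsonl("train.jsonl")` before `firectl create dataset`; an empty list means the file is at least structurally sound.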

Submit job (Firectl CLI)

# Install + log in
pip install fireworks-ai
firectl signin

# Upload dataset
firectl create dataset support-triage --file train.jsonl

# Launch fine-tune
firectl create fine-tuning-job \
  --base-model accounts/fireworks/models/llama-v3p1-8b-instruct \
  --dataset support-triage \
  --output-model my-support-triage-v1 \
  --epochs 3 \
  --learning-rate 0.0001

Use the fine-tune

import os
from openai import OpenAI

# Fireworks serves fine-tunes on its OpenAI-compatible endpoint
client = OpenAI(base_url="https://api.fireworks.ai/inference/v1",
                api_key=os.environ["FIREWORKS_API_KEY"])
resp = client.chat.completions.create(
    model="accounts/<your_account>/models/my-support-triage-v1",
    messages=[{"role": "user", "content": "Refund didn't go through"}],
)
print(resp.choices[0].message.content)  # → "billing"

Cost characteristics (May 2026)

Item                              | Cost
Training                          | ~$0.50 per 1M training tokens
Hosted inference (deployed LoRA)  | Same as base-model rate
Idle hosting fee                  | $0
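At that rate, a back-of-the-envelope estimate for the triage example above — the 100-tokens-per-example figure is an assumption for short chats; measure yours with a tokenizer:

```python
# Rough training-cost estimate at ~$0.50 per 1M training tokens.
examples = 1_000
tokens_per_example = 100    # assumed average for short triage chats
epochs = 3                  # matches the --epochs 3 flag above
price_per_million = 0.50    # USD

total_tokens = examples * tokens_per_example * epochs
cost = total_tokens / 1_000_000 * price_per_million
print(f"~${cost:.2f} for {total_tokens:,} training tokens")  # → ~$0.15 for 300,000 training tokens
```

Even a 5,000-example run at three epochs stays under a dollar, which is why the "when to fine-tune" decision below is rarely about cost.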

When to fine-tune vs prompt-engineer

Symptom                                                 | Use
Model gets the right answer with a 4-shot prompt        | Prompt
Need to match a specific output format perfectly        | Fine-tune
Domain jargon and tone consistency                      | Fine-tune
Latency budget can't fit few-shot examples in context   | Fine-tune
Training data <50 examples                              | Prompt

FAQ

Q: How long does training take? A: 30-60 minutes for a typical 1K-example LoRA on Llama 8B. Larger datasets or a 70B base model can take 2-4 hours. Firectl shows live progress, and you can also check status from the dashboard.

Q: Can I download my fine-tune weights? A: Yes for LoRA adapters — Firectl exports the safetensors. The base model isn't redistributable but the adapter you trained is yours. Useful if you want to host the same LoRA on a self-managed GPU later.

Q: Does it support full fine-tuning (not LoRA)? A: Currently LoRA-only on the serverless plan. Full fine-tuning is available on Fireworks dedicated deployments where you rent GPUs hourly. For most domain-tuning tasks LoRA is the right tradeoff.


Quick Use

  1. pip install fireworks-ai && firectl signin
  2. Prepare JSONL with {messages: [...]} per line
  3. firectl create fine-tuning-job --base-model llama-v3p1-8b-instruct --dataset NAME


Source & Thanks

Built by Fireworks AI. Fine-tuning docs at docs.fireworks.ai/fine-tuning.

Firectl CLI MIT-licensed.

🙏

