# Fireworks Fine-Tuning — Serverless LoRA on Llama in 30 min

> Fireworks runs serverless LoRA fine-tuning on Llama, Qwen, and Mixtral. Upload a JSONL dataset and get a deployed fine-tune in about 30 minutes on the same endpoint.

## Quick Use

1. `pip install fireworks-ai && firectl signin`
2. Prepare a JSONL file with one `{"messages": [...]}` object per line
3. `firectl create fine-tuning-job --base-model llama-v3p1-8b-instruct --dataset NAME`

---

## Intro

Fireworks Fine-Tuning runs serverless LoRA on Llama 3.x, Qwen 2.5, and Mixtral. Upload a JSONL training file via the firectl CLI, wait 30-60 minutes, and your fine-tune is deployed at the same OpenAI-compatible endpoint under a new model ID. No GPU rental, no idle hosting fee.

Best for: classification heads on top of Llama 8B, instruction-following adapters, domain-tone tuning, distilling GPT-4o behavior into a cheap base model.

Works with: any client that hits Fireworks.

Setup time: 30 minutes from JSONL to live model.

---

### Prepare training data (JSONL)

```jsonl
{"messages":[{"role":"system","content":"Classify support tickets as urgent / billing / general."},{"role":"user","content":"My card was charged twice"},{"role":"assistant","content":"billing"}]}
{"messages":[{"role":"system","content":"Classify support tickets as urgent / billing / general."},{"role":"user","content":"Site down for an hour"},{"role":"assistant","content":"urgent"}]}
{"messages":[{"role":"system","content":"Classify support tickets as urgent / billing / general."},{"role":"user","content":"How do I export data?"},{"role":"assistant","content":"general"}]}
```

200-2,000 examples is the sweet spot for LoRA. Below 100 examples the model tends to underfit; above 5,000 you hit diminishing returns for most domain-tone tasks.
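The chat format above is easy to get subtly wrong at scale (a missing role, a non-string content field), and a malformed line can fail the whole upload. A minimal validation sketch; the expectation that the last message is the assistant target reflects the examples above and is an assumption, not a documented firectl requirement:

```python
import json

VALID_ROLES = {"system", "user", "assistant"}

def validate_line(line: str) -> None:
    """Raise ValueError if a JSONL line is not a valid chat example."""
    record = json.loads(line)
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        raise ValueError("each line needs a non-empty 'messages' list")
    for msg in messages:
        if msg.get("role") not in VALID_ROLES:
            raise ValueError(f"unknown role: {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str):
            raise ValueError("every message needs string 'content'")
    if messages[-1]["role"] != "assistant":
        raise ValueError("last message should be the assistant target")

def validate_file(path: str) -> int:
    """Validate every non-blank line; return the number of examples."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                validate_line(line)
            except ValueError as err:
                raise ValueError(f"line {lineno}: {err}") from err
            count += 1
    return count
```

Running `validate_file("train.jsonl")` before `firectl create dataset` also gives you the example count, which you can check against the 200-2,000 sweet spot.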
### Submit job (firectl CLI)

```bash
# Install + log in
pip install fireworks-ai
firectl signin

# Upload dataset
firectl create dataset support-triage --file train.jsonl

# Launch fine-tune
firectl create fine-tuning-job \
  --base-model accounts/fireworks/models/llama-v3p1-8b-instruct \
  --dataset support-triage \
  --output-model my-support-triage-v1 \
  --epochs 3 \
  --learning-rate 0.0001
```

### Use the fine-tune

```python
from openai import OpenAI

# Fireworks serves an OpenAI-compatible endpoint; point any OpenAI client at it.
client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="YOUR_FIREWORKS_API_KEY",
)

resp = client.chat.completions.create(
    # Replace <account-id> with your Fireworks account ID.
    model="accounts/<account-id>/models/my-support-triage-v1",
    messages=[{"role": "user", "content": "Refund didn't go through"}],
)
print(resp.choices[0].message.content)  # → "billing"
```

### Cost characteristics (May 2026)

| Item | Cost |
|---|---|
| Training | ~$0.50 per 1M training tokens |
| Hosted inference (deployed LoRA) | Same as the base-model rate |
| Idle hosting fee | $0 |

### When to fine-tune vs. prompt-engineer

| Symptom | Use |
|---|---|
| Model gets the right answer with a 4-shot prompt | **Prompt** |
| Need to match a specific output format exactly | **Fine-tune** |
| Domain jargon and tone consistency | **Fine-tune** |
| Latency budget can't fit few-shot examples in context | **Fine-tune** |
| Training data under 50 examples | **Prompt** |

---

### FAQ

**Q: How long does training take?**
A: 30-60 minutes for a typical 1K-example LoRA run on Llama 8B. Larger datasets or a 70B base model can take 2-4 hours. firectl shows live progress, and status is also visible on the dashboard.

**Q: Can I download my fine-tune weights?**
A: Yes, for LoRA adapters: firectl exports the safetensors. The base model isn't redistributable, but the adapter you trained is yours. That's useful if you later want to host the same LoRA on a self-managed GPU.

**Q: Does it support full fine-tuning (not LoRA)?**
A: Currently LoRA-only on the serverless plan. Full fine-tuning is available on Fireworks dedicated deployments, where you rent GPUs hourly. For most domain-tuning tasks, LoRA is the right tradeoff.
---

## Source & Thanks

> Built by [Fireworks AI](https://github.com/fw-ai). Fine-tuning docs at [docs.fireworks.ai/fine-tuning](https://docs.fireworks.ai/fine-tuning).
>
> The firectl CLI is MIT-licensed.

---

Source: https://tokrepo.com/en/workflows/fireworks-fine-tuning-serverless-lora-on-llama-in-30-min
Author: Fireworks AI