# DataFlow — LLM Data Prep Pipelines + WebUI

> DataFlow is an LLM data-prep system with operator pipelines; install via uv, validate with `dataflow -v`, then launch `dataflow webui`.

## Install

Save the content below to `.claude/skills/` or append to your `CLAUDE.md`:

## Quick Use

1. Install (recommended by README):
   ```bash
   pip install uv
   uv pip install open-dataflow
   ```
2. Verify:
   ```bash
   dataflow -v
   ```
3. Launch the WebUI:
   ```bash
   dataflow webui
   ```
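If you want to script the verify step (for example in a setup check), the sketch below wraps it in Python. It relies only on the `dataflow -v` command shown above; the helper name `cli_version` and the 60-second timeout are illustrative assumptions, not part of DataFlow.

```python
import shutil
import subprocess


def cli_version(command="dataflow"):
    """Return the text printed by `<command> -v`, or None if the command is not on PATH."""
    if shutil.which(command) is None:
        return None
    # check=False: report whatever the CLI prints instead of raising on a non-zero exit.
    result = subprocess.run(
        [command, "-v"], capture_output=True, text=True, check=False, timeout=60
    )
    return (result.stdout or result.stderr).strip()


if __name__ == "__main__":
    version = cli_version()
    if version is None:
        print("dataflow not found on PATH; re-run the install step")
    else:
        print(version)
```

Returning `None` for a missing command keeps the "not installed" case separate from "installed but printed nothing".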

## Intro

DataFlow uses operators and pipelines to generate, clean, evaluate, and filter LLM training data. After installing with uv, verify the CLI with `dataflow -v`, then run `dataflow webui` to launch the visual pipeline builder.

- **Best for:** teams building domain datasets for fine-tuning, RL, or RAG who want reproducible operator pipelines
- **Works with:** Python; uv; optional vLLM extra; WebUI for pipeline building (per README)
- **Setup time:** 20–45 minutes

## Practical Notes

- Version check: the README's example `dataflow -v` output includes **open-dataflow codebase version: 1.0.0**. Your installed version may differ.
- Recency: the README news section dates the WebUI announcement **2026-02-02**, so it is a recent workflow surface that teams can still standardize on early.

## Where DataFlow fits in an agent stack

If your team is already doing RAG or fine-tuning, DataFlow is useful when you want **repeatable data quality loops**:

1. **Generate** candidates (from PDFs, logs, low-quality Q/A dumps).
2. **Refine** with operator transforms (normalization, rewriting, denoising).
3. **Evaluate + filter** to keep only high-signal items.
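The three stages above can be sketched in plain Python to make the data flow concrete. This is not DataFlow's operator API: `normalize_whitespace`, `min_answer_length`, `run_pipeline`, and the sample records are hypothetical names and data chosen only for illustration.

```python
def normalize_whitespace(record):
    """Refine step: collapse runs of whitespace in every string field."""
    return {
        key: " ".join(value.split()) if isinstance(value, str) else value
        for key, value in record.items()
    }


def min_answer_length(min_chars):
    """Build a filter that keeps records whose answer has at least min_chars characters."""
    def keep(record):
        return len(record.get("answer", "")) >= min_chars
    return keep


def run_pipeline(candidates, refiners, filters):
    """Apply every refiner in order, then keep records that pass every filter."""
    kept = []
    for record in candidates:
        for refine in refiners:
            record = refine(record)
        if all(passes(record) for passes in filters):
            kept.append(record)
    return kept


# Hypothetical generated candidates (stage 1 would normally produce these).
candidates = [
    {"question": "How do I reset  my password?", "answer": "Open Settings,   then choose Reset."},
    {"question": "Hi", "answer": "ok"},
]

kept = run_pipeline(candidates, [normalize_whitespace], [min_answer_length(10)])
print(f"kept {len(kept)} of {len(candidates)} candidates")
```

Keeping refiners and filters as separate lists mirrors the refine-then-filter split, so you can swap one stage without touching the other.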

## A minimal first pipeline

- Pick one narrow domain (e.g., “customer support → product X”).
- Build a 100–500 sample dataset and run it through the same pipeline weekly.
- Track two numbers: acceptance rate after filtering, and model quality delta after training or RAG updates.
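As a rough sketch of the two tracked numbers, the snippet below computes acceptance rate and week-over-week quality delta. The function names, the `eval_score` field, and every figure in `weekly_runs` are made-up placeholders; substitute your own pipeline counts and evaluation metric.

```python
def acceptance_rate(kept, total):
    """Fraction of candidate samples that survived filtering."""
    if total <= 0:
        raise ValueError("total must be positive")
    return kept / total


def quality_delta(before, after):
    """Change in the downstream evaluation score between two runs."""
    return after - before


# Placeholder numbers for two weekly runs of the same pipeline.
weekly_runs = [
    {"week": "W1", "total": 400, "kept": 220, "eval_score": 0.61},
    {"week": "W2", "total": 400, "kept": 260, "eval_score": 0.64},
]

previous = None
for run in weekly_runs:
    rate = acceptance_rate(run["kept"], run["total"])
    line = f"{run['week']}: acceptance {rate:.2f}"
    if previous is not None:
        delta = quality_delta(previous["eval_score"], run["eval_score"])
        line += f", quality delta {delta:+.2f}"
    print(line)
    previous = run
```

Read the two numbers together: an acceptance rate that rises with no quality gain may mean the filters have loosened rather than that the data improved.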

The WebUI helps teams collaborate on pipeline structure without everyone editing code.

### FAQ

**Q: Do I need GPUs to start?**
A: No. The README describes optional GPU/vLLM installs, but you can validate the CLI and pipeline structure first.

**Q: Why use uv?**
A: The README recommends uv for faster installs and reproducible environments.

**Q: What should I measure?**
A: Dataset acceptance rate and downstream model quality deltas across weekly pipeline runs.

## Source & Thanks

> Source: https://github.com/OpenDCAI/DataFlow
> License: Apache-2.0
> GitHub stars: 3,485 · forks: 340

---
Source: https://tokrepo.com/en/workflows/dataflow-llm-data-prep-pipelines-webui
Author: Script Depot