# DataFlow: LLM Data Prep Pipelines + WebUI

> DataFlow is an LLM data-prep system with operator pipelines; install via uv, validate with `dataflow -v`, then launch `dataflow webui`.

## Install

Save the content below to `.claude/skills/` or append to your `CLAUDE.md`:

## Quick Use

1. Install (recommended by README):

   ```bash
   pip install uv
   uv pip install open-dataflow
   ```

2. Verify:

   ```bash
   dataflow -v
   ```

3. Launch the WebUI:

   ```bash
   dataflow webui
   ```

## Intro

DataFlow uses operators and pipelines to generate, clean, evaluate, and filter data. After installing with uv, verify the install with `dataflow -v`, then run `dataflow webui` to launch the visual pipeline builder.

- **Best for:** teams building domain datasets for fine-tuning, RL, or RAG who want reproducible operator pipelines
- **Works with:** Python; uv; optional vLLM extra; WebUI for pipeline building (per README)
- **Setup time:** 20–45 minutes

## Practical Notes

- The README shows example `dataflow -v` output containing **open-dataflow codebase version: 1.0.0**.
- The README news section dates the WebUI announcement to **2026-02-02**, so it is a recent workflow surface to standardize on.

## Where DataFlow fits in an agent stack

If your team is already doing RAG or fine-tuning, DataFlow is useful when you want **repeatable data quality loops**:

1. **Generate** candidates (from PDFs, logs, Q/A dumps).
2. **Refine** with operator transforms (normalization, rewriting, denoising).
3. **Evaluate + filter** to keep only high-signal items.

## A minimal first pipeline

- Pick one narrow domain (e.g., "customer support → product X").
- Build a 100–500 sample dataset and run it through the same pipeline weekly.
- Track two numbers: acceptance rate after filtering, and model quality delta after training or RAG updates.

The WebUI helps teams collaborate on pipeline structure without everyone editing code.

### FAQ

**Q: Do I need GPUs to start?**
A: No. The README describes optional GPU/vLLM installs, but you can validate the CLI and pipeline structure first.
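
One way to script the CLI validation mentioned in this answer is the minimal sketch below. It only wraps the `dataflow -v` command from Quick Use using the Python standard library; the helper name `check_dataflow_cli` and the shape of the returned dictionary are illustrative assumptions, not part of DataFlow.

```python
import shutil
import subprocess


def check_dataflow_cli(executable: str = "dataflow", timeout: float = 60.0) -> dict:
    """Run `<executable> -v` and report whether the CLI is usable.

    Hypothetical helper: it does not import DataFlow and needs no GPU.
    """
    path = shutil.which(executable)
    if path is None:
        # The command is not on PATH, so the install step has not succeeded yet.
        return {"installed": False, "returncode": None, "output": ""}
    try:
        result = subprocess.run(
            [path, "-v"], capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # The command exists but could not be run to completion.
        return {"installed": True, "returncode": None, "output": str(exc)}
    return {
        "installed": True,
        "returncode": result.returncode,
        "output": (result.stdout + result.stderr).strip(),
    }


if __name__ == "__main__":
    report = check_dataflow_cli()
    if not report["installed"]:
        print("dataflow not found on PATH; run the install step first")
    else:
        print(f"exit code: {report['returncode']}")
        print(report["output"])
```

A check like this can run in CI on a CPU-only machine before any GPU or vLLM extras are installed.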

**Q: Why use uv?**
A: The README recommends uv for faster installs and reproducible environments.

**Q: What should I measure?**
A: Dataset acceptance rate and downstream model quality deltas across weekly pipeline runs.

## Source & Thanks

> Source: https://github.com/OpenDCAI/DataFlow
> License: Apache-2.0
> GitHub stars: 3,485 · forks: 340

---

Source: https://tokrepo.com/en/workflows/dataflow-llm-data-prep-pipelines-webui
Author: Script Depot
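
The two numbers recommended above (acceptance rate after filtering, and model quality delta across weekly runs) can be tracked with a few lines of plain Python. This is a minimal sketch under stated assumptions: `WeeklyRun`, `acceptance_rate`, `weekly_report`, and the `eval_score` field are hypothetical names, the sample figures are made up for illustration, and nothing here calls DataFlow itself.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class WeeklyRun:
    """One weekly pass of the same pipeline (hypothetical record type)."""
    week: str          # label for the run, e.g. "week-1"
    candidates: int    # items entering the evaluate + filter step
    accepted: int      # items kept after filtering
    eval_score: float  # downstream score after training or a RAG update (assumed metric)


def acceptance_rate(run: WeeklyRun) -> float:
    """Fraction of candidates that survived filtering."""
    if run.candidates <= 0:
        raise ValueError("candidates must be positive")
    if not 0 <= run.accepted <= run.candidates:
        raise ValueError("accepted must be between 0 and candidates")
    return run.accepted / run.candidates


def weekly_report(runs: List[WeeklyRun]) -> List[dict]:
    """Per-week acceptance rate plus quality delta versus the previous run."""
    rows = []
    previous: Optional[WeeklyRun] = None
    for run in runs:
        # The first run has no baseline, so its delta is None.
        delta = None if previous is None else run.eval_score - previous.eval_score
        rows.append(
            {
                "week": run.week,
                "acceptance_rate": acceptance_rate(run),
                "quality_delta": delta,
            }
        )
        previous = run
    return rows


if __name__ == "__main__":
    # Illustrative numbers only, not measurements.
    history = [
        WeeklyRun("week-1", candidates=400, accepted=100, eval_score=0.5),
        WeeklyRun("week-2", candidates=400, accepted=300, eval_score=0.75),
    ]
    for row in weekly_report(history):
        print(row)
```

Keeping the record type fixed from week to week is what makes the runs comparable: a change in acceptance rate then reflects a change in the data or the operators, not in how the number was computed.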