Skills · May 12, 2026 · 2 min read

DataFlow — LLM Data Prep Pipelines + WebUI

DataFlow is an LLM data-prep system with operator pipelines; install via uv, validate with `dataflow -v`, then launch `dataflow webui`.

Agent ready

This asset can be read and installed directly by agents

TokRepo exposes a universal CLI command, install contract, metadata JSON, adapter-aware plan, and raw content links so agents can judge fit, risk, and next actions.

Native · 98/100 · Policy: allow

  • Agent surface: Any MCP/CLI agent
  • Kind: Skill
  • Install: Single
  • Trust: Established
  • Entrypoint: Asset

Universal CLI install command
npx tokrepo install 0cff37ca-924d-50d9-85f8-4f50fbef24cc
Intro

  • Best for: teams building domain datasets for fine-tuning, RL, or RAG who want reproducible operator pipelines
  • Works with: Python; uv; optional vLLM extra; WebUI for pipeline building (per README)
  • Setup time: 20–45 minutes
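
A quick way to script the validation step is sketched below. It assumes only that the install put a `dataflow` executable on PATH and that `dataflow -v` prints version information, as the summary states. The helper name `check_dataflow_cli` is hypothetical and not part of DataFlow.

```python
import shutil
import subprocess
from typing import Optional


def check_dataflow_cli(executable: str = "dataflow") -> Optional[str]:
    """Return the output of `<executable> -v`, or None if it is not on PATH.

    Hypothetical helper: it only shells out to the CLI, it does not import
    or rely on any DataFlow Python API.
    """
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run(
        [path, "-v"], capture_output=True, text=True, timeout=60
    )
    # Some CLIs print version info to stderr, so fall back to it.
    return (result.stdout or result.stderr).strip()


if __name__ == "__main__":
    output = check_dataflow_cli()
    if output is None:
        print("dataflow not found on PATH; install it first")
    else:
        print(output)
```

If the check passes, `dataflow webui` is the next command to try by hand.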

Practical Notes

  • Quant: the README shows example `dataflow -v` output that reports `open-dataflow codebase version: 1.0.0`.
  • Quant: the README news section dates the WebUI announcement to 2026-02-02, so it is a recent workflow surface to standardize on.

Where DataFlow fits in an agent stack

If your team is already doing RAG or fine-tuning, DataFlow is useful when you want repeatable data quality loops:

  1. Generate candidates (from PDFs, logs, Q/A dumps).
  2. Refine with operator transforms.
  3. Evaluate + filter to keep only high-signal items.
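
The three steps above can be sketched in plain Python as a chain of operators. This is an illustration of the loop's shape only, not DataFlow's API: `Sample`, `generate`, `refine`, `evaluate`, `make_filter`, and `run_pipeline` are hypothetical names, and the word-count score is a toy stand-in for a real quality evaluator.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Sample:
    question: str
    answer: str
    score: float = 0.0


Operator = Callable[[List[Sample]], List[Sample]]


def generate(raw_pairs: Sequence[Tuple[str, str]]) -> List[Sample]:
    """Step 1: turn raw (question, answer) pairs into candidate samples."""
    return [Sample(q, a) for q, a in raw_pairs]


def refine(samples: List[Sample]) -> List[Sample]:
    """Step 2: normalize whitespace in both fields."""
    return [
        replace(s, question=" ".join(s.question.split()),
                answer=" ".join(s.answer.split()))
        for s in samples
    ]


def evaluate(samples: List[Sample]) -> List[Sample]:
    """Step 3a: toy score based on answer length in words, capped at 1.0."""
    return [replace(s, score=min(len(s.answer.split()) / 5, 1.0))
            for s in samples]


def make_filter(threshold: float) -> Operator:
    """Step 3b: keep only samples at or above the score threshold."""
    def keep(samples: List[Sample]) -> List[Sample]:
        return [s for s in samples if s.score >= threshold]
    return keep


def run_pipeline(samples: List[Sample],
                 operators: Sequence[Operator]) -> List[Sample]:
    """Apply each operator in order to the running sample list."""
    for op in operators:
        samples = op(samples)
    return samples


if __name__ == "__main__":
    raw = [("  How do I   reset? ", "Open settings, choose reset, then confirm."),
           ("Hi", "ok")]
    kept = run_pipeline(generate(raw), [refine, evaluate, make_filter(0.5)])
    print(len(kept), "of", len(raw), "samples kept")
```

Keeping each step as a function over a list makes the operator order explicit, so the same sequence can be rerun unchanged on new data.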

A minimal first pipeline

  • Pick one narrow domain (e.g., “customer support → product X”).
  • Build a 100–500 sample dataset and run it through the same pipeline weekly.
  • Track two numbers: acceptance rate after filtering, and model quality delta after training or RAG updates.
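
The two numbers can be tracked with a few lines of bookkeeping. The sketch below assumes each weekly run records how many candidates went in, how many survived filtering, and one downstream evaluation score. `WeeklyRun` and its fields are hypothetical, and the figures in the usage example are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WeeklyRun:
    week: str          # label such as an ISO week
    candidates: int    # samples entering the pipeline
    accepted: int      # samples left after filtering
    eval_score: float  # downstream model metric (hypothetical)


def acceptance_rate(run: WeeklyRun) -> float:
    """Fraction of candidates that survived filtering."""
    return run.accepted / run.candidates if run.candidates else 0.0


def quality_deltas(runs: List[WeeklyRun]) -> List[Optional[float]]:
    """Change in eval_score versus the previous run; None for the first."""
    deltas: List[Optional[float]] = [None]
    for prev, cur in zip(runs, runs[1:]):
        deltas.append(cur.eval_score - prev.eval_score)
    return deltas


if __name__ == "__main__":
    # Illustrative numbers only.
    runs = [
        WeeklyRun("week-1", candidates=400, accepted=300, eval_score=0.70),
        WeeklyRun("week-2", candidates=500, accepted=350, eval_score=0.75),
    ]
    for run, delta in zip(runs, quality_deltas(runs)):
        print(run.week, round(acceptance_rate(run), 3), delta)
```

A falling acceptance rate with a flat quality delta suggests the filter is discarding more without improving the model, which is a cue to revisit the operators.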

The WebUI helps teams collaborate on pipeline structure without everyone editing code.

FAQ

Q: Do I need GPUs to start? A: No. The README describes optional GPU/vLLM installs, but you can validate the CLI and pipeline structure first.

Q: Why use uv? A: The README recommends uv for faster installs and reproducible environments.

Q: What should I measure? A: Dataset acceptance rate and downstream model quality deltas across weekly pipeline runs.

Source & Thanks

Source: https://github.com/OpenDCAI/DataFlow
License: Apache-2.0
GitHub stars: 3,485 · forks: 340
