Skills · May 12, 2026 · 2 min read

DataFlow — LLM Data Prep Pipelines + WebUI

DataFlow is an LLM data-prep system with operator pipelines; install via uv, validate with `dataflow -v`, then launch `dataflow webui`.

Agent-ready

This asset can be read and installed directly by agents.

TokRepo exposes a universal CLI command, an install contract, JSON metadata, an adapter-specific plan, and the raw content to help agents assess fit, risk, and next actions.

  • Native · 98/100
  • Policy: allow
  • Agent surface: any MCP/CLI agent
  • Type: Skill
  • Installation: Single
  • Trust: Established
  • Entry point: Asset
  • Universal CLI command: `npx tokrepo install 0cff37ca-924d-50d9-85f8-4f50fbef24cc`

Introduction

  • Best for: teams building domain datasets for fine-tuning, RL, or RAG who want reproducible operator pipelines
  • Works with: Python; uv; optional vLLM extra; WebUI for pipeline building (per README)
  • Setup time: 20–45 minutes

Practical Notes

  • Version check: the README's example `dataflow -v` output reports open-dataflow codebase version: 1.0.0.
  • Recency: the README news dates the WebUI announcement 2026-02-02, so it is a recent workflow surface to standardize on.
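
The `dataflow -v` check mentioned above can be scripted so a setup script or CI job fails early when the CLI is missing. The sketch below is a minimal example, not part of DataFlow: the helper name `cli_version` is hypothetical, and the only thing assumed about the CLI is that `dataflow -v` prints version information, as the README example shows.

```python
import shutil
import subprocess
from typing import Optional


def cli_version(command: str = "dataflow", flag: str = "-v",
                timeout: float = 60.0) -> Optional[str]:
    """Return the CLI's version output, or None if it is unavailable or fails.

    Hypothetical helper for illustration; it only shells out to the command.
    """
    executable = shutil.which(command)
    if executable is None:
        return None  # command is not on PATH in the active environment
    try:
        result = subprocess.run(
            [executable, flag],
            capture_output=True,
            text=True,
            timeout=timeout,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    # Some CLIs print version info to stderr, so fall back to it.
    output = (result.stdout or result.stderr).strip()
    return output or None


if __name__ == "__main__":
    version = cli_version()
    if version is None:
        print("dataflow CLI not found or failed; install it before launching the WebUI")
    else:
        print(version)
```

Returning `None` instead of raising keeps the check easy to use as a gate before running `dataflow webui`.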

Where DataFlow fits in an agent stack

If your team is already doing RAG or fine-tuning, DataFlow is useful when you want repeatable data quality loops:

  1. Generate candidates (from PDFs, logs, Q/A dumps).
  2. Refine with operator transforms.
  3. Evaluate + filter to keep only high-signal items.
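
The three steps above can be sketched in plain Python to make the loop concrete. This is an illustration of the generate, refine, and filter pattern only; it does not use DataFlow's operator API. Every name here (`Sample`, `normalize_whitespace`, `score`, the 0.6 threshold) is a hypothetical stand-in, and the scoring rule is a toy heuristic.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Sample:
    question: str
    answer: str


# An operator is any function that maps one sample to a refined sample.
Operator = Callable[[Sample], Sample]


def generate(raw_pairs: Iterable[Tuple[str, str]]) -> List[Sample]:
    """Step 1: turn raw Q/A dumps into candidate samples."""
    return [Sample(question, answer) for question, answer in raw_pairs]


def normalize_whitespace(sample: Sample) -> Sample:
    """Example operator: collapse runs of whitespace."""
    return Sample(" ".join(sample.question.split()),
                  " ".join(sample.answer.split()))


def refine(samples: Iterable[Sample], operators: List[Operator]) -> List[Sample]:
    """Step 2: apply each operator in order to every sample."""
    refined = []
    for sample in samples:
        for operator in operators:
            sample = operator(sample)
        refined.append(sample)
    return refined


def score(sample: Sample) -> float:
    """Toy quality score: answer length in words, capped at 1.0."""
    return min(len(sample.answer.split()) / 5, 1.0)


def evaluate_and_filter(samples: Iterable[Sample],
                        threshold: float = 0.6) -> List[Sample]:
    """Step 3: keep only samples whose score meets the threshold."""
    return [s for s in samples if score(s) >= threshold]


raw = [
    ("How do I reset  my password?",
     "Open Settings, choose Security, then select Reset password."),
    ("Is there a free tier?", "Yes."),
]
kept = evaluate_and_filter(refine(generate(raw), [normalize_whitespace]))
```

In a real pipeline the scoring step would be replaced by whatever evaluator the team trusts; the point of the sketch is that each stage is a separate, swappable function.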

A minimal first pipeline

  • Pick one narrow domain (e.g., “customer support → product X”).
  • Build a 100–500 sample dataset and run it through the same pipeline weekly.
  • Track two numbers: acceptance rate after filtering, and model quality delta after training or RAG updates.
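
The two numbers in the last bullet are simple to compute once each weekly run is recorded. The sketch below assumes a hypothetical `WeeklyRun` record and an `eval_score` taken from whatever downstream benchmark the team already uses; the figures are made-up placeholders, not results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WeeklyRun:
    week: str
    candidates: int    # samples entering the filter
    accepted: int      # samples kept after filtering
    eval_score: float  # downstream model or RAG quality metric (assumed)


def acceptance_rate(run: WeeklyRun) -> float:
    """Fraction of candidates that survived filtering."""
    if run.candidates == 0:
        return 0.0
    return run.accepted / run.candidates


def quality_delta(previous: WeeklyRun, current: WeeklyRun) -> float:
    """Change in the downstream metric between two consecutive runs."""
    return current.eval_score - previous.eval_score


# Placeholder numbers for illustration only.
runs = [
    WeeklyRun("week-1", candidates=400, accepted=260, eval_score=0.50),
    WeeklyRun("week-2", candidates=400, accepted=300, eval_score=0.75),
]

for previous, current in zip(runs, runs[1:]):
    print(f"{current.week}: acceptance={acceptance_rate(current):.2f}, "
          f"delta={quality_delta(previous, current):+.2f}")
```

Keeping the pipeline fixed between runs is what makes these numbers comparable week to week.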

The WebUI helps teams collaborate on pipeline structure without everyone editing code.

FAQ

Q: Do I need GPUs to start? A: No. The README describes optional GPU/vLLM installs, but you can validate the CLI and pipeline structure first.

Q: Why use uv? A: The README recommends uv for faster installs and reproducible environments.

Q: What should I measure? A: Dataset acceptance rate and downstream model quality deltas across weekly pipeline runs.

Source and acknowledgments

Source: https://github.com/OpenDCAI/DataFlow
License: Apache-2.0
GitHub stars: 3,485 · forks: 340
