Skills · May 12, 2026 · 2 min read

DataFlow — LLM Data Prep Pipelines + WebUI

DataFlow is an LLM data-prep system with operator pipelines; install via uv, validate with `dataflow -v`, then launch `dataflow webui`.

Agent-ready

Agents can read and install this asset directly.

TokRepo exposes a universal CLI command, an install contract, JSON metadata, an adapter-specific plan, and raw content so that agents can assess compatibility, risk, and next steps.

Native · 98/100 · Policy: allow
Agent surface: Any MCP/CLI agent
Type: Skill
Installation: Single
Trust: Established
Entry: Asset
Universal CLI command: npx tokrepo install 0cff37ca-924d-50d9-85f8-4f50fbef24cc
Introduction

  • Best for: teams building domain datasets for fine-tuning, RL, or RAG who want reproducible operator pipelines
  • Works with: Python; uv; optional vLLM extra; WebUI for pipeline building (per README)
  • Setup time: 20–45 minutes

Practical Notes

  • Quant: the README's sample `dataflow -v` output reports `open-dataflow codebase version: 1.0.0`. Treat that number as an example; your installed version may differ.
  • Quant: the README's news section dates the WebUI announcement to 2026-02-02, so it is a recent addition to standardize a team workflow on.
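
As a quick post-install check, the version line quoted above can be parsed from the CLI output. The following is a minimal sketch, assuming the `dataflow` command is on your PATH and that its output still contains the `open-dataflow codebase version: ...` line shown in the README. The helper names `parse_dataflow_version` and `installed_dataflow_version` are hypothetical and not part of DataFlow.

```python
import re
import shutil
import subprocess
from typing import Optional

# Assumption: the version line keeps the format quoted from the README example.
VERSION_RE = re.compile(r"open-dataflow codebase version:\s*([0-9A-Za-z.+-]+)")


def parse_dataflow_version(output: str) -> Optional[str]:
    """Return the version found in `dataflow -v` output, or None if absent."""
    match = VERSION_RE.search(output)
    return match.group(1) if match else None


def installed_dataflow_version() -> Optional[str]:
    """Run `dataflow -v` if the CLI is available and return the parsed version."""
    if shutil.which("dataflow") is None:
        return None  # CLI not installed or not on PATH
    result = subprocess.run(
        ["dataflow", "-v"], capture_output=True, text=True, check=False
    )
    # Search both streams, since the sketch does not assume which one is used.
    return parse_dataflow_version(result.stdout + result.stderr)
```

Calling `installed_dataflow_version()` in a setup script or CI job gives a single value to log before anyone launches `dataflow webui`; a `None` result means the install step needs another look.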

Where DataFlow fits in an agent stack

If your team is already doing RAG or fine-tuning, DataFlow is useful when you want repeatable data quality loops:

  1. Generate candidates (from PDFs, logs, Q/A dumps).
  2. Refine with operator transforms.
  3. Evaluate and filter to keep only high-signal items.
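
The three steps above can be sketched in plain Python to make the shape of the loop concrete. This is an illustration only and does not use DataFlow's operator API: `Sample`, `strip_whitespace`, `score`, and `run_pipeline` are hypothetical names, and the scoring rule is a toy stand-in for a real evaluator.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Sample:
    question: str
    answer: str


# A refine operator takes one sample and returns a transformed sample.
Operator = Callable[[Sample], Sample]


def strip_whitespace(sample: Sample) -> Sample:
    """Example refine step: trim surrounding whitespace from both fields."""
    return Sample(sample.question.strip(), sample.answer.strip())


def score(sample: Sample) -> float:
    """Toy quality score: answer length in words, capped at 1.0 for 20+ words."""
    return min(len(sample.answer.split()) / 20.0, 1.0)


def run_pipeline(
    candidates: Iterable[Sample], operators: List[Operator], min_score: float
) -> List[Sample]:
    """Refine each candidate with every operator, then keep those scoring >= min_score."""
    kept: List[Sample] = []
    for sample in candidates:
        for op in operators:
            sample = op(sample)
        if score(sample) >= min_score:
            kept.append(sample)
    return kept
```

Keeping the operators as an ordered list of small functions is what makes the run repeatable: the same list applied to next week's candidates yields comparable results.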

A minimal first pipeline

  • Pick one narrow domain (e.g., “customer support → product X”).
  • Build a 100–500 sample dataset and run it through the same pipeline weekly.
  • Track two numbers: acceptance rate after filtering, and model quality delta after training or RAG updates.
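
The two numbers in the last bullet reduce to simple arithmetic. Here is a minimal sketch, assuming you record one row per weekly run; the `WeeklyRun` fields, including `eval_score` as whatever downstream metric you already use, are hypothetical and not something DataFlow produces for you.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WeeklyRun:
    week: str          # label for the run, e.g. "2026-W20"
    candidates: int    # samples entering the filter step
    accepted: int      # samples kept after filtering
    eval_score: float  # downstream model or RAG metric (assumed: higher is better)


def acceptance_rate(run: WeeklyRun) -> float:
    """Fraction of candidates kept after filtering; 0.0 for an empty run."""
    if run.candidates == 0:
        return 0.0
    return run.accepted / run.candidates


def quality_deltas(runs: List[WeeklyRun]) -> List[float]:
    """Week-over-week change in eval_score, in the order the runs are given."""
    return [
        round(curr.eval_score - prev.eval_score, 6)
        for prev, curr in zip(runs, runs[1:])
    ]
```

A falling acceptance rate alongside a flat or negative quality delta suggests the candidate sources or the filter need attention before the pipeline is scaled up.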

The WebUI helps teams collaborate on pipeline structure without everyone editing code.

FAQ

Q: Do I need GPUs to start? A: No. The README describes optional GPU/vLLM installs, but you can validate the CLI and pipeline structure first.

Q: Why use uv? A: The README recommends uv for faster installs and reproducible environments.

Q: What should I measure? A: Dataset acceptance rate and downstream model quality deltas across weekly pipeline runs.

Source and acknowledgments

Source: https://github.com/OpenDCAI/DataFlow · License: Apache-2.0 · GitHub stars: 3,485 · forks: 340
