Scripts · May 19, 2026 · 3 min read

Data Juicer — Data Processing Pipeline for Foundation Models

Data Juicer is a data processing toolkit designed for building and curating training datasets for large language models and multimodal models. It provides over 100 composable operators for filtering, deduplication, and quality analysis of text, image, audio, and video data.

Agent-ready

This asset can be read and installed directly by agents.

TokRepo exposes a universal CLI command, an install contract, JSON metadata, a per-adapter plan, and raw content so that agents can evaluate compatibility, risk, and next steps.

Native · 98/100 · Policy: allow
Agent surface: Any MCP/CLI agent
Type: Skill
Installation: Single
Trust: Established
Entry: Data Juicer Overview
Universal CLI command:
npx tokrepo install b8e5b8fc-5318-11f1-9bc6-00163e2b0d79

Introduction

Data Juicer is an open-source system for producing high-quality training data for foundation models. It offers a library of reusable operators for cleaning, filtering, deduplicating, and analyzing data across text, image, audio, and video modalities.

What Data Juicer Does

  • Cleans and filters large-scale training datasets with over 100 built-in operators
  • Deduplicates data using MinHash, SimHash, or exact matching at scale
  • Analyzes data quality with statistics and visualization tools
  • Processes multimodal data including text, images, audio, and video in unified pipelines
  • Supports distributed execution via Ray for handling terabyte-scale datasets
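
The near-duplicate detection mentioned in the list above can be illustrated with a toy MinHash in plain Python. This is a minimal sketch of the general technique, not Data Juicer's implementation: the function names, the 3-word shingles, the 64 hash functions, and the 0.7 threshold are all assumptions chosen for illustration.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(text, num_perm=64, k=3):
    """Build a signature: the minimum hash of all shingles, once per seed."""
    shingle_set = shingles(text, k)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingle_set
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions, which estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def deduplicate(texts, threshold=0.7):
    """Keep the first text of each near-duplicate group (pairwise, O(n^2))."""
    kept, kept_signatures = [], []
    for text in texts:
        sig = minhash_signature(text)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_signatures):
            kept.append(text)
            kept_signatures.append(sig)
    return kept


if __name__ == "__main__":
    docs = [
        "the quick brown fox jumps over the lazy dog",
        "the quick brown fox jumps over the lazy dog",
        "data juicer curates multimodal training corpora for foundation models",
    ]
    for doc in deduplicate(docs):
        print(doc)
```

The pairwise comparison here is quadratic in the number of documents. Systems that deduplicate at scale typically pair MinHash signatures with locality-sensitive hashing so that only likely matches are compared.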

Architecture Overview

Data Juicer organizes processing as a pipeline of composable operators defined in a YAML recipe. Operators fall into categories such as Formatter, Mapper, Filter, and Deduplicator. The execution engine runs locally for small datasets or distributes work across a Ray cluster for large-scale processing. Intermediate results can be checkpointed so that a pipeline resumes after a failure instead of starting over.
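
To make the operator categories concrete, here is a toy pipeline in plain Python. It is a sketch of the composition idea only: the `Mapper`, `Filter`, and `Deduplicator` classes and the `run_pipeline` helper are hypothetical stand-ins written for this example and do not reflect Data Juicer's actual classes or signatures.

```python
from dataclasses import dataclass
from typing import Callable

Sample = dict  # one record, e.g. {"text": "..."}


@dataclass
class Mapper:
    """Transforms each sample and returns the edited copy."""
    fn: Callable[[Sample], Sample]

    def run(self, samples):
        return [self.fn(dict(s)) for s in samples]


@dataclass
class Filter:
    """Keeps only the samples for which the predicate is true."""
    keep: Callable[[Sample], bool]

    def run(self, samples):
        return [s for s in samples if self.keep(s)]


@dataclass
class Deduplicator:
    """Keeps the first sample seen for each key (exact matching)."""
    key: Callable[[Sample], str]

    def run(self, samples):
        seen, unique = set(), []
        for s in samples:
            k = self.key(s)
            if k not in seen:
                seen.add(k)
                unique.append(s)
        return unique


def run_pipeline(samples, operators):
    """Apply the operators in order, feeding each one the previous output."""
    for op in operators:
        samples = op.run(samples)
    return samples


pipeline = [
    Mapper(lambda s: {**s, "text": " ".join(s["text"].split())}),  # collapse whitespace
    Filter(lambda s: len(s["text"]) >= 10),                        # drop very short texts
    Deduplicator(lambda s: s["text"].lower()),                     # case-insensitive exact match
]

raw = [
    {"text": "Hello   world, this is a sample."},
    {"text": "hello world, this is a sample."},
    {"text": "too short"},
]

cleaned = run_pipeline(raw, pipeline)
print(cleaned)  # [{'text': 'Hello world, this is a sample.'}]
```

Because every operator takes a list of samples and returns one, the order of the list is the whole pipeline definition, which is what makes a declarative recipe file a natural fit.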

Self-Hosting & Configuration

  • Install via pip and define processing pipelines in YAML recipe files
  • Specify input datasets in JSON, Parquet, or Hugging Face format
  • Chain operators for language detection, text length filtering, perplexity scoring, and more
  • Configure Ray for distributed processing across multiple nodes
  • Use the built-in analysis tools to visualize data distributions before and after processing
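
The configuration steps above might come together in a recipe like the one this short script writes to disk. Treat it as a sketch under assumptions: the paths and project name are hypothetical, and the key names and operator names are modelled on published Data Juicer examples, so they should be checked against the operator documentation for the installed version.

```python
from pathlib import Path

# Illustrative recipe text. Key names, operator names, and parameter values
# are assumptions to verify against the installed Data Juicer version.
RECIPE = """\
project_name: demo-clean            # hypothetical project name
dataset_path: ./data/raw.jsonl      # hypothetical input file
export_path: ./data/cleaned.jsonl   # hypothetical output file
np: 4                               # number of worker processes

process:
  - language_id_score_filter:       # language detection
      lang: en
      min_score: 0.8
  - text_length_filter:             # text length filtering
      min_len: 50
      max_len: 10000
  - perplexity_filter:              # perplexity scoring
      lang: en
      max_ppl: 1500
  - document_deduplicator:          # exact-match deduplication
      lowercase: true
"""


def write_recipe(path="recipe.yaml"):
    """Write the illustrative recipe to disk and return its path."""
    target = Path(path)
    target.write_text(RECIPE, encoding="utf-8")
    return target


if __name__ == "__main__":
    print(f"Wrote {write_recipe()}")
```

Assuming the package is installed from PyPI as `py-data-juicer` and exposes the `dj-process` entry point described in the project README, the recipe would then be run with `dj-process --config recipe.yaml`.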

Key Features

  • 100+ built-in operators covering text, image, audio, and video modalities
  • YAML-based recipes for reproducible and shareable data processing pipelines
  • Scales from single-machine to distributed clusters using Ray
  • Data quality analysis with visual reports for informed operator selection
  • Supports synthetic data generation and data mixing strategies for training recipes

Comparison with Similar Tools

  • Dolma — AI2 toolkit focused on text-only web data; Data Juicer handles multimodal data
  • RedPajama — Provides curated datasets but not a general-purpose processing framework
  • Unstructured — Focused on document parsing and extraction, not training data curation
  • Datatrove — Hugging Face text processing library; Data Juicer adds multimodal support and analysis
  • NeMo Curator — NVIDIA toolkit tightly coupled with NeMo; Data Juicer is framework-agnostic

FAQ

Q: Does Data Juicer work with non-English data? A: Yes. Many operators support multilingual text, and language detection is built in.

Q: Can I add custom operators? A: Yes. You can write custom operators in Python and register them for use in YAML recipes.
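
The register-then-reference pattern behind custom operators can be sketched in a few lines. Everything here is hypothetical and written for illustration: the `OPERATORS` dictionary, the `register` decorator, the `MinWordCountFilter` class, and `build_operators` are not Data Juicer's API, whose base classes and registration calls should be taken from its developer guide.

```python
OPERATORS = {}  # hypothetical registry: operator name -> operator class


def register(name):
    """Class decorator that makes an operator available under a recipe name."""
    def decorator(cls):
        if name in OPERATORS:
            raise ValueError(f"operator {name!r} is already registered")
        OPERATORS[name] = cls
        return cls
    return decorator


@register("min_word_count_filter")
class MinWordCountFilter:
    """Keeps samples whose text has at least `min_words` words."""

    def __init__(self, min_words=5):
        self.min_words = min_words

    def keep(self, sample):
        return len(sample["text"].split()) >= self.min_words


def build_operators(process):
    """Instantiate operators from a recipe-style list of {name: params} entries."""
    operators = []
    for entry in process:
        for name, params in entry.items():
            operators.append(OPERATORS[name](**(params or {})))
    return operators


# The list below mirrors the shape of a `process:` section after parsing.
process = [{"min_word_count_filter": {"min_words": 3}}]
ops = build_operators(process)

samples = [{"text": "one two"}, {"text": "one two three four"}]
kept = [s for s in samples if all(op.keep(s) for op in ops)]
print(kept)  # [{'text': 'one two three four'}]
```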

Q: What scale can Data Juicer handle? A: With Ray, it has been tested on datasets exceeding 100 billion tokens.

Q: Is Data Juicer only for LLM training data? A: No. Although it was designed for foundation model data, its operators are general enough for many other data cleaning and analysis tasks.
