Data Juicer — Data Processing Pipeline for Foundation Models

Introduction

Data Juicer is an open-source system for producing high-quality training data for foundation models. It offers a library of reusable operators for cleaning, filtering, deduplicating, and analyzing data across text, image, audio, and video modalities.

What Data Juicer Does

Cleans and filters large-scale training datasets with over 100 built-in operators
Deduplicates data using MinHash, SimHash, or exact matching at scale
Analyzes data quality with statistics and visualization tools
Processes multimodal data including text, images, audio, and video in unified pipelines
Supports distributed execution via Ray for handling terabyte-scale datasets
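
To make the MinHash bullet above concrete, here is a minimal pure-Python sketch of near-duplicate detection. It is not Data Juicer's implementation: the function names (shingles, minhash_signature, deduplicate), the 5-character shingle size, and the 0.8 threshold are all assumptions chosen for illustration. It also compares every pair of signatures, which a production system would likely replace with locality-sensitive hashing.

```python
import hashlib
from typing import List, Set


def shingles(text: str, k: int = 5) -> Set[str]:
    """Return the character k-grams of a lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    if len(normalized) <= k:
        return {normalized}
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


def minhash_signature(text: str, num_perm: int = 64) -> List[int]:
    """For each seeded hash function, keep the minimum hash over all shingles."""
    grams = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def deduplicate(texts: List[str], threshold: float = 0.8, num_perm: int = 64) -> List[str]:
    """Keep the first text of each near-duplicate group (pairwise, demo-scale only)."""
    kept: List[str] = []
    kept_signatures: List[List[int]] = []
    for text in texts:
        signature = minhash_signature(text, num_perm)
        if all(estimated_jaccard(signature, other) < threshold for other in kept_signatures):
            kept.append(text)
            kept_signatures.append(signature)
    return kept
```

Calling deduplicate on a list that contains the same sentence twice with different casing or spacing keeps only the first copy, because both normalize to the same shingle set and therefore the same signature.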

Architecture Overview

Data Juicer organizes processing as a pipeline of composable operators defined in a YAML recipe. Each operator is categorized as a Formatter, Mapper, Filter, or Deduplicator. The execution engine can run locally for small datasets or distribute work across a Ray cluster for large-scale processing. Intermediate results are checkpointed so pipelines can resume after failures.
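
The operator pipeline and checkpointing described above can be sketched in a few lines of plain Python. This is an illustration of the idea only, not Data Juicer's engine: the class names mirror the operator categories, but their constructors, the run method, the run_pipeline helper, and the step_N.json checkpoint layout are hypothetical. The resume logic also assumes the input and the operator list are unchanged between runs.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict, Hashable, List

Sample = Dict[str, Any]


def format_samples(raw_texts: List[str]) -> List[Sample]:
    """Formatter stage: load raw inputs into one uniform sample schema."""
    return [{"text": text} for text in raw_texts]


class Mapper:
    """Rewrites every sample, for example to normalize text."""

    def __init__(self, fn: Callable[[Sample], Sample]) -> None:
        self.fn = fn

    def run(self, samples: List[Sample]) -> List[Sample]:
        return [self.fn(dict(sample)) for sample in samples]


class Filter:
    """Keeps only the samples for which the predicate returns True."""

    def __init__(self, keep: Callable[[Sample], bool]) -> None:
        self.keep = keep

    def run(self, samples: List[Sample]) -> List[Sample]:
        return [sample for sample in samples if self.keep(sample)]


class Deduplicator:
    """Keeps the first sample seen for each key (exact matching)."""

    def __init__(self, key: Callable[[Sample], Hashable]) -> None:
        self.key = key

    def run(self, samples: List[Sample]) -> List[Sample]:
        seen = set()
        unique = []
        for sample in samples:
            k = self.key(sample)
            if k not in seen:
                seen.add(k)
                unique.append(sample)
        return unique


def run_pipeline(samples: List[Sample], operators: list, checkpoint_dir: Path) -> List[Sample]:
    """Run operators in order, saving each step so a later run can resume."""
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    start = 0
    # Resume from the latest completed step, if a checkpoint exists.
    for i in range(len(operators), 0, -1):
        path = checkpoint_dir / f"step_{i}.json"
        if path.exists():
            samples = json.loads(path.read_text(encoding="utf-8"))
            start = i
            break
    for i, op in enumerate(operators[start:], start=start + 1):
        samples = op.run(samples)
        (checkpoint_dir / f"step_{i}.json").write_text(json.dumps(samples), encoding="utf-8")
    return samples


operators = [
    Mapper(lambda s: {**s, "text": " ".join(s["text"].lower().split())}),
    Filter(lambda s: len(s["text"]) >= 5),
    Deduplicator(lambda s: s["text"]),
]
raw = ["  Hello   world ", "hello world", "hi", "Data Juicer cleans data"]

with tempfile.TemporaryDirectory() as tmp:
    result = run_pipeline(format_samples(raw), operators, Path(tmp))
    # The second call finds step_3.json and returns it without re-running operators.
    resumed = run_pipeline(format_samples(raw), operators, Path(tmp))
```

Keeping each operator behind the same run interface is what lets the runner treat cleaning, filtering, and deduplication uniformly and checkpoint between any two steps.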

Self-Hosting & Configuration

Install via pip and define processing pipelines in YAML recipe files
Specify input datasets in JSON, Parquet, or Hugging Face format
Chain operators for language detection, text length filtering, perplexity scoring, and more
Configure Ray for distributed processing across multiple nodes
Use the built-in analysis tools to visualize data distributions before and after processing
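
As a rough illustration of the steps above, the following Python helper renders a small recipe as YAML text. Everything specific in it is an assumption to verify against the documentation for your installed version: the top-level keys (project_name, dataset_path, export_path, np, process), the operator names, their parameters and thresholds, the package name in the install comment, and the command in the run comment. The file paths are placeholders.

```python
from pathlib import Path


def build_recipe(dataset_path: str, export_path: str, num_proc: int = 4) -> str:
    """Render a hypothetical Data Juicer recipe as YAML text.

    Key names, operator names, and parameters are assumptions; check them
    against the operator list shipped with your Data Juicer version.
    """
    return f"""\
project_name: demo-clean
dataset_path: {dataset_path}
export_path: {export_path}
np: {num_proc}

process:
  - whitespace_normalization_mapper:
  - language_id_score_filter:
      lang: en
      min_score: 0.8
  - text_length_filter:
      min_len: 50
      max_len: 10000
  - perplexity_filter:
      lang: en
      max_ppl: 1500
  - document_minhash_deduplicator:
      jaccard_threshold: 0.7
"""


def write_recipe(path: Path, recipe_text: str) -> Path:
    """Write the recipe to disk so it can be passed to the processing command."""
    path.write_text(recipe_text, encoding="utf-8")
    return path


# Assumed workflow (verify the package and command names before relying on them):
#   pip install py-data-juicer
#   dj-process --config recipe.yaml
recipe_text = build_recipe("./data/input.jsonl", "./outputs/cleaned.jsonl")
```

Generating the recipe from a function keeps dataset paths and worker counts in one place, while the operator chain itself stays readable as ordinary YAML.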

Key Features

100+ built-in operators covering text, image, audio, and video modalities
YAML-based recipes for reproducible and shareable data processing pipelines
Scales from single-machine to distributed clusters using Ray
Data quality analysis with visual reports for informed operator selection
Supports synthetic data generation and data mixing strategies for training recipes

Comparison with Similar Tools

Dolma — AI2 toolkit focused on text-only web data; Data Juicer handles multimodal data
RedPajama — Provides curated datasets but not a general-purpose processing framework
Unstructured — Focused on document parsing and extraction, not training data curation
Datatrove — Hugging Face text processing library; Data Juicer adds multimodal support and analysis
NeMo Curator — NVIDIA toolkit tightly coupled with NeMo; Data Juicer is framework-agnostic

FAQ

Q: Does Data Juicer work with non-English data? A: Yes. Many operators support multilingual text, and language detection is built in.

Q: Can I add custom operators? A: Yes. You can write custom operators in Python and register them for use in YAML recipes.
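
The register-then-reference pattern behind custom operators can be sketched as follows. This is a self-contained illustration, not Data Juicer's actual API: OPERATOR_REGISTRY, register_operator, the Operator base class, and the strip_urls_mapper operator are all hypothetical names, and Data Juicer's own base classes and registration decorator should be taken from its developer documentation.

```python
import re
from typing import Dict, List, Optional, Type

# Hypothetical registry mapping recipe names to operator classes.
OPERATOR_REGISTRY: Dict[str, Type["Operator"]] = {}


def register_operator(name: str):
    """Class decorator that makes an operator available under a recipe name."""
    def decorator(cls):
        if name in OPERATOR_REGISTRY:
            raise ValueError(f"operator {name!r} is already registered")
        OPERATOR_REGISTRY[name] = cls
        return cls
    return decorator


class Operator:
    """Minimal base class: one sample in, one sample out."""

    def process(self, sample: dict) -> dict:
        raise NotImplementedError


@register_operator("strip_urls_mapper")
class StripUrlsMapper(Operator):
    """Custom operator that replaces http(s) URLs in the text field."""

    def __init__(self, replacement: str = "") -> None:
        self.replacement = replacement

    def process(self, sample: dict) -> dict:
        cleaned = re.sub(r"https?://\S+", self.replacement, sample["text"])
        return {**sample, "text": cleaned}


def build_from_recipe(process: List[Dict[str, Optional[dict]]]) -> List[Operator]:
    """Instantiate operators from a parsed recipe list of {name: params} entries."""
    operators = []
    for entry in process:
        (name, params), = entry.items()
        operators.append(OPERATOR_REGISTRY[name](**(params or {})))
    return operators


ops = build_from_recipe([{"strip_urls_mapper": {"replacement": "<URL>"}}])
cleaned_sample = ops[0].process({"text": "see https://example.com/page now"})
```

Once a class is registered under a name, a recipe only needs that name and its parameters, which is what keeps custom operators usable from YAML without code changes in the runner.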

Q: What scale can Data Juicer handle? A: With Ray, it has been tested on datasets exceeding 100 billion tokens.

Q: Is Data Juicer only for LLM training data? A: No. Although it was designed for foundation model data, its operators are general enough for many other data cleaning and analysis tasks.
