Introduction
Data Juicer is an open-source system for producing high-quality training data for foundation models. It offers a library of reusable operators for cleaning, filtering, deduplicating, and analyzing data across text, image, audio, and video modalities.
What Data Juicer Does
- Cleans and filters large-scale training datasets with over 100 built-in operators
- Deduplicates data using MinHash, SimHash, or exact matching at scale
- Analyzes data quality with statistics and visualization tools
- Processes multimodal data including text, images, audio, and video in unified pipelines
- Supports distributed execution via Ray for handling terabyte-scale datasets
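To make the MinHash option above concrete, here is a minimal sketch of near-duplicate detection using only the Python standard library. It is not Data Juicer's implementation: the function names, the word 3-gram shingling, and the 128-hash signature size are all assumptions chosen for illustration.

```python
import hashlib


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in a text."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash64(item, salt):
    """Hash a shingle to a 64-bit integer; each salt acts as a separate hash function."""
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(text, num_hashes=128):
    """For each salted hash function, keep the smallest hash over all shingles."""
    items = shingles(text)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")
        signature.append(min(_hash64(item, salt) for item in items))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog near the river bank today"
doc_b = "the quick brown fox jumps over the lazy dog near the river bank tonight"
doc_c = "stock markets closed higher after the central bank held interest rates steady"

near_duplicate_score = estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_b))
unrelated_score = estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_c))
```

A deduplicator built on this idea would drop one document of any pair whose estimated similarity exceeds a chosen threshold. At scale, signatures are usually bucketed with locality-sensitive hashing so that not every pair has to be compared.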
Architecture Overview
Data Juicer organizes processing as a pipeline of composable operators defined in a YAML recipe. Each operator is categorized as a Formatter, Mapper, Filter, or Deduplicator. The execution engine can run locally for small datasets or distribute work across a Ray cluster for large-scale processing. Intermediate results are checkpointed so pipelines can resume after failures.
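The pipeline-and-checkpoint idea described above can be sketched in plain Python. This is an illustration under stated assumptions, not Data Juicer's engine: the three operator functions, the `run_pipeline` helper, and the JSON checkpoint layout are hypothetical names invented for this example.

```python
import json
import os
import tempfile


def whitespace_mapper(samples):
    """Mapper: transform each sample (here, collapse runs of whitespace)."""
    return [{**s, "text": " ".join(s["text"].split())} for s in samples]


def min_length_filter(samples, min_words=3):
    """Filter: keep only samples that satisfy a predicate."""
    return [s for s in samples if len(s["text"].split()) >= min_words]


def exact_deduplicator(samples):
    """Deduplicator: drop samples whose text has already been seen."""
    seen, kept = set(), []
    for s in samples:
        if s["text"] not in seen:
            seen.add(s["text"])
            kept.append(s)
    return kept


def run_pipeline(samples, operators, checkpoint_path):
    """Apply operators in order, saving progress after each one."""
    start = 0
    if os.path.exists(checkpoint_path):
        # Resume: reload the intermediate result and skip finished operators.
        with open(checkpoint_path, encoding="utf-8") as f:
            state = json.load(f)
        start, samples = state["next_op"], state["samples"]
    for index in range(start, len(operators)):
        samples = operators[index](samples)
        with open(checkpoint_path, "w", encoding="utf-8") as f:
            json.dump({"next_op": index + 1, "samples": samples}, f)
    return samples


raw = [
    {"text": "Data   Juicer cleans  training data"},
    {"text": "Data Juicer cleans training data"},
    {"text": "too short"},
]
ops = [whitespace_mapper, min_length_filter, exact_deduplicator]

with tempfile.TemporaryDirectory() as tmp:
    ckpt = os.path.join(tmp, "checkpoint.json")
    cleaned = run_pipeline(raw, ops, ckpt)
    # A second run finds the checkpoint and has no operators left to apply.
    resumed = run_pipeline(raw, ops, ckpt)
```

The order of operators matters here: normalizing whitespace first lets the exact deduplicator recognize the first two samples as identical.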
Self-Hosting & Configuration
- Install via pip and define processing pipelines in YAML recipe files
- Specify input datasets in JSON, Parquet, or Hugging Face Datasets format
- Chain operators for language detection, text length filtering, perplexity scoring, and more
- Configure Ray for distributed processing across multiple nodes
- Use the built-in analysis tools to visualize data distributions before and after processing
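As a rough illustration of comparing data distributions before and after processing, the sketch below summarizes word counts around a text length filter. It uses only the standard library and does not call Data Juicer's analysis tools; `summarize`, `text_length_filter`, and the thresholds are hypothetical choices for this example.

```python
import statistics


def word_count(sample):
    """Per-sample statistic: number of whitespace-separated words."""
    return len(sample["text"].split())


def summarize(samples):
    """Summarize the word-count distribution of a non-empty dataset."""
    counts = [word_count(s) for s in samples]
    return {
        "samples": len(counts),
        "mean_words": statistics.mean(counts),
        "median_words": statistics.median(counts),
        "min_words": min(counts),
        "max_words": max(counts),
    }


def text_length_filter(samples, min_words, max_words):
    """Keep samples whose word count falls inside the closed range."""
    return [s for s in samples if min_words <= word_count(s) <= max_words]


dataset = [
    {"text": "ok"},
    {"text": "a short but usable training sentence"},
    {"text": "another reasonable sample with enough words in it"},
    {"text": "word " * 50},
]

before = summarize(dataset)
after = summarize(text_length_filter(dataset, min_words=4, max_words=20))
```

Looking at the `before` summary first is what informs the thresholds: the minimum and maximum show the outliers that the filter is meant to remove.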
Key Features
- 100+ built-in operators covering text, image, audio, and video modalities
- YAML-based recipes for reproducible and shareable data processing pipelines
- Scales from single-machine to distributed clusters using Ray
- Data quality analysis with visual reports for informed operator selection
- Supports synthetic data generation and data mixing strategies for training recipes
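One simple data mixing strategy is weighted sampling across sources. The sketch below is an assumption about what such a strategy could look like, not a description of Data Juicer's mixing features; `mix_datasets`, the source names, and the weights are hypothetical.

```python
import random


def mix_datasets(sources, weights, total, seed=0):
    """Draw `total` samples with replacement, choosing a source per draw by weight."""
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    names = list(sources)
    source_weights = [weights[name] for name in names]
    mixed = []
    for _ in range(total):
        name = rng.choices(names, weights=source_weights, k=1)[0]
        mixed.append({"source": name, "text": rng.choice(sources[name])})
    return mixed


sources = {
    "web": ["web page one", "web page two", "web page three"],
    "code": ["def f(): pass", "print('hi')"],
}
mixed = mix_datasets(sources, weights={"web": 0.8, "code": 0.2}, total=1000)
web_share = sum(item["source"] == "web" for item in mixed) / len(mixed)
```

Sampling with replacement keeps the example short; a real training mix would more likely sample without replacement or cap the number of repeats per source.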
Comparison with Similar Tools
- Dolma — AI2 toolkit focused on text-only web data; Data Juicer handles multimodal data
- RedPajama — Provides curated datasets but not a general-purpose processing framework
- Unstructured — Focused on document parsing and extraction, not training data curation
- Datatrove — Hugging Face text processing library; Data Juicer adds multimodal support and analysis
- NeMo Curator — NVIDIA toolkit tightly coupled with NeMo; Data Juicer is framework-agnostic
FAQ
Q: Does Data Juicer work with non-English data? A: Yes. Many operators support multilingual text, and language detection is built in.
Q: Can I add custom operators? A: Yes. You can write custom operators in Python and register them for use in YAML recipes.
Q: What scale can Data Juicer handle? A: With Ray, it has been tested on datasets exceeding 100 billion tokens.
Q: Is Data Juicer only for LLM training data? A: No. Although it is designed for foundation model data, its operators are general enough for other data cleaning and analysis tasks.
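The custom operator answer above says operators are written in Python and registered so recipes can refer to them by name. The sketch below shows that pattern in a generic form. The registry, the `register` decorator, the `keep` method, and the `max_digit_ratio_filter` operator are hypothetical stand-ins; Data Juicer's own base classes and registration mechanism are described in the project's documentation and may differ. The `recipe` dictionary is an assumed parsed form of a YAML recipe.

```python
OPERATORS = {}  # hypothetical registry: recipe-visible name -> operator class


def register(name):
    """Decorator that stores an operator class under a recipe-visible name."""
    def decorator(cls):
        OPERATORS[name] = cls
        return cls
    return decorator


@register("max_digit_ratio_filter")
class MaxDigitRatioFilter:
    """Hypothetical custom filter: drop samples dominated by digits."""

    def __init__(self, max_ratio=0.3):
        self.max_ratio = max_ratio

    def keep(self, sample):
        text = sample["text"]
        digits = sum(ch.isdigit() for ch in text)
        return digits / max(len(text), 1) <= self.max_ratio


def build_operators(recipe):
    """Instantiate each named operator with the parameters given in the recipe."""
    ops = []
    for step in recipe["process"]:
        for name, params in step.items():
            ops.append(OPERATORS[name](**params))
    return ops


# Assumed parsed recipe: an ordered list of {operator_name: parameters} steps.
recipe = {"process": [{"max_digit_ratio_filter": {"max_ratio": 0.2}}]}

samples = [
    {"text": "call 555 0100 0199 now"},
    {"text": "clean prose without numbers"},
]
ops = build_operators(recipe)
kept = [s for s in samples if all(op.keep(s) for op in ops)]
```

Looking operators up by name is what lets a recipe stay declarative: adding a new operator means registering one class, with no change to the code that reads the recipe.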