Scripts · May 19, 2026 · 1 min read

Data Juicer — Data Processing Pipeline for Foundation Models

Data Juicer is a data processing toolkit designed for building and curating training datasets for large language models and multimodal models. It provides over 100 composable operators for filtering, deduplication, and quality analysis of text, image, audio, and video data.

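Agent Ready

This asset can be read and installed directly by agents.

TokRepo provides a universal CLI command, an install contract, metadata JSON, adapter-specific install plans, and raw content links, so that agents can assess fit, risk, and next steps.

Native · 98/100 · Policy: Allow
Agent entry: Any MCP/CLI agent
Type: Skill
Install: Single
Trust level: Established
Entry: Data Juicer Overview

Universal CLI install command:
npx tokrepo install b8e5b8fc-5318-11f1-9bc6-00163e2b0d79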

Introduction

Data Juicer is an open-source system for producing high-quality training data for foundation models. It offers a library of reusable operators for cleaning, filtering, deduplicating, and analyzing data across text, image, audio, and video modalities.

What Data Juicer Does

  • Cleans and filters large-scale training datasets with over 100 built-in operators
  • Deduplicates data using MinHash, SimHash, or exact matching at scale
  • Analyzes data quality with statistics and visualization tools
  • Processes multimodal data including text, images, audio, and video in unified pipelines
  • Supports distributed execution via Ray for handling terabyte-scale datasets
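To make the MinHash-based deduplication mentioned above more concrete, here is a minimal standard-library sketch of the general technique. It is not Data Juicer's implementation: the function names (`shingles`, `minhash_signature`, `deduplicate`), the shingle size, the number of hash functions, and the 0.8 threshold are all assumptions chosen for illustration. It also compares every document against every kept document, whereas large-scale systems typically add locality-sensitive hashing to avoid that quadratic cost.

```python
import hashlib
from typing import List, Set


def shingles(text: str, k: int = 5) -> Set[str]:
    """Split lowercased, whitespace-normalized text into overlapping k-character shingles."""
    normalized = " ".join(text.lower().split())
    if len(normalized) <= k:
        return {normalized}
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


def _hash(gram: str, salt: bytes) -> int:
    """64-bit salted hash; each distinct salt acts as one hash function."""
    digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(text: str, num_hashes: int = 64, k: int = 5) -> List[int]:
    """Keep the minimum hash value of the shingle set under each hash function."""
    grams = shingles(text, k)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(_hash(g, salt) for g in grams))
    return signature


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs: List[str], threshold: float = 0.8) -> List[str]:
    """Keep a document only if it is not a near-duplicate of one already kept."""
    kept: List[str] = []
    kept_signatures: List[List[int]] = []
    for doc in docs:
        signature = minhash_signature(doc)
        if all(estimated_jaccard(signature, s) < threshold for s in kept_signatures):
            kept.append(doc)
            kept_signatures.append(signature)
    return kept
```

Because the text is lowercased and whitespace-normalized before shingling, documents that differ only in case or spacing produce identical signatures and collapse to a single entry.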

Architecture Overview

Data Juicer organizes processing as a pipeline of composable operators defined in a YAML recipe. Operators fall into categories such as Formatter, Mapper, Filter, and Deduplicator. The execution engine runs locally for small datasets or distributes work across a Ray cluster for large-scale processing. Intermediate results are checkpointed so that a pipeline can resume after a failure.
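The operator pipeline and checkpoint-and-resume behavior described above can be sketched in plain Python. This is an illustrative model only, not Data Juicer's engine: the step kinds, the function names, and the JSON checkpoint layout (`next_step`, `samples`) are assumptions made for this example.

```python
import json
import os
from typing import Callable, Dict, List, Tuple

Sample = Dict[str, str]
Step = Tuple[str, Callable]  # (kind, operator); kind is "mapper", "filter", or "deduplicator"


def whitespace_mapper(sample: Sample) -> Sample:
    """Mapper: transform one sample (here, collapse runs of whitespace)."""
    return {**sample, "text": " ".join(sample["text"].split())}


def min_length_filter(min_len: int) -> Callable[[Sample], bool]:
    """Filter: return a predicate that decides whether a sample is kept."""
    def keep(sample: Sample) -> bool:
        return len(sample["text"]) >= min_len
    return keep


def exact_deduplicator(samples: List[Sample]) -> List[Sample]:
    """Deduplicator: works on the whole dataset and drops exact text repeats."""
    seen, unique = set(), []
    for sample in samples:
        if sample["text"] not in seen:
            seen.add(sample["text"])
            unique.append(sample)
    return unique


def run_pipeline(samples: List[Sample], steps: List[Step], checkpoint_path: str) -> List[Sample]:
    """Run steps in order, saving state after each one so a rerun can resume."""
    start = 0
    if os.path.exists(checkpoint_path):
        # Resume: skip completed steps and reuse their saved output.
        with open(checkpoint_path, encoding="utf-8") as f:
            state = json.load(f)
        start, samples = state["next_step"], state["samples"]
    for index in range(start, len(steps)):
        kind, op = steps[index]
        if kind == "mapper":
            samples = [op(s) for s in samples]
        elif kind == "filter":
            samples = [s for s in samples if op(s)]
        elif kind == "deduplicator":
            samples = op(samples)
        else:
            raise ValueError(f"unknown operator kind: {kind}")
        with open(checkpoint_path, "w", encoding="utf-8") as f:
            json.dump({"next_step": index + 1, "samples": samples}, f)
    return samples
```

The three kinds differ in scope: a mapper rewrites each sample, a filter keeps or drops each sample, and a deduplicator needs to see the whole dataset. If a run stops partway through, calling `run_pipeline` again with the same checkpoint path continues from the first step that did not finish.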

Self-Hosting & Configuration

  • Install via pip and define processing pipelines in YAML recipe files
  • Specify input datasets in JSON, Parquet, or Hugging Face format
  • Chain operators for language detection, text length filtering, perplexity scoring, and more
  • Configure Ray for distributed processing across multiple nodes
  • Use the built-in analysis tools to visualize data distributions before and after processing
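As a rough illustration of comparing data distributions before and after processing, the sketch below summarizes text lengths with the Python standard library. It does not use Data Juicer's analysis tools; `length_stats`, `length_histogram`, and the bin width are hypothetical names and values chosen for this example.

```python
import statistics
from typing import Dict, List


def length_stats(texts: List[str]) -> Dict[str, float]:
    """Summarize character lengths; an empty dataset reports only its count."""
    lengths = [len(t) for t in texts]
    if not lengths:
        return {"count": 0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
    }


def length_histogram(texts: List[str], bin_width: int = 20) -> Dict[int, int]:
    """Count texts per length bin, keyed by the lower edge of each bin."""
    counts: Dict[int, int] = {}
    for t in texts:
        lower_edge = (len(t) // bin_width) * bin_width
        counts[lower_edge] = counts.get(lower_edge, 0) + 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    before = ["ok", "a short line", "a considerably longer line of example text"]
    after = [t for t in before if len(t) >= 10]  # stand-in for a length filter
    print("before:", length_stats(before), length_histogram(before))
    print("after: ", length_stats(after), length_histogram(after))
```

Comparing the two summaries shows how many samples a threshold removes and how it shifts the distribution, which helps when choosing filter parameters.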

Key Features

  • 100+ built-in operators covering text, image, audio, and video modalities
  • YAML-based recipes for reproducible and shareable data processing pipelines
  • Scales from single-machine to distributed clusters using Ray
  • Data quality analysis with visual reports for informed operator selection
  • Supports synthetic data generation and data mixing strategies for training recipes

Comparison with Similar Tools

  • Dolma — AI2 toolkit focused on text-only web data; Data Juicer handles multimodal data
  • RedPajama — Provides curated datasets but not a general-purpose processing framework
  • Unstructured — Focused on document parsing and extraction, not training data curation
  • Datatrove — Hugging Face text processing library; Data Juicer adds multimodal support and analysis
  • NeMo Curator — NVIDIA toolkit tightly coupled with NeMo; Data Juicer is framework-agnostic

FAQ

Q: Does Data Juicer work with non-English data?
A: Yes. Many operators support multilingual text, and language detection is built in.

Q: Can I add custom operators?
A: Yes. You can write custom operators in Python and register them for use in YAML recipes.
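The register-then-reference pattern behind custom operators can be sketched as follows. This is a generic registry written for illustration, not Data Juicer's registration API, which has its own base classes and mechanism. The names `OPERATORS`, `register`, `word_count_filter`, and `build_filters` are hypothetical, and the recipe's process section is assumed here to be a list of single-key mappings from operator name to parameters.

```python
from typing import Callable, Dict, List, Optional

Sample = Dict[str, str]
Predicate = Callable[[Sample], bool]

# Hypothetical registry mapping operator names to factories that build predicates.
OPERATORS: Dict[str, Callable[..., Predicate]] = {}


def register(name: str) -> Callable:
    """Decorator that makes an operator factory available under a recipe name."""
    def decorator(factory: Callable[..., Predicate]) -> Callable[..., Predicate]:
        if name in OPERATORS:
            raise ValueError(f"operator already registered: {name}")
        OPERATORS[name] = factory
        return factory
    return decorator


@register("word_count_filter")
def word_count_filter(min_words: int = 1, max_words: int = 1000) -> Predicate:
    """Custom filter: keep samples whose word count lies within the given range."""
    def keep(sample: Sample) -> bool:
        return min_words <= len(sample["text"].split()) <= max_words
    return keep


def build_filters(process: List[Dict[str, Optional[dict]]]) -> List[Predicate]:
    """Turn recipe-style entries ({name: params}) into callable filters."""
    filters = []
    for entry in process:
        if len(entry) != 1:
            raise ValueError("each entry must name exactly one operator")
        name, params = next(iter(entry.items()))
        if name not in OPERATORS:
            raise KeyError(f"unknown operator: {name}")
        filters.append(OPERATORS[name](**(params or {})))
    return filters
```

With a registry like this, a recipe refers to an operator only by its name and parameters, so new operators can be added without changing the code that runs the pipeline.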

Q: What scale can Data Juicer handle?
A: With Ray, it has been tested on datasets exceeding 100 billion tokens.

Q: Is Data Juicer only for LLM training data?
A: No. Although it is designed for foundation model data, its operators are general enough for other data cleaning and analysis tasks.
