# DocETL — LLM-Powered Document Processing Pipelines

> Declarative YAML pipelines for LLM document analysis with map, reduce, and resolve operators. By UC Berkeley. 3.7K+ stars.

## Quick Use

```bash
pip install docetl
```

Create a pipeline YAML file (`pipeline.yaml`):

```yaml
datasets:
  papers:
    type: file
    path: "papers.json"

operations:
  - name: summarize
    type: map
    prompt: |
      Summarize the following research paper in 3 sentences:
      {{ input.content }}
    output:
      schema:
        summary: string

  - name: group_by_topic
    type: reduce
    reduce_key: topic
    prompt: |
      Given these paper summaries about {{ reduce_key }}:
      {% for item in inputs %}
      - {{ item.summary }}
      {% endfor %}
      Write a comprehensive overview of this research area.
    output:
      schema:
        overview: string

pipeline:
  steps:
    - name: summarize_papers
      input: papers
      operations: [summarize]
    - name: create_overview
      input: summarize_papers
      operations: [group_by_topic]
  output:
    type: file
    path: "output.json"
```

Run it:

```bash
docetl run pipeline.yaml
```

Or use the interactive playground at [docetl.org/playground](https://docetl.org/playground).

---

## Intro

DocETL is an open-source framework from UC Berkeley's EPIC Lab, with 3,700+ GitHub stars, for building LLM-powered document processing pipelines. It lets you define complex document analysis workflows declaratively in YAML, using operators such as map (process each document), reduce (aggregate groups), resolve (entity resolution), and filter. An accompanying interactive UI, DocWrangler, lets you build and test pipelines visually. Backed by peer-reviewed research (published at VLDB 2025), DocETL includes an automatic optimizer that rewrites pipelines for better output quality.

Works with: OpenAI, Anthropic Claude, AWS Bedrock, and any LiteLLM-compatible model.

Best for: researchers and data teams processing large document collections with LLMs. Setup time: under 5 minutes.
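The `papers.json` file referenced by the dataset above is a JSON array of records whose fields feed the Jinja templates in the prompts. A minimal sketch for producing one — the sample titles and contents are hypothetical; only the `content` and `topic` keys are assumed by the prompts in the pipeline above:

```python
import json

# Hypothetical sample records: the summarize prompt reads `content`,
# and the group_by_topic operation groups records on `topic`.
papers = [
    {"title": "Paper A", "topic": "entity resolution",
     "content": "Full text of paper A..."},
    {"title": "Paper B", "topic": "entity resolution",
     "content": "Full text of paper B..."},
]

with open("papers.json", "w") as f:
    json.dump(papers, f, indent=2)
```

Any other fields you include pass through the pipeline unchanged and can be referenced from later prompts.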
---

## DocETL Pipeline Operators

### Core Operators

| Operator | Description | Use Case |
|----------|-------------|----------|
| **map** | Process each document independently | Summarize, extract entities, classify |
| **reduce** | Aggregate multiple documents by key | Create overviews, merge findings |
| **resolve** | Entity resolution across documents | Deduplicate authors, normalize names |
| **filter** | Keep/remove documents by condition | Quality filtering, relevance checks |
| **unnest** | Flatten nested arrays | Expand multi-value fields |
| **split** | Break documents into chunks | Handle long documents |
| **gather** | Collect results from parallel branches | Merge pipeline outputs |
| **equijoin** | Join two datasets by key | Combine data sources |

### Map Operator — Process Each Document

```yaml
- name: extract_findings
  type: map
  prompt: |
    Extract the key findings from this paper:
    Title: {{ input.title }}
    Abstract: {{ input.abstract }}
    Full text: {{ input.content }}
    Return structured findings.
  output:
    schema:
      findings: "list[string]"
      methodology: string
      confidence: string
  model: gpt-4o
```

### Reduce Operator — Aggregate by Key

```yaml
- name: synthesize_by_field
  type: reduce
  reduce_key: research_field
  prompt: |
    You are analyzing {{ inputs | length }} papers in {{ reduce_key }}.
    Papers:
    {% for paper in inputs %}
    - {{ paper.title }}: {{ paper.findings | join(', ') }}
    {% endfor %}
    Write a synthesis of the current state of research.
  output:
    schema:
      synthesis: string
      key_trends: "list[string]"
      open_questions: "list[string]"
```

### Resolve Operator — Entity Resolution

```yaml
- name: deduplicate_authors
  type: resolve
  comparison_prompt: |
    Are these the same person?
    Author A: {{ input1.author_name }} from {{ input1.institution }}
    Author B: {{ input2.author_name }} from {{ input2.institution }}
  resolution_prompt: |
    Merge these author records into one canonical record.
  output:
    schema:
      canonical_name: string
      institution: string
```

### Pipeline Optimizer

DocETL includes an automatic optimizer that rewrites your pipeline:

```bash
docetl optimize pipeline.yaml
```

The optimizer can:

- Add **gleaning** (iterative refinement) to improve map quality
- Insert **chunking** for long documents that exceed context limits
- Add **resolve** steps to handle entity inconsistencies
- Parallelize independent operations for speed

### DocWrangler — Interactive UI

The web playground at [docetl.org/playground](https://docetl.org/playground) provides:

- Visual pipeline builder with drag-and-drop operators
- Real-time output preview for each step
- Prompt iteration and A/B testing
- Export to YAML for production use

### Real-World Applications

| Application | Pipeline Design |
|-------------|-----------------|
| **Literature review** | Map (summarize) → Reduce (synthesize by topic) → Map (generate insights) |
| **Contract analysis** | Map (extract clauses) → Filter (flag risky clauses) → Reduce (risk report) |
| **Resume screening** | Map (extract skills) → Resolve (normalize titles) → Filter (match requirements) |
| **Patent analysis** | Map (extract claims) → Reduce (cluster by technology) → Map (novelty assessment) |
| **Survey analysis** | Map (categorize responses) → Reduce (aggregate by theme) → Map (generate report) |

---

## FAQ

**Q: What is DocETL?**
A: DocETL is an open-source framework from UC Berkeley for building LLM-powered document processing pipelines using declarative YAML, with operators such as map, reduce, resolve, and filter. It has 3,700+ GitHub stars and was published at VLDB 2025.

**Q: How is DocETL different from LangChain or LlamaIndex?**
A: DocETL is purpose-built for document ETL (Extract, Transform, Load) with declarative pipelines, while LangChain and LlamaIndex are general-purpose LLM frameworks. DocETL excels at batch processing hundreds of documents with complex aggregation logic that would be tedious to code manually.
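To make that comparison concrete, here is a plain-Python sketch of the manual map/reduce loop that a DocETL pipeline replaces; `summarize` and `synthesize` are hypothetical stand-ins for the LLM calls, not DocETL APIs:

```python
from collections import defaultdict

def summarize(doc):
    # Hypothetical stand-in for the LLM call a map operator issues per document.
    return f"Summary of {doc['title']}"

def synthesize(topic, summaries):
    # Hypothetical stand-in for the LLM call a reduce operator issues per group.
    return f"{topic}: overview built from {len(summaries)} summaries"

def run_pipeline(docs):
    # map step: process each document independently
    for doc in docs:
        doc["summary"] = summarize(doc)
    # reduce step: aggregate summaries by key (here, `topic`)
    groups = defaultdict(list)
    for doc in docs:
        groups[doc["topic"]].append(doc["summary"])
    return {topic: synthesize(topic, s) for topic, s in groups.items()}

docs = [
    {"title": "Paper A", "topic": "nlp"},
    {"title": "Paper B", "topic": "nlp"},
]
result = run_pipeline(docs)  # → {"nlp": "nlp: overview built from 2 summaries"}
```

DocETL expresses this same pattern declaratively in YAML, and layers output schemas and the optimizer's rewrites on top of the raw LLM calls.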
**Q: Is DocETL free?**
A: Yes, it is fully open-source under the MIT license. You bring your own LLM API keys.

---

## Source & Thanks

> Created by [UC Berkeley EPIC Lab](https://github.com/ucbepic). Licensed under MIT.
>
> [docetl](https://github.com/ucbepic/docetl) — ⭐ 3,700+

Thanks to the UC Berkeley EPIC Lab for advancing the science of LLM-powered document processing.

---

Source: https://tokrepo.com/en/workflows/ef81583e-45e5-4134-b25b-04e486ae2d06
Author: TokRepo精选