# DocETL Pipeline Operators

## Core Operators
| Operator | Description | Use Case |
|---|---|---|
| map | Process each document independently | Summarize, extract entities, classify |
| reduce | Aggregate multiple documents by key | Create overviews, merge findings |
| resolve | Entity resolution across documents | Deduplicate authors, normalize names |
| filter | Keep/remove documents by condition | Quality filtering, relevance check |
| unnest | Flatten nested arrays | Expand multi-value fields |
| split | Break documents into chunks | Handle long documents |
| gather | Add surrounding context to chunks produced by split | Keep cross-chunk context in long documents |
| equijoin | Join two datasets by key | Combine data sources |
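The sections below show map, reduce, and resolve in detail. The other operators follow the same declarative shape; as a hedged sketch, a filter operator might look like the following (the field names and the single-boolean output convention are illustrative assumptions, not copied from DocETL's docs):

```yaml
# Hypothetical filter: keep only documents the model judges relevant.
# "title" and "abstract" are placeholders for your dataset's keys.
- name: keep_relevant
  type: filter
  prompt: |
    Is this paper relevant to the research question?
    Title: {{ input.title }}
    Abstract: {{ input.abstract }}
  output:
    schema:
      keep_document: boolean   # documents where this is false are dropped
```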
## Map Operator — Process Each Document

```yaml
- name: extract_findings
  type: map
  prompt: |
    Extract the key findings from this paper:
    Title: {{ input.title }}
    Abstract: {{ input.abstract }}
    Full text: {{ input.content }}
    Return structured findings.
  output:
    schema:
      findings: "list[string]"
      methodology: string
      confidence: string
  model: gpt-4o
```

## Reduce Operator — Aggregate by Key
```yaml
- name: synthesize_by_field
  type: reduce
  reduce_key: research_field
  prompt: |
    You are analyzing {{ inputs | length }} papers in {{ reduce_key }}.
    Papers:
    {% for paper in inputs %}
    - {{ paper.title }}: {{ paper.findings | join(', ') }}
    {% endfor %}
    Write a synthesis of the current state of research.
  output:
    schema:
      synthesis: string
      key_trends: "list[string]"
      open_questions: "list[string]"
```

## Resolve Operator — Entity Resolution
```yaml
- name: deduplicate_authors
  type: resolve
  comparison_prompt: |
    Are these the same person?
    Author A: {{ input1.author_name }} from {{ input1.institution }}
    Author B: {{ input2.author_name }} from {{ input2.institution }}
  resolution_prompt: |
    Merge these author records into one canonical record.
  output:
    schema:
      canonical_name: string
      institution: string
```

## Pipeline Optimizer
DocETL includes an automatic optimizer that rewrites your pipeline:

```bash
docetl optimize pipeline.yaml
```

The optimizer can:
- Add gleaning (iterative refinement) to improve map quality
- Insert chunking for long documents that exceed context limits
- Add resolve steps to handle entity inconsistencies
- Parallelize independent operations for speed
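For instance, the gleaning rewrite wraps a map operator in a validate-and-retry loop. A sketch of what an optimized operator could look like (the exact `gleaning` key names here are assumptions based on DocETL's documented feature, so check the docs before relying on them):

```yaml
- name: extract_findings
  type: map
  prompt: |
    Extract the key findings from this paper: {{ input.content }}
  gleaning:               # added by the optimizer: iterative refinement
    num_rounds: 2         # at most two refinement passes per document
    validation_prompt: |
      Are all extracted findings specific and supported by the text?
  output:
    schema:
      findings: "list[string]"
```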
## DocWrangler — Interactive UI
The web playground at docetl.org/playground provides:
- Visual pipeline builder with drag-and-drop operators
- Real-time output preview for each step
- Prompt iteration and A/B testing
- Export to YAML for production use
## Real-World Applications
| Application | Pipeline Design |
|---|---|
| Literature review | Map (summarize) → Reduce (synthesize by topic) → Map (generate insights) |
| Contract analysis | Map (extract clauses) → Filter (flag risky clauses) → Reduce (risk report) |
| Resume screening | Map (extract skills) → Resolve (normalize titles) → Filter (match requirements) |
| Patent analysis | Map (extract claims) → Reduce (cluster by technology) → Map (novelty assessment) |
| Survey analysis | Map (categorize responses) → Reduce (aggregate by theme) → Map (generate report) |
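As an illustration, the literature-review row above could be wired into a complete pipeline file roughly like this (dataset paths, step names, and the two-operator chain are placeholders; the top-level datasets/operations/pipeline layout is a sketch of DocETL's config structure, not a verbatim copy of its docs):

```yaml
datasets:
  papers:
    type: file
    path: papers.json              # placeholder: your input documents

default_model: gpt-4o

operations:
  - name: summarize_paper          # Map: one summary per paper
    type: map
    prompt: |
      Summarize this paper and name its topic: {{ input.content }}
    output:
      schema:
        summary: string
        topic: string
  - name: synthesize_by_topic      # Reduce: merge summaries per topic
    type: reduce
    reduce_key: topic
    prompt: |
      Synthesize these {{ inputs | length }} summaries:
      {% for p in inputs %}
      - {{ p.summary }}
      {% endfor %}
    output:
      schema:
        synthesis: string

pipeline:
  steps:
    - name: literature_review
      input: papers
      operations:
        - summarize_paper
        - synthesize_by_topic
  output:
    type: file
    path: review.json              # placeholder: where results land
```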
## FAQ
**Q: What is DocETL?**
A: DocETL is an open-source framework from UC Berkeley for building LLM-powered document processing pipelines using declarative YAML, with operators like map, reduce, resolve, and filter. It has 3,700+ GitHub stars and was published at VLDB 2025.

**Q: How is DocETL different from LangChain or LlamaIndex?**
A: DocETL is purpose-built for document ETL (Extract, Transform, Load) with declarative pipelines, while LangChain and LlamaIndex are general-purpose LLM frameworks. DocETL excels at batch processing hundreds of documents with complex aggregation logic that would be tedious to code manually.

**Q: Is DocETL free?**
A: Yes, it is fully open-source under the MIT license; you bring your own LLM API keys.