DocETL — LLM-Powered Document Processing Pipelines
Declarative YAML pipelines for LLM document analysis with map, reduce, and resolve operators. By UC Berkeley. 3.7K+ stars.
What it is
DocETL is a Python framework for building LLM-powered document processing pipelines using declarative YAML configuration. Developed at UC Berkeley, it provides operators like map, reduce, resolve, filter, and split that chain together into data processing workflows. Each operator can call LLMs for tasks like summarization, extraction, classification, and entity resolution.
DocETL targets researchers, data engineers, and developers who need to process large document collections with LLMs. Instead of writing custom Python scripts for each processing step, you define pipelines in YAML and DocETL handles execution, caching, and error recovery.
How it saves time or tokens
DocETL abstracts the boilerplate of LLM document processing. The map operator processes each document independently (parallelizable), reduce aggregates results, and resolve handles entity deduplication. Built-in caching avoids reprocessing unchanged documents. The declarative approach means you focus on what to extract rather than how to orchestrate API calls.
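For instance, keeping only relevant documents is a few lines of declarative YAML rather than a custom script. A minimal sketch of a filter operation (the operation and field names here, such as `keep_ml_papers` and `is_ml_paper`, are illustrative; DocETL's filter expects an output schema with a single boolean field):

```yaml
- name: keep_ml_papers
  type: filter
  prompt: |
    Is the following paper about machine learning? Answer true or false.
    {{ input.text }}
  output:
    schema:
      is_ml_paper: boolean
```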
How to use
- Install DocETL: `pip install docetl`
- Create a pipeline YAML file defining your datasets, operations, and output.
- Run the pipeline: `docetl run pipeline.yaml`
Example
# pipeline.yaml
datasets:
  papers:
    type: file
    path: 'papers.json'

operations:
  - name: summarize
    type: map
    prompt: |
      Summarize the following paper in 3 sentences:
      {{ input.text }}
    output:
      schema:
        summary: string

  - name: categorize
    type: map
    prompt: |
      Classify this paper into one category:
      {{ input.summary }}
    output:
      schema:
        category: string

  - name: aggregate
    type: reduce
    reduce_key: category
    prompt: |
      Summarize these papers in the {{ reduce_key }} category:
      {% for paper in inputs %}
      - {{ paper.summary }}
      {% endfor %}
    output:
      schema:
        category_summary: string

pipeline:
  steps:
    - input: papers
      operations: [summarize, categorize, aggregate]
  output:
    type: file
    path: 'results.json'
Related on TokRepo
- Document AI Tools — Document processing and analysis tools
- RAG Tools — Retrieval and document understanding
Common pitfalls
- LLM costs scale with document collection size. A map operation over 10,000 documents means 10,000 LLM calls. Estimate costs before running on large collections.
- The resolve operator for entity deduplication requires careful prompt engineering. Ambiguous entity descriptions produce poor deduplication results.
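Resolve operations take two prompts: one to decide whether a pair of records match, and one to merge the matched group into a canonical value. A sketch of the shape (key names like `comparison_prompt` and `resolution_prompt` follow DocETL's documented pattern, but the field name `author` and the operation name are illustrative; check the docs for your version):

```yaml
- name: dedupe_authors
  type: resolve
  comparison_prompt: |
    Do these two records refer to the same author?
    Record 1: {{ input1.author }}
    Record 2: {{ input2.author }}
  resolution_prompt: |
    Merge the following author records into one canonical name:
    {% for entry in inputs %}
    - {{ entry.author }}
    {% endfor %}
  output:
    schema:
      author: string
```

The more specific and unambiguous the comparison prompt (full names, affiliations, identifying context), the better the deduplication.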
- Pipeline YAML syntax errors fail at runtime, not at parse time. Validate your YAML structure before running on expensive datasets.
Frequently Asked Questions
What operators does DocETL provide?
DocETL provides map (process each document), reduce (aggregate groups), resolve (entity deduplication), filter (keep/remove documents), split (chunk documents), gather (collect context), and unnest (flatten nested results).
Which LLM providers does DocETL support?
DocETL works with any OpenAI-compatible API, including OpenAI, Anthropic (via proxy), and local models through Ollama or vLLM. Configure the provider in your pipeline YAML.
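Model selection lives in the YAML as well. A sketch assuming LiteLLM-style model strings and DocETL's `default_model` / per-operation `model` keys (verify both against your version's docs):

```yaml
default_model: gpt-4o-mini      # pipeline-wide default (LiteLLM model string)

operations:
  - name: summarize
    type: map
    model: ollama/llama3        # per-operation override, e.g. a local Ollama model
    prompt: |
      Summarize in 3 sentences: {{ input.text }}
    output:
      schema:
        summary: string
```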
Does DocETL cache results?
Yes. DocETL caches operation results so re-running a pipeline skips already-processed documents. This is useful for iterating on later pipeline stages without reprocessing earlier ones.
How does DocETL handle failed LLM calls?
DocETL retries failed LLM calls with configurable retry counts and backoff. Failed documents can be logged and skipped rather than stopping the entire pipeline.
Is DocETL suitable for production workloads?
DocETL is designed for research and batch processing workflows. For production real-time document processing, you may need additional infrastructure for scaling and monitoring. DocETL excels at exploratory analysis and batch ETL.
Citations (3)
- DocETL GitHub — DocETL provides declarative YAML pipelines for LLM document analysis
- DocETL Documentation — Map, reduce, and resolve operators for document processing
- UC Berkeley EPIC Lab — LLM-powered document processing research at UC Berkeley
Source & Thanks
Created by UC Berkeley EPIC Lab. Licensed under MIT.
docetl — ⭐ 3,700+
Thanks to the UC Berkeley EPIC Lab for advancing the science of LLM-powered document processing.
Related Assets
Conda — Cross-Platform Package and Environment Manager
Install, update, and manage packages and isolated environments for Python, R, C/C++, and hundreds of other languages from a single tool.
Sphinx — Python Documentation Generator
Generate professional documentation from reStructuredText and Markdown with cross-references, API autodoc, and multiple output formats.
Neutralinojs — Lightweight Cross-Platform Desktop Apps
Build desktop applications with HTML, CSS, and JavaScript using a tiny native runtime instead of bundling Chromium.