Apr 2, 2026 · 3 min read

DocETL — LLM-Powered Document Processing Pipelines

Declarative YAML pipelines for LLM document analysis with map, reduce, and resolve operators. By UC Berkeley. 3.7K+ stars.

TL;DR
DocETL lets you build LLM-powered document processing pipelines using declarative YAML with map, reduce, and resolve operators.
§01

What it is

DocETL is a Python framework for building LLM-powered document processing pipelines using declarative YAML configuration. Developed at UC Berkeley, it provides operators like map, reduce, resolve, filter, and split that chain together into data processing workflows. Each operator can call LLMs for tasks like summarization, extraction, classification, and entity resolution.

DocETL targets researchers, data engineers, and developers who need to process large document collections with LLMs. Instead of writing custom Python scripts for each processing step, you define pipelines in YAML and DocETL handles execution, caching, and error recovery.

§02

How it saves time or tokens

DocETL abstracts the boilerplate of LLM document processing. The map operator processes each document independently (parallelizable), reduce aggregates results, and resolve handles entity deduplication. Built-in caching avoids reprocessing unchanged documents. The declarative approach means you focus on what to extract rather than how to orchestrate API calls.
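The data flow described above can be sketched in plain Python. This is an illustration of the map-then-reduce pattern only, not DocETL's implementation: `fake_llm`, `run_map`, and `run_reduce` are hypothetical names, and the model call is replaced by a stub so the sketch runs offline.

```python
from collections import defaultdict
from typing import Callable


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call: returns the first 40 characters."""
    return prompt[:40]


def run_map(docs: list[dict], make_prompt: Callable[[dict], str], out_key: str) -> list[dict]:
    # One call per document. The calls do not depend on each other,
    # which is why a map step can be parallelized.
    return [{**doc, out_key: fake_llm(make_prompt(doc))} for doc in docs]


def run_reduce(docs: list[dict], key: str,
               make_prompt: Callable[[str, list[dict]], str], out_key: str) -> list[dict]:
    # Group documents by the reduce key, then make one call per group.
    groups: dict[str, list[dict]] = defaultdict(list)
    for doc in docs:
        groups[doc[key]].append(doc)
    return [{key: k, out_key: fake_llm(make_prompt(k, members))}
            for k, members in groups.items()]


docs = [
    {"text": "Paper on query optimizers", "category": "databases"},
    {"text": "Paper on index tuning", "category": "databases"},
    {"text": "Paper on transformers", "category": "ml"},
]
mapped = run_map(docs, lambda d: f"Summarize: {d['text']}", "summary")
reduced = run_reduce(
    mapped, "category",
    lambda k, members: f"{k}: " + "; ".join(m["summary"] for m in members),
    "overview",
)
print(len(mapped), "map calls,", len(reduced), "reduce calls")
```

Three documents produce three map calls but only two reduce calls, one per category. In a declarative pipeline this orchestration is what the framework takes over.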

§03

How to use

  1. Install DocETL: pip install docetl.
  2. Create a pipeline YAML file defining your datasets, operations, and output.
  3. Run the pipeline: docetl run pipeline.yaml.
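Before step 3, a quick structural check can catch wiring mistakes without spending any tokens. The sketch below is a hypothetical helper, not part of DocETL: `check_pipeline` takes the pipeline as a Python dict (for example, the result of loading the YAML file with PyYAML's `yaml.safe_load`) and verifies only that each step refers to a defined dataset and defined operations.

```python
def check_pipeline(config: dict) -> list[str]:
    """Return a list of wiring problems in a pipeline config dict (empty if none found)."""
    problems = []
    datasets = config.get("datasets", {})
    defined_ops = {op.get("name") for op in config.get("operations", [])}
    for step in config.get("pipeline", {}).get("steps", []):
        if step.get("input") not in datasets:
            problems.append(f"unknown dataset: {step.get('input')}")
        for op_name in step.get("operations", []):
            if op_name not in defined_ops:
                problems.append(f"undefined operation: {op_name}")
    return problems


# A deliberately broken config: the step uses "categorize", which is never defined.
config = {
    "datasets": {"papers": {"type": "file", "path": "papers.json"}},
    "operations": [{"name": "summarize", "type": "map"}],
    "pipeline": {
        "steps": [{"input": "papers", "operations": ["summarize", "categorize"]}]
    },
}
print(check_pipeline(config))  # ['undefined operation: categorize']
```

The check is intentionally shallow. It does not validate prompts or output schemas, so a clean result does not guarantee the pipeline will run.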
§04

Example

# pipeline.yaml
datasets:
  papers:
    type: file
    path: 'papers.json'

operations:
  - name: summarize
    type: map
    prompt: |
      Summarize the following paper in 3 sentences:
      {{ input.text }}
    output:
      schema:
        summary: string

  - name: categorize
    type: map
    prompt: |
      Classify this paper into one category:
      {{ input.summary }}
    output:
      schema:
        category: string

  - name: aggregate
    type: reduce
    reduce_key: category
    prompt: |
      Summarize these papers in the {{ reduce_key }} category:
      {% for paper in inputs %}
      - {{ paper.summary }}
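      {% endfor %}
    output:
      schema:
        category_summary: string  # hypothetical field name; reduce needs an output schema like the map steps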

pipeline:
  steps:
    - name: summarize_by_category  # hypothetical step name
      input: papers
      operations: [summarize, categorize, aggregate]
  output:
    type: file
    path: 'results.json'
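The example reads its input from papers.json. A minimal sketch for producing a file in a compatible shape follows, assuming the dataset is a JSON array of objects and that each object carries the `text` field the first prompt references. The `write_dataset` helper and the sample records are illustrative, not taken from the DocETL project.

```python
import json
from pathlib import Path


def write_dataset(records: list[dict], path: str) -> None:
    """Write records as a JSON array, one object per document."""
    missing = [i for i, r in enumerate(records) if "text" not in r]
    if missing:
        # The summarize prompt uses {{ input.text }}, so every record needs it.
        raise ValueError(f"records missing 'text': {missing}")
    Path(path).write_text(json.dumps(records, indent=2), encoding="utf-8")


# Placeholder records for illustration only.
papers = [
    {"title": "Sample paper A", "text": "Full text of sample paper A."},
    {"title": "Sample paper B", "text": "Full text of sample paper B."},
]
write_dataset(papers, "papers.json")
```

Extra fields such as `title` are carried alongside `text`, and a prompt can reference them the same way, for example `{{ input.title }}`.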
§06

Common pitfalls

  • LLM costs scale with document collection size. A map operation over 10,000 documents means 10,000 LLM calls. Estimate costs before running on large collections.
  • The resolve operator for entity deduplication requires careful prompt engineering. Ambiguous entity descriptions produce poor deduplication results.
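  • Many pipeline mistakes surface only at run time. The YAML can parse cleanly while a prompt references a field that no earlier operation produces, or a step names an operation that was never defined. Check the structure, and try the pipeline on a small sample, before running it on expensive datasets.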
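For the first pitfall, a back-of-the-envelope estimate is usually enough to decide whether a run is affordable. The sketch below is a rough calculator, not a DocETL feature. The token counts and per-million-token prices are placeholder assumptions; substitute your provider's current rates.

```python
def estimate_cost(
    num_docs: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    usd_per_m_input: float,
    usd_per_m_output: float,
    calls_per_doc: int = 1,
) -> float:
    """Estimate total spend in USD for per-document LLM calls."""
    calls = num_docs * calls_per_doc
    input_cost = calls * avg_input_tokens / 1_000_000 * usd_per_m_input
    output_cost = calls * avg_output_tokens / 1_000_000 * usd_per_m_output
    return round(input_cost + output_cost, 2)


# 10,000 documents, two map operations each, with made-up prices.
estimate = estimate_cost(
    num_docs=10_000,
    avg_input_tokens=3_000,
    avg_output_tokens=150,
    usd_per_m_input=0.50,   # hypothetical price per million input tokens
    usd_per_m_output=1.50,  # hypothetical price per million output tokens
    calls_per_doc=2,
)
print(f"${estimate}")  # $34.5
```

The estimate ignores reduce and resolve calls, retries, and cache hits, so treat it as an order-of-magnitude figure.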

Frequently Asked Questions

What operators does DocETL provide?

DocETL provides map (process each document), reduce (aggregate groups), resolve (entity deduplication), filter (keep/remove documents), split (chunk documents), gather (collect context), and unnest (flatten nested results).
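To make the resolve operator more concrete, here is a sketch of entity resolution as pairwise comparison followed by merging. It is not DocETL's algorithm. `same_entity` is a hypothetical stand-in for an LLM comparison prompt, replaced here by simple string normalization, and the merging uses a small union-find.

```python
from itertools import combinations


def same_entity(a: str, b: str) -> bool:
    """Hypothetical stand-in for an LLM comparison: ignore case and punctuation."""
    def norm(s: str) -> str:
        return "".join(ch for ch in s.lower() if ch.isalnum())
    return norm(a) == norm(b)


def resolve(names: list[str]) -> dict[str, str]:
    """Map every name to one canonical representative of its cluster."""
    parent = {n: n for n in names}

    def find(n: str) -> str:
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    # Compare every pair; merge the clusters of names judged to be the same entity.
    for a, b in combinations(names, 2):
        if same_entity(a, b):
            parent[find(b)] = find(a)
    return {n: find(n) for n in names}


mapping = resolve(["UC Berkeley", "uc berkeley", "U.C. Berkeley", "MIT"])
print(mapping)
```

All three Berkeley variants map to "UC Berkeley" and "MIT" maps to itself. Comparing every pair grows quadratically with the number of entities, which is one reason deduplication can be expensive when each comparison is a model call.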

Which LLM providers does DocETL support?

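DocETL sends model calls through LiteLLM, so it can use OpenAI, Anthropic, and other hosted providers, as well as local models served through Ollama or vLLM. Set the model in your pipeline YAML (a top-level default_model, or a model on an individual operation) and supply API keys through environment variables.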

Can DocETL cache results?

Yes. DocETL caches operation results so re-running a pipeline skips already-processed documents. This is useful for iterating on later pipeline stages without reprocessing earlier ones.
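The general idea behind this kind of caching can be sketched as follows. This is an assumption about how such a cache might be keyed, not a description of DocETL's actual cache. `OperationCache` is a hypothetical class that hashes the operation name, the prompt, and the document, so a result is reused only when all three are unchanged.

```python
import hashlib
import json
from typing import Callable


class OperationCache:
    """In-memory cache keyed on (operation name, prompt, document content)."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}
        self.misses = 0

    def _key(self, op_name: str, prompt: str, doc: dict) -> str:
        # sort_keys makes the hash independent of dict key order.
        payload = json.dumps({"op": op_name, "prompt": prompt, "doc": doc}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_compute(self, op_name: str, prompt: str, doc: dict,
                       compute: Callable[[dict], str]) -> str:
        key = self._key(op_name, prompt, doc)
        if key not in self._store:
            self.misses += 1  # only a miss triggers the expensive call
            self._store[key] = compute(doc)
        return self._store[key]


cache = OperationCache()
doc = {"text": "A paper about caching."}
summarize = lambda d: d["text"].upper()  # stand-in for a model call

cache.get_or_compute("summarize", "Summarize:", doc, summarize)
cache.get_or_compute("summarize", "Summarize:", doc, summarize)        # cache hit
cache.get_or_compute("summarize", "Summarize briefly:", doc, summarize)  # prompt changed
print(cache.misses)  # 2
```

Including the prompt in the key matters when iterating: editing a later operation's prompt invalidates only that operation's entries, while earlier stages stay cached.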

How does DocETL handle errors?

DocETL retries failed LLM calls with configurable retry counts and backoff. Failed documents can be logged and skipped rather than stopping the entire pipeline.
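Retry with backoff, as described above, follows a common pattern that can be sketched generically. This is not DocETL's code: `call_with_retry` and its parameters are hypothetical, and the `sleep` argument is injectable so the example runs instantly.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(fn: Callable[[], T], max_retries: int = 3,
                    base_delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying on any exception with exponentially growing delays."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: let the caller log and skip this document
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("unreachable")


# Demo: a call that fails twice, then succeeds. Delays are recorded, not slept.
attempts = {"count": 0}
delays: list[float] = []


def flaky_call() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


result = call_with_retry(flaky_call, max_retries=3, base_delay=0.5, sleep=delays.append)
print(result, delays)  # ok [0.5, 1.0]
```

To skip failed documents instead of halting, a caller can wrap each document's call in a try/except around `call_with_retry` and collect the failures for later inspection.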

Is DocETL suitable for production use?

DocETL is designed for research and batch processing workflows. For production real-time document processing, you may need additional infrastructure for scaling and monitoring. DocETL excels at exploratory analysis and batch ETL.


Source & Thanks

Created by UC Berkeley EPIC Lab. Licensed under MIT.

docetl — ⭐ 3,700+

Thanks to the UC Berkeley EPIC Lab for advancing the science of LLM-powered document processing.
