Knowledge · Apr 2, 2026 · 3 min read

DocETL — LLM-Powered Document Processing Pipelines

Declarative YAML pipelines for LLM document analysis with map, reduce, and resolve operators. By UC Berkeley. 3.7K+ stars.

AI · Open Source · Community
Quick Use

Use it first, then decide how deep to go

Everything you need to copy, install, and run first is below.

pip install docetl

Create a pipeline YAML file (pipeline.yaml):

datasets:
  papers:
    type: file
    path: "papers.json"

operations:
  - name: summarize
    type: map
    prompt: |
      Summarize the following research paper in 3 sentences:
      {{ input.content }}
    output:
      schema:
        summary: string

  - name: group_by_topic
    type: reduce
    reduce_key: topic
    prompt: |
      Given these paper summaries about {{ reduce_key }}:
      {% for item in inputs %}
      - {{ item.summary }}
      {% endfor %}
      Write a comprehensive overview of this research area.
    output:
      schema:
        overview: string

pipeline:
  steps:
    - name: summarize_papers
      input: papers
      operations: [summarize]
    - name: create_overview
      input: summarize_papers
      operations: [group_by_topic]
  output:
    type: file
    path: "output.json"

Run it:

docetl run pipeline.yaml

Or use the interactive playground at docetl.org/playground.
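The Quick Use pipeline expects papers.json to be a JSON array of objects. The field names below are a sketch: `content` and `topic` are required because the map prompt references `{{ input.content }}` and the reduce step uses `reduce_key: topic`; any other fields (like `title`) are illustrative and carried through unchanged.

```json
[
  {
    "title": "Paper A",
    "topic": "retrieval",
    "content": "Full text of paper A..."
  },
  {
    "title": "Paper B",
    "topic": "retrieval",
    "content": "Full text of paper B..."
  }
]
```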


Intro

DocETL is an open-source framework from UC Berkeley's EPIC Lab (3,700+ GitHub stars) for building LLM-powered document processing pipelines. It lets you define complex document analysis workflows declaratively in YAML, using operators like map (process each document), reduce (aggregate groups), resolve (entity resolution), and filter. An accompanying interactive UI called DocWrangler lets you build and test pipelines visually. Backed by peer-reviewed research (published at VLDB 2025), DocETL includes an automatic optimizer that rewrites pipelines for better output quality.

Works with OpenAI, Anthropic Claude, AWS Bedrock, and any LiteLLM-compatible model. Best for researchers and data teams processing large document collections with LLMs. Setup time: under 5 minutes.
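Models can be set per operation, as in the examples below, or once for the whole pipeline. A minimal sketch, assuming your DocETL version supports a top-level `default_model` key and LiteLLM-style model names:

```yaml
# Sketch: pipeline-wide default model, overridden per operation where needed.
default_model: gpt-4o-mini

operations:
  - name: summarize
    type: map
    model: claude-3-5-sonnet-20240620   # any LiteLLM-compatible model name
    prompt: |
      Summarize: {{ input.content }}
    output:
      schema:
        summary: string
```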


DocETL Pipeline Operators

Core Operators

| Operator | Description | Use Case |
|----------|-------------|----------|
| map | Process each document independently | Summarize, extract entities, classify |
| reduce | Aggregate multiple documents by key | Create overviews, merge findings |
| resolve | Entity resolution across documents | Deduplicate authors, normalize names |
| filter | Keep/remove documents by condition | Quality filtering, relevance checks |
| unnest | Flatten nested arrays | Expand multi-value fields |
| split | Break documents into chunks | Handle long documents |
| gather | Collect results from parallel branches | Merge pipeline outputs |
| equijoin | Join two datasets by key | Combine data sources |
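Of the core operators, only map, reduce, and resolve get full examples below. As a sketch of filter, assuming it follows the same prompt/output pattern with a single boolean output field deciding whether a document is kept (the operation name and field name here are illustrative):

```yaml
- name: keep_empirical_papers
  type: filter
  prompt: |
    Does this paper report empirical results (experiments on real data)?
    Title: {{ input.title }}
    Abstract: {{ input.abstract }}
  output:
    schema:
      is_empirical: boolean   # single boolean field: true keeps the document
```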

Map Operator — Process Each Document

- name: extract_findings
  type: map
  prompt: |
    Extract the key findings from this paper:
    Title: {{ input.title }}
    Abstract: {{ input.abstract }}
    Full text: {{ input.content }}

    Return structured findings.
  output:
    schema:
      findings: "list[string]"
      methodology: string
      confidence: string
  model: gpt-4o
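A map operation adds its output schema fields to each input document rather than replacing it, so a record that went in with title/abstract/content comes out looking roughly like this (field values are illustrative):

```json
{
  "title": "Paper A",
  "abstract": "...",
  "content": "...",
  "findings": ["Finding 1", "Finding 2"],
  "methodology": "randomized controlled trial",
  "confidence": "high"
}
```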

Reduce Operator — Aggregate by Key

- name: synthesize_by_field
  type: reduce
  reduce_key: research_field
  prompt: |
    You are analyzing {{ inputs | length }} papers in {{ reduce_key }}.

    Papers:
    {% for paper in inputs %}
    - {{ paper.title }}: {{ paper.findings | join(', ') }}
    {% endfor %}

    Write a synthesis of the current state of research.
  output:
    schema:
      synthesis: string
      key_trends: "list[string]"
      open_questions: "list[string]"

Resolve Operator — Entity Resolution

- name: deduplicate_authors
  type: resolve
  comparison_prompt: |
    Are these the same person?
    Author A: {{ input1.author_name }} from {{ input1.institution }}
    Author B: {{ input2.author_name }} from {{ input2.institution }}
  resolution_prompt: |
    Merge these author records into one canonical record.
  output:
    schema:
      canonical_name: string
      institution: string
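Naive resolve compares every pair of records, which is O(n²) LLM calls. A sketch of reducing that cost with blocking, assuming your DocETL version supports options like `blocking_keys` and `blocking_threshold` (an embedding-similarity cutoff) — check the docs for your version before relying on these keys:

```yaml
- name: deduplicate_authors
  type: resolve
  blocking_keys:            # assumption: only pairs similar on these keys
    - author_name           # are sent to the LLM for comparison
  blocking_threshold: 0.8   # assumption: embedding-similarity cutoff
  comparison_prompt: |
    Are these the same person?
    Author A: {{ input1.author_name }} from {{ input1.institution }}
    Author B: {{ input2.author_name }} from {{ input2.institution }}
  resolution_prompt: |
    Merge these author records into one canonical record.
  output:
    schema:
      canonical_name: string
      institution: string
```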

Pipeline Optimizer

DocETL includes an automatic optimizer that rewrites your pipeline:

docetl optimize pipeline.yaml

The optimizer can:

  • Add gleaning (iterative refinement) to improve map quality
  • Insert chunking for long documents that exceed context limits
  • Add resolve steps to handle entity inconsistencies
  • Parallelize independent operations for speed
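Gleaning, which the optimizer can add automatically, can also be sketched by hand: the operation's output is checked against a validation prompt and regenerated up to a fixed number of rounds. This assumes a `gleaning` block with `num_rounds` and `validation_prompt` keys, which may differ in your DocETL version:

```yaml
- name: summarize
  type: map
  prompt: |
    Summarize this paper in 3 sentences: {{ input.content }}
  gleaning:                  # assumption: manual gleaning configuration
    num_rounds: 2            # re-ask the LLM at most twice
    validation_prompt: |
      Does the summary mention the paper's main result and its method?
  output:
    schema:
      summary: string
```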

DocWrangler — Interactive UI

The web playground at docetl.org/playground provides:

  • Visual pipeline builder with drag-and-drop operators
  • Real-time output preview for each step
  • Prompt iteration and A/B testing
  • Export to YAML for production use

Real-World Applications

| Application | Pipeline Design |
|-------------|-----------------|
| Literature review | Map (summarize) → Reduce (synthesize by topic) → Map (generate insights) |
| Contract analysis | Map (extract clauses) → Filter (flag risky clauses) → Reduce (risk report) |
| Resume screening | Map (extract skills) → Resolve (normalize titles) → Filter (match requirements) |
| Patent analysis | Map (extract claims) → Reduce (cluster by technology) → Map (novelty assessment) |
| Survey analysis | Map (categorize responses) → Reduce (aggregate by theme) → Map (generate report) |

FAQ

Q: What is DocETL? A: DocETL is an open-source framework from UC Berkeley for building LLM-powered document processing pipelines using declarative YAML, with operators like map, reduce, resolve, and filter. 3,700+ GitHub stars, published at VLDB 2025.

Q: How is DocETL different from LangChain or LlamaIndex? A: DocETL is purpose-built for document ETL (Extract, Transform, Load) with declarative pipelines. LangChain/LlamaIndex are general-purpose LLM frameworks. DocETL excels at batch processing hundreds of documents with complex aggregation logic that would be tedious to code manually.

Q: Is DocETL free? A: Yes, fully open-source under MIT license. You bring your own LLM API keys.


Source & Thanks

Created by UC Berkeley EPIC Lab. Licensed under MIT.

docetl — ⭐ 3,700+

Thanks to the UC Berkeley EPIC Lab for advancing the science of LLM-powered document processing.
