Knowledge · Apr 2, 2026 · 3 min read

DocETL — LLM-Powered Document Processing Pipelines

Declarative YAML pipelines for LLM document analysis with map, reduce, and resolve operators. By UC Berkeley. 3.7K+ stars.

Introduction

DocETL is an open-source framework from UC Berkeley's EPIC Lab (3,700+ GitHub stars) for building LLM-powered document processing pipelines. You define document analysis workflows declaratively in YAML, using operators such as map (process each document), reduce (aggregate groups), resolve (entity resolution), and filter. An accompanying interactive UI, DocWrangler, lets you build and test pipelines visually. DocETL is backed by peer-reviewed research (published at VLDB 2025) and includes an automatic optimizer that rewrites pipelines for better output quality.

Works with: OpenAI, Anthropic Claude, AWS Bedrock, any LiteLLM-compatible model. Best for researchers and data teams processing large document collections with LLMs. Setup time: under 5 minutes.


DocETL Pipeline Operators

Core Operators

Operator | Description                                            | Use Case
---------|--------------------------------------------------------|--------------------------------------
map      | Process each document independently                    | Summarize, extract entities, classify
reduce   | Aggregate multiple documents by key                    | Create overviews, merge findings
resolve  | Entity resolution across documents                     | Deduplicate authors, normalize names
filter   | Keep/remove documents by condition                     | Quality filtering, relevance check
unnest   | Flatten nested arrays                                  | Expand multi-value fields
split    | Break documents into chunks                            | Handle long documents
gather   | Add context from neighboring chunks to each split chunk | Preserve context in long documents
equijoin | Join two datasets by key                               | Combine data sources
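To make the split and gather pairing concrete, here is a minimal Python sketch of the idea, assuming gather is used to attach text from neighboring chunks to each chunk. It is a conceptual illustration only, not DocETL's implementation; `split_text`, `gather_context`, and the `previous` parameter are hypothetical names chosen for this example.

```python
def split_text(text: str, chunk_size: int) -> list[str]:
    """Break a document into chunks of at most chunk_size words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


def gather_context(chunks: list[str], previous: int = 1) -> list[dict]:
    """Attach the text of up to `previous` preceding chunks to each chunk."""
    gathered = []
    for i, chunk in enumerate(chunks):
        context = " ".join(chunks[max(0, i - previous):i])
        gathered.append({"chunk": chunk, "context": context})
    return gathered


chunks = split_text("one two three four five six seven", chunk_size=3)
gathered = gather_context(chunks, previous=1)
```

Each chunk can then be sent to a map step together with its context, so that references to earlier parts of the document still resolve.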

Map Operator — Process Each Document

- name: extract_findings
  type: map
  prompt: |
    Extract the key findings from this paper:
    Title: {{ input.title }}
    Abstract: {{ input.abstract }}
    Full text: {{ input.content }}

    Return structured findings.
  output:
    schema:
      findings: "list[string]"
      methodology: string
      confidence: string
  model: gpt-4o
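
Conceptually, a map operation runs the prompt once per document and merges the fields named in the output schema back into that document. The Python sketch below illustrates only that data flow. It is not DocETL's implementation, and `fake_llm` is a hypothetical stand-in for a real model call.

```python
from typing import Callable


def fake_llm(doc: dict) -> dict:
    """Hypothetical stand-in for a model call; returns the schema fields."""
    first_sentence = doc["abstract"].split(".")[0].strip()
    return {
        "findings": [first_sentence],
        "methodology": "unknown",
        "confidence": "low",
    }


def run_map(docs: list[dict], llm: Callable[[dict], dict]) -> list[dict]:
    """Apply llm to each document independently and merge its output fields."""
    return [{**doc, **llm(doc)} for doc in docs]


papers = [
    {"title": "Paper A", "abstract": "LLMs miss details in long inputs. More text."},
    {"title": "Paper B", "abstract": "Chunking improves recall. More text."},
]
mapped = run_map(papers, fake_llm)
```

Because each document is processed on its own, the number of records is unchanged and the calls could in principle run in parallel.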

Reduce Operator — Aggregate by Key

- name: synthesize_by_field
  type: reduce
  reduce_key: research_field
  prompt: |
    You are analyzing {{ inputs | length }} papers in {{ reduce_key }}.

    Papers:
    {% for paper in inputs %}
    - {{ paper.title }}: {{ paper.findings | join(', ') }}
    {% endfor %}

    Write a synthesis of the current state of research.
  output:
    schema:
      synthesis: string
      key_trends: "list[string]"
      open_questions: "list[string]"
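
The grouping behavior behind reduce can be sketched in a few lines of Python: records are bucketed by the reduce key, and the prompt runs once per bucket. This is a conceptual sketch under that assumption, not DocETL's code; `run_reduce` and `fake_synthesis` are hypothetical names, and `fake_synthesis` stands in for the model call.

```python
from collections import defaultdict
from typing import Callable


def fake_synthesis(key: str, members: list[dict]) -> dict:
    """Hypothetical stand-in for the model call made once per group."""
    titles = ", ".join(m["title"] for m in members)
    return {"synthesis": f"{len(members)} papers in {key}: {titles}"}


def run_reduce(docs: list[dict], reduce_key: str,
               summarize: Callable[[str, list[dict]], dict]) -> list[dict]:
    """Group documents by reduce_key and produce one output record per group."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for doc in docs:
        groups[doc[reduce_key]].append(doc)
    return [
        {reduce_key: key, **summarize(key, members)}
        for key, members in groups.items()
    ]


papers = [
    {"title": "A", "research_field": "databases"},
    {"title": "B", "research_field": "nlp"},
    {"title": "C", "research_field": "databases"},
]
reduced = run_reduce(papers, "research_field", fake_synthesis)
```

Three input records become two output records, one per distinct `research_field`.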

Resolve Operator — Entity Resolution

- name: deduplicate_authors
  type: resolve
  comparison_prompt: |
    Are these the same person?
    Author A: {{ input1.author_name }} from {{ input1.institution }}
    Author B: {{ input2.author_name }} from {{ input2.institution }}
  resolution_prompt: |
    Merge these author records into one canonical record:
    {% for entry in inputs %}
    - {{ entry.author_name }} from {{ entry.institution }}
    {% endfor %}
  output:
    schema:
      canonical_name: string
      institution: string
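
Entity resolution of this kind has two phases: pairwise comparison to decide which records match, then a merge that picks one canonical value per cluster. The Python sketch below illustrates that flow with a union-find structure. It assumes every input record is kept and annotated with the canonical fields. `same_author` and `merge_records` are hypothetical rule-based stand-ins for the two prompts, and none of this is DocETL's actual implementation.

```python
from itertools import combinations
from typing import Callable


def same_author(a: dict, b: dict) -> bool:
    """Stand-in for the comparison prompt: match on surname and institution."""
    def key(rec: dict) -> tuple[str, str]:
        return rec["author_name"].split()[-1].lower(), rec["institution"].lower()
    return key(a) == key(b)


def merge_records(group: list[dict]) -> dict:
    """Stand-in for the resolution prompt: keep the most complete name."""
    best = max(group, key=lambda r: len(r["author_name"]))
    return {"canonical_name": best["author_name"], "institution": best["institution"]}


def run_resolve(records: list[dict], is_match: Callable[[dict, dict], bool],
                merge: Callable[[list[dict]], dict]) -> list[dict]:
    """Cluster matching records, then annotate each with its canonical fields."""
    parent = list(range(len(records)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if is_match(records[i], records[j]):
            parent[find(j)] = find(i)

    clusters: dict[int, list[dict]] = {}
    for i, rec in enumerate(records):
        clusters.setdefault(find(i), []).append(rec)
    canon = {root: merge(group) for root, group in clusters.items()}
    return [{**rec, **canon[find(i)]} for i, rec in enumerate(records)]


authors = [
    {"author_name": "J. Smith", "institution": "Example University"},
    {"author_name": "Jane Smith", "institution": "Example University"},
    {"author_name": "Wei Chen", "institution": "Other Institute"},
]
resolved = run_resolve(authors, same_author, merge_records)
```

Comparing every pair is quadratic in the number of records, which is why entity resolution systems usually restrict comparisons to likely candidates before calling a model.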

Pipeline Optimizer

DocETL includes an automatic optimizer that rewrites your pipeline:

docetl build pipeline.yaml

The optimizer can:

  • Add gleaning (iterative refinement) to improve map quality
  • Insert chunking for long documents that exceed context limits
  • Add resolve steps to handle entity inconsistencies
  • Parallelize independent operations for speed
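
Gleaning, the first rewrite in the list above, is a generate, validate, refine loop. A minimal Python sketch of that loop follows. The `generate`, `validate`, and `refine` functions are hypothetical rule-based stand-ins for model calls, and `max_rounds` is an illustrative parameter name rather than a documented DocETL setting.

```python
from typing import Callable, Optional


def sentences(doc: dict) -> list[str]:
    return [s.strip() for s in doc["content"].split(".") if s.strip()]


def generate(doc: dict) -> list[str]:
    """Stand-in for the first model pass: finds only the first finding."""
    return [s for s in sentences(doc) if s.startswith("We find")][:1]


def validate(doc: dict, output: list[str]) -> Optional[list[str]]:
    """Stand-in for a validator: return missed findings, or None if complete."""
    missing = [s for s in sentences(doc)
               if s.startswith("We find") and s not in output]
    return missing or None


def refine(doc: dict, output: list[str], feedback: list[str]) -> list[str]:
    """Stand-in for a refinement pass: recovers one missed finding per round."""
    return output + feedback[:1]


def glean(doc: dict, generate: Callable, validate: Callable,
          refine: Callable, max_rounds: int = 2) -> list[str]:
    """Generate once, then validate and refine for at most max_rounds rounds."""
    output = generate(doc)
    for _ in range(max_rounds):
        feedback = validate(doc, output)
        if feedback is None:
            break
        output = refine(doc, output, feedback)
    return output


doc = {"content": "We find X. We find Y. We find Z. Unrelated note."}
one_round = glean(doc, generate, validate, refine, max_rounds=1)
two_rounds = glean(doc, generate, validate, refine, max_rounds=2)
```

Each extra round costs additional model calls, so the number of rounds trades cost against completeness.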

DocWrangler — Interactive UI

The web playground at docetl.org/playground provides:

  • Visual pipeline builder with drag-and-drop operators
  • Real-time output preview for each step
  • Prompt iteration and A/B testing
  • Export to YAML for production use

Real-World Applications

Application       | Pipeline Design
------------------|--------------------------------------------------------------------------------
Literature review | Map (summarize) → Reduce (synthesize by topic) → Map (generate insights)
Contract analysis | Map (extract clauses) → Filter (flag risky clauses) → Reduce (risk report)
Resume screening  | Map (extract skills) → Resolve (normalize titles) → Filter (match requirements)
Patent analysis   | Map (extract claims) → Reduce (cluster by technology) → Map (novelty assessment)
Survey analysis   | Map (categorize responses) → Reduce (aggregate by theme) → Map (generate report)
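
As an illustration of how such a design composes, the Python sketch below chains the contract analysis steps (Map → Filter → Reduce) using simple rule-based stand-ins for the model calls. All function names, the sample contracts, and the "unlimited liability" rule are hypothetical; in a real pipeline each step would be a prompt-driven operator.

```python
def extract_clauses(contract: dict) -> dict:
    """Map step stand-in: split the contract text into clauses."""
    clauses = [c.strip() for c in contract["text"].split(";") if c.strip()]
    return {**contract, "clauses": clauses}


def has_risky_clause(contract: dict) -> bool:
    """Filter step stand-in: keep contracts with at least one risky clause."""
    return any("unlimited liability" in c.lower() for c in contract["clauses"])


def risk_report(contracts: list[dict]) -> dict:
    """Reduce step stand-in: aggregate flagged contracts into one report."""
    return {"risky_contracts": [c["id"] for c in contracts], "count": len(contracts)}


contracts = [
    {"id": "C1", "text": "Payment in 30 days; Unlimited liability for vendor"},
    {"id": "C2", "text": "Payment in 60 days; Liability capped at fees"},
]
mapped = [extract_clauses(c) for c in contracts]
flagged = [c for c in mapped if has_risky_clause(c)]
report = risk_report(flagged)
```

The output of each step is the input of the next, which is the same shape a declarative pipeline expresses as an ordered list of operations.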

FAQ

Q: What is DocETL? A: DocETL is an open-source framework from UC Berkeley for building LLM-powered document processing pipelines using declarative YAML, with operators like map, reduce, resolve, and filter. It has 3,700+ GitHub stars, and its research paper was published at VLDB 2025.

Q: How is DocETL different from LangChain or LlamaIndex? A: DocETL is purpose-built for document ETL (Extract, Transform, Load) with declarative pipelines. LangChain/LlamaIndex are general-purpose LLM frameworks. DocETL excels at batch processing hundreds of documents with complex aggregation logic that would be tedious to code manually.

Q: Is DocETL free? A: Yes. It is fully open source under the MIT license. You bring your own LLM API keys.



Source and Acknowledgments

Created by UC Berkeley EPIC Lab. Licensed under MIT.

Repository: docetl (3,700+ GitHub stars)

Thanks to the UC Berkeley EPIC Lab for advancing the science of LLM-powered document processing.
