Knowledge · April 2, 2026 · 1 min read

DocETL — LLM-Powered Document Processing Pipelines

Declarative YAML pipelines for LLM document analysis with map, reduce, and resolve operators. By UC Berkeley. 3.7K+ stars.

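Agent ready

Review before installing

This asset needs review first. The copied instructions ask the agent to do a dry run, list what it will write, and continue only after confirmation.

Needs Confirmation · 64/100 · Policy: confirmation required
Agent entry point: any MCP/CLI agent
Type: Knowledge
Install: Single
Trust level: Established
Entry file: docetl.md
Review-first command:
npx -y tokrepo@latest install ef81583e-45e5-4134-b25b-04e486ae2d06 --target codex

Do a dry run first and confirm the listed writes before running this command.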

TL;DR
DocETL lets you build LLM-powered document processing pipelines using declarative YAML with map, reduce, and resolve operators.
§01

What it is

DocETL is a Python framework for building LLM-powered document processing pipelines using declarative YAML configuration. Developed at UC Berkeley, it provides operators like map, reduce, resolve, filter, and split that chain together into data processing workflows. Each operator can call LLMs for tasks like summarization, extraction, classification, and entity resolution.

DocETL targets researchers, data engineers, and developers who need to process large document collections with LLMs. Instead of writing custom Python scripts for each processing step, you define pipelines in YAML and DocETL handles execution, caching, and error recovery.

§02

How it saves time or tokens

DocETL abstracts the boilerplate of LLM document processing. The map operator processes each document independently (parallelizable), reduce aggregates results, and resolve handles entity deduplication. Built-in caching avoids reprocessing unchanged documents. The declarative approach means you focus on what to extract rather than how to orchestrate API calls.
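
To make the three operators concrete, here is a minimal plain-Python sketch of their semantics. It does not use DocETL: the function names `map_op`, `resolve_op`, and `reduce_op` are hypothetical, and `fake_classify` is a stand-in for an LLM call.

```python
from collections import defaultdict


def map_op(docs, fn):
    """Apply fn to every document independently and merge its output into the document."""
    return [{**doc, **fn(doc)} for doc in docs]


def resolve_op(docs, field, same_entity):
    """Rewrite `field` so values judged to be the same entity share one canonical form."""
    canonical = []
    resolved = []
    for doc in docs:
        match = next((c for c in canonical if same_entity(c, doc[field])), None)
        if match is None:
            match = doc[field]
            canonical.append(match)
        resolved.append({**doc, field: match})
    return resolved


def reduce_op(docs, key, fn):
    """Group documents by `key` and produce one output record per group."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[key]].append(doc)
    return [{key: value, **fn(group)} for value, group in groups.items()]


def fake_classify(doc):
    """Stand-in for an LLM call; the inconsistent casing imitates label drift."""
    text = doc["text"]
    if "language" not in text.lower():
        return {"category": "Databases"}
    return {"category": "NLP" if text.startswith("Large") else "nlp"}


docs = [
    {"text": "Large language models for parsing"},
    {"text": "Language model evaluation"},
    {"text": "Query optimization in databases"},
]

classified = map_op(docs, fake_classify)
resolved = resolve_op(classified, "category", lambda a, b: a.lower() == b.lower())
summary = reduce_op(resolved, "category", lambda group: {"count": len(group)})
print(summary)
```

Each `map_op` call touches one document only, which is why that stage can run in parallel. `reduce_op` depends on consistent keys, so a resolve step is placed before it here.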

§03

How to use

  1. Install DocETL: pip install docetl.
  2. Create a pipeline YAML file defining your datasets, operations, and output.
  3. Run the pipeline: docetl run pipeline.yaml.
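
Step 2 needs an input file for the dataset. The sketch below writes one, assuming the dataset is a JSON array of objects and that prompts read a `text` field (as `{{ input.text }}` does in the example). The helper name `write_dataset`, the sample records, and the temp-directory location are illustrative only.

```python
import json
import tempfile
from pathlib import Path


def write_dataset(records, path):
    """Write records as a JSON array of objects, one object per document."""
    for i, record in enumerate(records):
        text = record.get("text")
        if not isinstance(text, str) or not text.strip():
            raise ValueError(f"record {i} has no usable 'text' field")
    Path(path).write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return len(records)


# Placeholder records; real ones would carry the full document text.
papers = [
    {"title": "Paper A", "text": "Full text of the first paper."},
    {"title": "Paper B", "text": "Full text of the second paper."},
]

out_path = Path(tempfile.gettempdir()) / "papers.json"
count = write_dataset(papers, out_path)
print(f"wrote {count} records to {out_path}")
```

Checking for the `text` field up front is cheaper than discovering an empty prompt after the LLM calls have been paid for.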
§04

Example

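# pipeline.yaml
# Model used by every operation unless one overrides it (example value; pick your own).
default_model: gpt-4o-mini

datasets:
  papers:
    type: file
    path: 'papers.json'  # JSON array of objects, each with a "text" field

operations:
  # Map: one LLM call per paper, adds a "summary" field.
  - name: summarize
    type: map
    prompt: |
      Summarize the following paper in 3 sentences:
      {{ input.text }}
    output:
      schema:
        summary: string

  # Map: one LLM call per paper, adds a "category" field.
  - name: categorize
    type: map
    prompt: |
      Classify this paper into one category:
      {{ input.summary }}
    output:
      schema:
        category: string

  # Reduce: one LLM call per distinct category.
  - name: aggregate
    type: reduce
    reduce_key: category
    prompt: |
      Summarize these papers in the {{ inputs[0].category }} category:
      {% for paper in inputs %}
      - {{ paper.summary }}
      {% endfor %}
    output:
      schema:
        category_summary: string

pipeline:
  steps:
    - name: analyze_papers
      input: papers
      operations: [summarize, categorize, aggregate]
  output:
    type: file
    path: 'results.json'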
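
The pipeline section refers to datasets and operations by name, so a typo there is easy to make. The following is a rough pre-flight check on the parsed configuration, written as a sketch rather than a replacement for DocETL's own validation. It takes a plain dictionary (for instance what a YAML parser would return for the file above); `check_pipeline` is a hypothetical helper, not part of DocETL.

```python
def check_pipeline(config):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    datasets = config.get("datasets", {})
    operations = {op.get("name"): op for op in config.get("operations", [])}

    for name, op in operations.items():
        if "type" not in op:
            problems.append(f"operation {name!r} has no type")
        if op.get("type") == "reduce" and "reduce_key" not in op:
            problems.append(f"reduce operation {name!r} has no reduce_key")

    for step in config.get("pipeline", {}).get("steps", []):
        if step.get("input") not in datasets:
            problems.append(f"step input {step.get('input')!r} is not a defined dataset")
        for op_name in step.get("operations", []):
            if op_name not in operations:
                problems.append(f"step references undefined operation {op_name!r}")
    return problems


# Same shape as the example, with a deliberate typo in the step list.
config = {
    "datasets": {"papers": {"type": "file", "path": "papers.json"}},
    "operations": [
        {"name": "summarize", "type": "map"},
        {"name": "categorize", "type": "map"},
        {"name": "aggregate", "type": "reduce", "reduce_key": "category"},
    ],
    "pipeline": {
        "steps": [
            {"input": "papers", "operations": ["summarize", "catgorize", "aggregate"]}
        ]
    },
}

for problem in check_pipeline(config):
    print(problem)
```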
§05

Common pitfalls

  • LLM costs scale with document collection size. A map operation over 10,000 documents means 10,000 LLM calls. Estimate costs before running on large collections.
  • The resolve operator for entity deduplication requires careful prompt engineering. Ambiguous entity descriptions produce poor deduplication results.
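  • A file that parses as valid YAML is not necessarily a valid pipeline. A misspelled operation name or a prompt that references a missing field may only surface once that step runs. Validate the pipeline structure, ideally on a small sample, before running on expensive datasets.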
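
The cost pitfall above can be turned into a back-of-the-envelope estimate before a run. This is a sketch only: the function `estimate_map_cost` is hypothetical, and the token counts and per-million-token prices are placeholders, not real provider rates.

```python
def estimate_map_cost(
    num_docs,
    avg_input_tokens,
    avg_output_tokens,
    usd_per_million_input,
    usd_per_million_output,
):
    """Estimate calls and cost for one map operation (one LLM call per document)."""
    calls = num_docs
    input_cost = calls * avg_input_tokens * usd_per_million_input / 1_000_000
    output_cost = calls * avg_output_tokens * usd_per_million_output / 1_000_000
    return calls, input_cost + output_cost


# Placeholder numbers: 10,000 documents, about 2,000 prompt tokens and 150 output tokens each.
calls, cost = estimate_map_cost(
    num_docs=10_000,
    avg_input_tokens=2_000,
    avg_output_tokens=150,
    usd_per_million_input=0.50,   # assumed price, replace with your provider's rate
    usd_per_million_output=1.50,  # assumed price, replace with your provider's rate
)
print(f"{calls} calls, about ${cost:.2f}")
```

A pipeline with several map operations repeats this cost once per operation, and retries add to it.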

FAQ

What operators does DocETL provide?

DocETL provides map (process each document), reduce (aggregate groups), resolve (entity deduplication), filter (keep/remove documents), split (chunk documents), gather (collect context), and unnest (flatten nested results).

Which LLM providers does DocETL support?

DocETL routes model calls through LiteLLM, so it can use OpenAI, Anthropic, and other hosted providers, as well as local models served through Ollama or vLLM. Set the model name in your pipeline YAML and supply API keys through environment variables.

Can DocETL cache results?

Yes. DocETL caches operation results so re-running a pipeline skips already-processed documents. This is useful for iterating on later pipeline stages without reprocessing earlier ones.
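
The general idea behind this kind of caching can be sketched with a content-keyed store. This illustrates the technique and is not DocETL's actual cache: the class `OperationCache` and its key scheme (a SHA-256 hash of the operation config plus the document) are assumptions made for the example.

```python
import hashlib
import json


class OperationCache:
    """Cache results keyed by a hash of the operation config and the input document."""

    def __init__(self):
        self._store = {}
        self.misses = 0

    @staticmethod
    def _key(op_config, doc):
        payload = json.dumps({"op": op_config, "doc": doc}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def run(self, op_config, doc, fn):
        key = self._key(op_config, doc)
        if key not in self._store:
            self.misses += 1  # only a miss triggers the (expensive) call
            self._store[key] = fn(doc)
        return self._store[key]


def fake_summarize(doc):
    """Stand-in for an LLM call."""
    return {"summary": doc["text"][:20]}


cache = OperationCache()
op = {"name": "summarize", "prompt": "Summarize: {{ input.text }}"}
docs = [{"text": "first document"}, {"text": "second document"}]

first_run = [cache.run(op, d, fake_summarize) for d in docs]
second_run = [cache.run(op, d, fake_summarize) for d in docs]  # served from cache
misses_after_rerun = cache.misses

# Changing the prompt changes the key, so the documents are processed again.
changed = {**op, "prompt": "Summarize briefly: {{ input.text }}"}
third_run = [cache.run(changed, d, fake_summarize) for d in docs]
```

Including the operation config in the key means that editing a prompt invalidates only that operation's results, while unchanged earlier stages can still be reused.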

How does DocETL handle errors?

DocETL retries failed LLM calls with configurable retry counts and backoff. Failed documents can be logged and skipped rather than stopping the entire pipeline.
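
Retry with backoff plus skip-on-failure can be sketched in a few lines. This is a generic illustration, not DocETL's implementation: `call_with_retries`, `run_all`, and their parameter names are hypothetical, and `flaky` simulates an unreliable LLM call.

```python
import time


def call_with_retries(fn, doc, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(doc), retrying on any exception with exponentially growing delays."""
    for attempt in range(max_retries + 1):
        try:
            return fn(doc)
        except Exception:
            if attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)


def run_all(docs, fn, **retry_kwargs):
    """Process every document; collect failures instead of aborting the whole run."""
    results, failed = [], []
    for doc in docs:
        try:
            results.append(call_with_retries(fn, doc, **retry_kwargs))
        except Exception as exc:
            failed.append((doc, repr(exc)))
    return results, failed


attempts = {}


def flaky(doc):
    """Simulated call: 'bad' always fails, everything else fails once and then succeeds."""
    attempts[doc["id"]] = attempts.get(doc["id"], 0) + 1
    if doc["id"] == "bad":
        raise RuntimeError("always fails")
    if attempts[doc["id"]] == 1:
        raise RuntimeError("transient")
    return {"id": doc["id"], "ok": True}


delays = []  # record delays instead of actually sleeping
results, failed = run_all(
    [{"id": "a"}, {"id": "bad"}],
    flaky,
    max_retries=2,
    base_delay=0.5,
    sleep=delays.append,
)
print(results, failed, delays)
```

Passing `sleep` as a parameter keeps the example fast to run; real use would leave the default `time.sleep`.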

Is DocETL suitable for production use?

DocETL is designed for research and batch processing workflows. For production real-time document processing, you may need additional infrastructure for scaling and monitoring. DocETL excels at exploratory analysis and batch ETL.

Sources and acknowledgments

Created by UC Berkeley EPIC Lab. Licensed under MIT.

docetl — ⭐ 3,700+

Thanks to the UC Berkeley EPIC Lab for advancing the science of LLM-powered document processing.
