DocETL — LLM-Powered Document Processing Pipelines
Declarative YAML pipelines for LLM document analysis with map, reduce, and resolve operators. By UC Berkeley. 3.7K+ stars.
What it is
DocETL is a Python framework for building LLM-powered document processing pipelines using declarative YAML configuration. Developed at UC Berkeley, it provides operators like map, reduce, resolve, filter, and split that chain together into data processing workflows. Each operator can call LLMs for tasks like summarization, extraction, classification, and entity resolution.
DocETL targets researchers, data engineers, and developers who need to process large document collections with LLMs. Instead of writing custom Python scripts for each processing step, you define pipelines in YAML and DocETL handles execution, caching, and error recovery.
How it saves time or tokens
DocETL abstracts the boilerplate of LLM document processing. The map operator processes each document independently (parallelizable), reduce aggregates results, and resolve handles entity deduplication. Built-in caching avoids reprocessing unchanged documents. The declarative approach means you focus on what to extract rather than how to orchestrate API calls.
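For instance, keeping only relevant documents is a few lines of declarative YAML rather than a custom script. A minimal sketch of a filter operation (the operation and field names here, such as `keep_ml_papers` and `is_ml_paper`, are illustrative; DocETL's filter expects an output schema with a single boolean field):

```yaml
- name: keep_ml_papers
  type: filter
  prompt: |
    Is the following paper about machine learning? Answer true or false.
    {{ input.text }}
  output:
    schema:
      is_ml_paper: boolean
```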
How to use
- Install DocETL: `pip install docetl`
- Create a pipeline YAML file defining your datasets, operations, and output.
- Run the pipeline: `docetl run pipeline.yaml`
Example
# pipeline.yaml
datasets:
  papers:
    type: file
    path: 'papers.json'

operations:
  - name: summarize
    type: map
    prompt: |
      Summarize the following paper in 3 sentences:
      {{ input.text }}
    output:
      schema:
        summary: string

  - name: categorize
    type: map
    prompt: |
      Classify this paper into one category:
      {{ input.summary }}
    output:
      schema:
        category: string

  - name: aggregate
    type: reduce
    reduce_key: category
    prompt: |
      Summarize these papers in the {{ reduce_key }} category:
      {% for paper in inputs %}
      - {{ paper.summary }}
      {% endfor %}
    output:
      schema:
        category_summary: string

pipeline:
  steps:
    - input: papers
      operations: [summarize, categorize, aggregate]
  output:
    type: file
    path: 'results.json'
Related on TokRepo
- Document AI Tools — Document processing and analysis tools
- RAG Tools — Retrieval and document understanding
Common pitfalls
- LLM costs scale with document collection size. A map operation over 10,000 documents means 10,000 LLM calls. Estimate costs before running on large collections.
- The resolve operator for entity deduplication requires careful prompt engineering. Ambiguous entity descriptions produce poor deduplication results.
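Resolve operations take two prompts: one to decide whether a pair of records match, and one to merge the matched group into a canonical value. A sketch of the shape (key names like `comparison_prompt` and `resolution_prompt` follow DocETL's documented pattern, but the field name `author` and the operation name are illustrative; check the docs for your version):

```yaml
- name: dedupe_authors
  type: resolve
  comparison_prompt: |
    Do these two records refer to the same author?
    Record 1: {{ input1.author }}
    Record 2: {{ input2.author }}
  resolution_prompt: |
    Merge the following author records into one canonical name:
    {% for entry in inputs %}
    - {{ entry.author }}
    {% endfor %}
  output:
    schema:
      author: string
```

The more specific and unambiguous the comparison prompt (full names, affiliations, identifying context), the better the deduplication.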
- Pipeline YAML syntax errors fail at runtime, not at parse time. Validate your YAML structure before running on expensive datasets.
Frequently Asked Questions
What operators does DocETL provide?
DocETL provides map (process each document), reduce (aggregate groups), resolve (entity deduplication), filter (keep/remove documents), split (chunk documents), gather (collect context), and unnest (flatten nested results).
Which LLM providers does DocETL support?
DocETL works with any OpenAI-compatible API, including OpenAI, Anthropic (via proxy), and local models through Ollama or vLLM. Configure the provider in your pipeline YAML.
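Model selection lives in the YAML as well. A sketch assuming LiteLLM-style model strings and DocETL's `default_model` / per-operation `model` keys (verify both against your version's docs):

```yaml
default_model: gpt-4o-mini      # pipeline-wide default (LiteLLM model string)

operations:
  - name: summarize
    type: map
    model: ollama/llama3        # per-operation override, e.g. a local Ollama model
    prompt: |
      Summarize in 3 sentences: {{ input.text }}
    output:
      schema:
        summary: string
```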
Does DocETL cache results?
Yes. DocETL caches operation results so re-running a pipeline skips already-processed documents. This is useful for iterating on later pipeline stages without reprocessing earlier ones.
How does DocETL handle failed LLM calls?
DocETL retries failed LLM calls with configurable retry counts and backoff. Failed documents can be logged and skipped rather than stopping the entire pipeline.
Is DocETL suitable for production workloads?
DocETL is designed for research and batch processing workflows. For production real-time document processing, you may need additional infrastructure for scaling and monitoring. DocETL excels at exploratory analysis and batch ETL.
Citations (3)
- DocETL GitHub — DocETL provides declarative YAML pipelines for LLM document analysis
- DocETL Documentation — Map, reduce, and resolve operators for document processing
- UC Berkeley EPIC Lab — LLM-powered document processing research at UC Berkeley
Source & Thanks
Created by UC Berkeley EPIC Lab. Licensed under MIT.
docetl — ⭐ 3,700+
Thanks to the UC Berkeley EPIC Lab for advancing the science of LLM-powered document processing.
Related Assets
Conda — Cross-Platform Package and Environment Manager
Install, update, and manage packages and isolated environments for Python, R, C/C++, and hundreds of other languages from a single tool.
Sphinx — Python Documentation Generator
Generate professional documentation from reStructuredText and Markdown with cross-references, API autodoc, and multiple output formats.
Neutralinojs — Lightweight Cross-Platform Desktop Apps
Build desktop applications with HTML, CSS, and JavaScript using a tiny native runtime instead of bundling Chromium.