What is Unstructured — Document ETL for LLM Pipelines?

Extract clean data from PDFs, DOCX, HTML, images, and emails for RAG and LLM ingestion. 14K+ GitHub stars.

Is Unstructured — Document ETL for LLM Pipelines free to use?

Yes. Unstructured — Document ETL for LLM Pipelines is freely available on TokRepo. Check the Source & Thanks section on the asset page for the specific open-source license.

How do I install Unstructured — Document ETL for LLM Pipelines?

Visit the asset page on TokRepo and click "Copy for agent" to get the installation instructions. Most assets can be installed with a single command.

Unstructured — Document ETL for LLM Pipelines

Unstructured is an open-source document ETL library with 14,400+ GitHub stars that converts complex documents into clean, structured data ready for LLM consumption. It handles PDFs, Word docs, PowerPoint, Excel, HTML, emails, images, and other formats (20+ in total), extracting text, tables, images, and metadata while preserving document structure. It is commonly used as the preprocessing step in RAG pipelines, where it turns raw documents into typed elements that can be chunked and embedded. It integrates with LangChain, LlamaIndex, Haystack, and other RAG frameworks.

Works with: LangChain, LlamaIndex, Haystack, any RAG framework, any vector database. Best for teams building document-heavy AI applications. Setup time: under 3 minutes.

Supported Formats

Format	Extension	Features
PDF	.pdf	OCR, table extraction, image extraction
Word	.docx	Full formatting, tables, images
PowerPoint	.pptx	Slides, notes, images
Excel	.xlsx	Sheets, formulas, charts
HTML	.html	Clean text extraction, link preservation
Email	.eml, .msg	Body, attachments, metadata
Markdown	.md	Headers, code blocks, links
Images	.png, .jpg	OCR text extraction
EPUB	.epub	Chapters, metadata
RST	.rst	ReStructuredText
CSV/TSV	.csv, .tsv	Tabular data

Element Types

from unstructured.partition.auto import partition

elements = partition("complex_report.pdf")

# Elements are typed:
# Title          - Section headers
# NarrativeText  - Body paragraphs
# ListItem       - Bullet points
# Table          - Tabular data (as HTML or text)
# Image          - Extracted images with descriptions
# FigureCaption  - Image captions
# Header/Footer  - Page headers/footers
# PageBreak      - Page boundaries
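Because every element carries a type, a common first step is to bucket the partition output by category. The sketch below is a minimal, library-free illustration: `group_by_category` is a hypothetical helper (not part of Unstructured), and it assumes each element exposes `category` and `text` attributes, falling back to the class name and `str()` when they are missing.

```python
from collections import defaultdict


def group_by_category(elements):
    """Map each element category (e.g. "Title") to the texts of its elements.

    Hypothetical helper: assumes elements expose `category` and `text`,
    as the objects returned by partition() are expected to.
    """
    groups = defaultdict(list)
    for el in elements:
        # Fall back to the class name / str() so stand-in objects also work.
        category = getattr(el, "category", type(el).__name__)
        groups[category].append(getattr(el, "text", str(el)))
    return dict(groups)
```

With the `elements` list from the example above, `group_by_category(elements).get("Title", [])` would give the section headers in document order.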

Chunking for RAG

from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title

elements = partition("document.pdf")

# Chunk by section headers (ideal for RAG)
chunks = chunk_by_title(
    elements,
    max_characters=1500,
    combine_text_under_n_chars=200,
)

for chunk in chunks:
    print(f"Chunk ({len(str(chunk))} chars): {str(chunk)[:80]}...")
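To make the idea behind title-based chunking concrete, here is a simplified, library-free sketch. It is not Unstructured's implementation: `chunk_sections` is a hypothetical function that works on plain `(category, text)` pairs, starts a new chunk at every `Title`, and closes the current chunk when adding more text would exceed `max_characters`. Unlike the real `chunk_by_title`, it does not split a single oversized text and has no equivalent of `combine_text_under_n_chars`.

```python
def chunk_sections(items, max_characters=1500):
    """Greedy section chunking over (category, text) pairs in document order.

    Simplified illustration only; not the library's algorithm.
    """
    chunks, current = [], ""
    for category, text in items:
        starts_section = category == "Title"
        # +2 accounts for the blank-line separator added when joining.
        too_long = len(current) + len(text) + 2 > max_characters
        if current and (starts_section or too_long):
            chunks.append(current)
            current = ""
        current = f"{current}\n\n{text}" if current else text
    if current:
        chunks.append(current)
    return chunks
```

Keeping each chunk inside one section means a retrieved chunk rarely mixes unrelated topics, which is the main reason header-based chunking suits RAG.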

LangChain Integration

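# Note: recent LangChain releases deprecate this loader in favor of
# UnstructuredLoader from the separate langchain-unstructured package.
from langchain_community.document_loaders import UnstructuredFileLoader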

loader = UnstructuredFileLoader("report.pdf", mode="elements")
docs = loader.load()

# Each element becomes a LangChain Document
for doc in docs:
    print(doc.page_content[:100])
    print(doc.metadata)  # {"source": "report.pdf", "category": "NarrativeText"}
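Since each Document carries its element type in `metadata["category"]`, page furniture such as headers and footers can be dropped before embedding. A minimal sketch, assuming only that each document has a `metadata` dict with a `category` key as shown above; `keep_categories` and its default category list are hypothetical choices, not part of LangChain or Unstructured.

```python
def keep_categories(docs, wanted=("NarrativeText", "ListItem", "Table")):
    """Return only the documents whose metadata category is in `wanted`.

    Hypothetical helper; documents without a category are dropped.
    """
    return [d for d in docs if d.metadata.get("category") in wanted]
```

For example, `body_docs = keep_categories(docs)` would keep body text, list items, and tables from the loader output while discarding everything else.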

Batch Processing

import os
from unstructured.partition.auto import partition

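os.makedirs("output", exist_ok=True)  # make sure the output directory exists

for filename in os.listdir("documents/"):
    path = os.path.join("documents", filename)
    if not os.path.isfile(path):
        continue  # skip subdirectories
    elements = partition(path)
    # Join elements with a blank line between them
    text = "\n\n".join(str(e) for e in elements)
    with open(os.path.join("output", f"{filename}.txt"), "w", encoding="utf-8") as f:
        f.write(text)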
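A real documents folder often mixes parseable files with others (archives, temp files, nested folders). The sketch below selects candidates by extension before partitioning. It assumes the extensions listed in the Supported Formats table are the ones you want to process; `SUPPORTED_EXTENSIONS` and `find_documents` are hypothetical names, not part of Unstructured, and the set is not the library's complete format list.

```python
from pathlib import Path

# Extensions taken from the Supported Formats table; adjust as needed.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".pptx", ".xlsx", ".html", ".eml", ".msg",
    ".md", ".png", ".jpg", ".epub", ".rst", ".csv", ".tsv",
}


def find_documents(root):
    """Return sorted paths of files under `root` with a supported extension."""
    root = Path(root)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

Each returned path can then be passed to `partition(str(path))`. Matching on the lowercased suffix means files such as `REPORT.PDF` are picked up as well.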

FAQ

Q: What is Unstructured? A: Unstructured is an open-source document ETL library with 14,400+ GitHub stars that extracts structured data from 20+ document formats (PDF, DOCX, HTML, images) for LLM and RAG pipelines.

Q: How is Unstructured different from MinerU or Docling? A: Unstructured covers a wider range of formats (20+) than MinerU, which focuses on PDFs. MinerU has stronger layout detection for complex PDFs, and Docling (IBM) is strong at table extraction. Unstructured is a good all-rounder for heterogeneous document collections.

Q: Is Unstructured free? A: Yes, the open-source library is free under Apache-2.0. Unstructured also offers a hosted API service with a free tier.