MCP Configs · Apr 2, 2026 · 2 min read

Unstructured — Document ETL for LLM Pipelines

Extract clean data from PDFs, DOCX, HTML, images, and emails for RAG and LLM ingestion. 14K+ GitHub stars.

Introduction

Unstructured is an open-source document ETL library with 14,400+ GitHub stars that converts complex documents into clean, structured data ready for LLM consumption. It handles PDFs, Word documents, PowerPoint, Excel, HTML, emails, images, and other formats (20+ in total), extracting text, tables, images, and metadata while preserving document structure. Commonly used as the preprocessing step in RAG pipelines, it bridges the gap between raw documents and AI-ready data, and it integrates with LangChain, LlamaIndex, Haystack, and other RAG frameworks.

Works with: LangChain, LlamaIndex, Haystack, any RAG framework, any vector database. Best for teams building document-heavy AI applications. Setup time: under 3 minutes.


Supported Formats

| Format | Extension | Features |
|---|---|---|
| PDF | .pdf | OCR, table extraction, image extraction |
| Word | .docx | Full formatting, tables, images |
| PowerPoint | .pptx | Slides, notes, images |
| Excel | .xlsx | Sheets, formulas, charts |
| HTML | .html | Clean text extraction, link preservation |
| Email | .eml, .msg | Body, attachments, metadata |
| Markdown | .md | Headers, code blocks, links |
| Images | .png, .jpg | OCR text extraction |
| EPUB | .epub | Chapters, metadata |
| RST | .rst | reStructuredText |
| CSV/TSV | .csv, .tsv | Tabular data |

Element Types

from unstructured.partition.auto import partition

elements = partition("complex_report.pdf")

# Elements are typed:
# Title          - Section headers
# NarrativeText  - Body paragraphs
# ListItem       - Bullet points
# Table          - Tabular data (as HTML or text)
# Image          - Extracted images with descriptions
# FigureCaption  - Image captions
# Header/Footer  - Page headers/footers
# PageBreak      - Page boundaries
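
Once elements are typed, a common first step is to see what a document contains. The sketch below counts elements per category and collects the section titles. It assumes each element exposes `category` and `text` attributes; `StubElement` and `summarize` are hypothetical names used so the example runs without the library installed.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class StubElement:
    """Hypothetical stand-in for an Unstructured element (not the real class)."""
    category: str
    text: str


def summarize(elements):
    """Count elements per category and collect the text of every Title."""
    counts = Counter(el.category for el in elements)
    titles = [el.text for el in elements if el.category == "Title"]
    return counts, titles


sample = [
    StubElement("Title", "Quarterly Report"),
    StubElement("NarrativeText", "Revenue grew in the third quarter."),
    StubElement("ListItem", "Costs fell."),
    StubElement("Title", "Outlook"),
    StubElement("NarrativeText", "We expect steady demand."),
]

counts, titles = summarize(sample)
print(dict(counts))
print(titles)
```

Under the same assumption about attribute names, `summarize` could be called on the list returned by `partition(...)` in place of `sample`.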

Chunking for RAG

from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title

elements = partition("document.pdf")

# Chunk by section headers (ideal for RAG)
chunks = chunk_by_title(
    elements,
    max_characters=1500,
    combine_text_under_n_chars=200,
)

for chunk in chunks:
    print(f"Chunk ({len(str(chunk))} chars): {str(chunk)[:80]}...")
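
To make the idea of chunking by section headers concrete, here is a simplified sketch in plain Python. It is not the library's implementation: `chunk_by_heading` is a hypothetical helper that starts a new chunk at every Title and also closes a chunk early when the next text would push it past `max_characters`. It does not model `combine_text_under_n_chars`, and it does not split a single text that is longer than the limit.

```python
def chunk_by_heading(items, max_characters=1500):
    """Group (category, text) pairs into chunks that start at each Title.

    A chunk is also closed early if adding the next text would exceed
    max_characters. Simplified illustration only.
    """
    chunks, current = [], []
    for category, text in items:
        # Length the chunk would have if this text were appended to it.
        candidate_len = len("\n\n".join(current + [text]))
        if current and (category == "Title" or candidate_len > max_characters):
            chunks.append("\n\n".join(current))
            current = []
        current.append(text)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


items = [
    ("Title", "Introduction"),
    ("NarrativeText", "Unstructured converts documents into elements."),
    ("Title", "Usage"),
    ("NarrativeText", "Call partition() on a file path."),
    ("ListItem", "Then chunk the result."),
]

for chunk in chunk_by_heading(items):
    print(repr(chunk))
```

Keeping each heading together with the text beneath it means a retrieved chunk usually carries its own context, which is the main reason this strategy suits RAG.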

LangChain Integration

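# Note: newer LangChain releases mark this loader as deprecated in favor of
# UnstructuredLoader from the langchain-unstructured package.
from langchain_community.document_loaders import UnstructuredFileLoader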

loader = UnstructuredFileLoader("report.pdf", mode="elements")
docs = loader.load()

# Each element becomes a LangChain Document
for doc in docs:
    print(doc.page_content[:100])
    print(doc.metadata)  # includes keys such as "source" and "category", e.g. "report.pdf" and "NarrativeText"
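
Because each document carries its element category in the metadata, layout noise can be dropped before embedding. The following is a minimal sketch under that assumption: `StubDocument` is a hypothetical stand-in with the same `page_content` and `metadata` shape as a LangChain document, and `drop_noise` and `NOISE_CATEGORIES` are illustrative names, not part of either library.

```python
from dataclasses import dataclass, field


@dataclass
class StubDocument:
    """Hypothetical stand-in mirroring the page_content/metadata shape."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Categories treated as layout noise in this sketch; adjust as needed.
NOISE_CATEGORIES = frozenset({"Header", "Footer", "PageBreak"})


def drop_noise(docs, noise=NOISE_CATEGORIES):
    """Keep documents whose metadata category is not in the noise set."""
    return [d for d in docs if d.metadata.get("category") not in noise]


docs = [
    StubDocument("ACME Corp, Confidential", {"category": "Header"}),
    StubDocument("Revenue grew in the third quarter.", {"category": "NarrativeText"}),
    StubDocument("Page 3 of 12", {"category": "Footer"}),
    StubDocument("Outlook", {"category": "Title"}),
]

kept = drop_noise(docs)
for doc in kept:
    print(doc.metadata["category"], "-", doc.page_content)
```

Documents with no `category` key are kept, since `metadata.get` returns `None` for them.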

Batch Processing

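import os
from unstructured.partition.auto import partition

os.makedirs("output", exist_ok=True)  # make sure the target directory exists

for filename in os.listdir("documents/"):
    path = os.path.join("documents", filename)
    if not os.path.isfile(path):
        continue  # skip subdirectories
    elements = partition(path)
    # Join the element texts, separated by blank lines
    text = "\n\n".join(str(e) for e in elements)
    with open(os.path.join("output", f"{filename}.txt"), "w", encoding="utf-8") as f:
        f.write(text)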
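
In a mixed folder it can help to select only files whose extension appears in the Supported Formats table before partitioning anything. The sketch below uses only the standard library; `SUPPORTED` and `iter_supported_files` are hypothetical names, and the extension set is copied from that table rather than from the library itself.

```python
import tempfile
from pathlib import Path

# Extensions copied from the Supported Formats table (an assumption,
# not a list exported by the library).
SUPPORTED = frozenset({
    ".pdf", ".docx", ".pptx", ".xlsx", ".html", ".eml", ".msg",
    ".md", ".png", ".jpg", ".epub", ".rst", ".csv", ".tsv",
})


def iter_supported_files(root, extensions=SUPPORTED):
    """Yield files under root (recursively) whose suffix is in extensions."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in extensions:
            yield path


# Demonstrate on a throwaway directory so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.pdf", "b.PDF", "notes.txt", "c.docx"):
        (Path(tmp) / name).touch()
    found = [p.name for p in iter_supported_files(tmp)]
    print(found)
```

Each yielded path could then be passed to `partition(str(path))`, ideally inside a `try`/`except` so that one unreadable file does not stop the whole batch.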

FAQ

Q: What is Unstructured? A: Unstructured is an open-source document ETL library with 14,400+ GitHub stars that extracts structured data from 20+ document formats (PDF, DOCX, HTML, images) for LLM and RAG pipelines.

Q: How is Unstructured different from MinerU or Docling? A: Unstructured supports the widest range of formats (20+ vs MinerU's PDF focus). MinerU has better layout detection for complex PDFs. Docling (IBM) excels at table extraction. Unstructured is the best all-rounder for heterogeneous document collections.

Q: Is Unstructured free? A: Yes, the open-source library is free under Apache-2.0. Unstructured also offers a hosted API service with a free tier.



Source and Acknowledgments

Created by Unstructured-IO. Licensed under Apache-2.0.

unstructured — ⭐ 14,400+
