MCP Configs · Apr 2, 2026 · 2 min read

Unstructured — Document ETL for LLM Pipelines

Extract clean data from PDFs, DOCX, HTML, images, and emails for RAG and LLM ingestion. 14K+ GitHub stars.

TokRepo Picks · Community
Quick Use

Use it first, then decide how deep to go


```bash
pip install unstructured
```

```python
from unstructured.partition.auto import partition

# Auto-detect and parse any document
elements = partition(filename="report.pdf")

for element in elements:
    print(f"{type(element).__name__}: {str(element)[:100]}")

# Output:
# Title: Annual Report 2025
# NarrativeText: Revenue grew 15% year-over-year...
# Table: | Quarter | Revenue | Growth |
# Image: [image description]
```

For more formats install extras:

```bash
pip install "unstructured[pdf,docx,pptx,xlsx,epub,md,html]"
```

---
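A quick way to judge a parse is to count how many elements of each type came back. The sketch below is an illustration only, not part of the Unstructured API: `summarize_elements` is a hypothetical helper, and the three stand-in classes merely mimic the class names printed in the Quick Use output so the block runs without the library installed. With real output, pass the `elements` list returned by `partition` instead of `sample`.

```python
from collections import Counter


def summarize_elements(elements):
    """Count elements by class name, mirroring type(element).__name__."""
    return Counter(type(element).__name__ for element in elements)


# Stand-in classes (hypothetical) so the sketch runs without `unstructured`.
class Title(str): ...
class NarrativeText(str): ...
class Table(str): ...


sample = [
    Title("Annual Report 2025"),
    NarrativeText("Revenue grew 15% year-over-year..."),
    NarrativeText("A second body paragraph."),
    Table("| Quarter | Revenue | Growth |"),
]

counts = summarize_elements(sample)
for name, count in counts.most_common():
    print(f"{name}: {count}")
```

A parse that comes back almost entirely as one catch-all type is usually a hint that the matching format extra is not installed or that the file needs OCR.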
Intro
Unstructured is an open-source document ETL library with 14,400+ GitHub stars that converts complex documents into clean, structured data ready for LLM consumption. It handles PDFs, Word documents, PowerPoint, Excel, HTML, emails, images, and 20+ other formats, extracting text, tables, images, and metadata while preserving document structure. As the preprocessing step of a RAG pipeline, Unstructured bridges the gap between raw documents and AI-ready data.

Works with: LangChain, LlamaIndex, Haystack, other RAG frameworks, and any vector database.

Best for: teams building document-heavy AI applications.

Setup time: under 3 minutes.

---
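To make "AI-ready data" concrete, the sketch below flattens parsed elements into plain dictionaries that could be embedded or written to a vector store. It assumes each element exposes `category`, `text`, and `metadata.page_number` attributes, which is how Unstructured's element objects are understood to behave. `to_records`, `KEEP`, and the `fake` stand-in are hypothetical names introduced here so the block runs without the library installed.

```python
from types import SimpleNamespace

# Categories worth embedding for retrieval (an assumption; tune per corpus).
KEEP = {"Title", "NarrativeText", "ListItem", "Table"}


def to_records(elements, keep=KEEP):
    """Turn element-like objects into dicts, dropping unwanted categories."""
    records = []
    for el in elements:
        text = (el.text or "").strip()
        if el.category not in keep or not text:
            continue  # Skip headers, footers, page breaks, and empty text
        records.append({
            "text": text,
            "category": el.category,
            "page": getattr(el.metadata, "page_number", None),
        })
    return records


def fake(category, text, page):
    """Hypothetical stand-in for an Unstructured element."""
    return SimpleNamespace(
        category=category,
        text=text,
        metadata=SimpleNamespace(page_number=page),
    )


elements = [
    fake("Header", "Confidential", 1),
    fake("Title", "Annual Report 2025", 1),
    fake("NarrativeText", "Revenue grew 15% year-over-year...", 1),
    fake("PageBreak", "", 1),
]

records = to_records(elements)
for record in records:
    print(record)
```

Keeping the category and page number next to the text lets a retriever filter by element type or cite the source page later.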
## Supported Formats

| Format | Extension | Features |
|--------|-----------|----------|
| **PDF** | .pdf | OCR, table extraction, image extraction |
| **Word** | .docx | Full formatting, tables, images |
| **PowerPoint** | .pptx | Slides, notes, images |
| **Excel** | .xlsx | Sheets, formulas, charts |
| **HTML** | .html | Clean text extraction, link preservation |
| **Email** | .eml, .msg | Body, attachments, metadata |
| **Markdown** | .md | Headers, code blocks, links |
| **Images** | .png, .jpg | OCR text extraction |
| **EPUB** | .epub | Chapters, metadata |
| **RST** | .rst | reStructuredText |
| **CSV/TSV** | .csv, .tsv | Tabular data |

### Element Types

```python
from unstructured.partition.auto import partition

elements = partition(filename="complex_report.pdf")

# Elements are typed:
# Title          - Section headers
# NarrativeText  - Body paragraphs
# ListItem       - Bullet points
# Table          - Tabular data (as HTML or text)
# Image          - Extracted images with descriptions
# FigureCaption  - Image captions
# Header/Footer  - Page headers/footers
# PageBreak      - Page boundaries
```

### Chunking for RAG

```python
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title

elements = partition(filename="document.pdf")

# Chunk by section headers (ideal for RAG)
chunks = chunk_by_title(
    elements,
    max_characters=1500,
    combine_text_under_n_chars=200,
)

for chunk in chunks:
    print(f"Chunk ({len(str(chunk))} chars): {str(chunk)[:80]}...")
```

### LangChain Integration

Newer LangChain releases mark this loader as deprecated in favor of the separate `langchain-unstructured` package, so check which one matches your installed version.

```python
from langchain_community.document_loaders import UnstructuredFileLoader

loader = UnstructuredFileLoader("report.pdf", mode="elements")
docs = loader.load()

# Each element becomes a LangChain Document
for doc in docs:
    print(doc.page_content[:100])
    print(doc.metadata)  # e.g. {"source": "report.pdf", "category": "NarrativeText", ...}
```

### Batch Processing

```python
import os
from unstructured.partition.auto import partition

os.makedirs("output", exist_ok=True)  # Make sure the output directory exists

for filename in os.listdir("documents/"):
    path = os.path.join("documents", filename)
    if not os.path.isfile(path):
        continue  # Skip subdirectories
    elements = partition(filename=path)
    text = "\n\n".join(str(e) for e in elements)
    with open(os.path.join("output", f"{filename}.txt"), "w", encoding="utf-8") as f:
        f.write(text)
```

---

## FAQ

**Q: What is Unstructured?**
A: Unstructured is an open-source document ETL library with 14,400+ GitHub stars that extracts structured data from 20+ document formats (PDF, DOCX, HTML, images) for LLM and RAG pipelines.

**Q: How is Unstructured different from MinerU or Docling?**
A: Unstructured supports the widest range of formats (20+, versus MinerU's PDF focus). MinerU has better layout detection for complex PDFs. Docling (IBM) excels at table extraction. Unstructured is the best all-rounder for heterogeneous document collections.

**Q: Is Unstructured free?**
A: Yes, the open-source library is free under Apache-2.0. Unstructured also offers a hosted API service with a free tier.

---
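## Chunking Sketch

To show what the `max_characters` and `combine_text_under_n_chars` arguments in the Chunking for RAG example control, here is a simplified pure-Python illustration. It is not Unstructured's implementation: `chunk_sections` and the `fake` stand-in are hypothetical, elements are assumed to expose `category` and `text`, and unlike the library this sketch never splits a single section that is itself longer than `max_characters`.

```python
from types import SimpleNamespace


def chunk_sections(elements, max_characters=1500, combine_text_under_n_chars=200):
    """Simplified title-based chunking: split at titles, merge tiny sections."""
    # 1. Start a new section at every Title element.
    sections, current = [], []
    for el in elements:
        if el.category == "Title" and current:
            sections.append(current)
            current = []
        current.append(el.text)
    if current:
        sections.append(current)

    # 2. Hold back sections shorter than the threshold so they merge with the
    #    next one, and close a chunk before it would exceed max_characters.
    chunks, buffer = [], ""
    for section in sections:
        text = "\n\n".join(section)
        candidate = f"{buffer}\n\n{text}" if buffer else text
        if buffer and len(candidate) > max_characters:
            chunks.append(buffer)
            buffer = text
        else:
            buffer = candidate
        if len(buffer) >= combine_text_under_n_chars:
            chunks.append(buffer)
            buffer = ""
    if buffer:
        chunks.append(buffer)
    return chunks


def fake(category, text):
    """Hypothetical stand-in for an Unstructured element."""
    return SimpleNamespace(category=category, text=text)


doc = [
    fake("Title", "Intro"),
    fake("NarrativeText", "Short."),
    fake("Title", "Details"),
    fake("NarrativeText", "Revenue grew 15% year-over-year..."),
    fake("Title", "Outlook"),
    fake("NarrativeText", "Growth is expected to continue."),
]

chunks = chunk_sections(doc, max_characters=80, combine_text_under_n_chars=20)
for chunk in chunks:
    print(f"Chunk ({len(chunk)} chars): {chunk[:40]!r}")
```

With these small thresholds the short "Intro" section is folded into the "Details" chunk instead of being emitted on its own, which is the effect `combine_text_under_n_chars` is meant to have on tiny sections.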
🙏

Source & Thanks

> Created by [Unstructured-IO](https://github.com/Unstructured-IO). Licensed under Apache-2.0.
>
> [unstructured](https://github.com/Unstructured-IO/unstructured) · ⭐ 14,400+
