MCP Configs · Apr 2, 2026 · 2 min read
Unstructured — Document ETL for LLM Pipelines
Extract clean data from PDFs, DOCX, HTML, images, and emails for RAG and LLM ingestion. 14K+ GitHub stars.
TokRepo Picks · Community
Quick Use
Use it first, then decide how deep to go
```bash
pip install unstructured
```
```python
from unstructured.partition.auto import partition
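# Auto-detect the file type and split the document into typed elements
elements = partition(filename="report.pdf")

# Print each element's class name and the first 100 characters of its text
for element in elements:
    print(f"{type(element).__name__}: {str(element)[:100]}")

# Example output (illustrative; actual elements depend on the document):
# Title: Annual Report 2025
# NarrativeText: Revenue grew 15% year-over-year...
# Table: | Quarter | Revenue | Growth |
# Image: [image description]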
```
For more formats, install the extras:
```bash
pip install "unstructured[pdf,docx,pptx,xlsx,epub,md,html]"
```
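Once a document is partitioned, the element types printed in the example above (`Title`, `NarrativeText`, `Table`, and so on) can be used to route content. The sketch below tallies types and keeps only the prose. It works on plain dicts shaped like the output of `element.to_dict()`, so the sample records and the helper names `count_types` and `select_text` are illustrative assumptions, not part of the library.

```python
from collections import Counter

# Hypothetical sample shaped like element.to_dict() output. In a real run this
# could be built with: records = [el.to_dict() for el in elements]
records = [
    {"type": "Title", "text": "Annual Report 2025"},
    {"type": "NarrativeText", "text": "Revenue grew 15% year-over-year."},
    {"type": "Table", "text": "Quarter Revenue Growth"},
    {"type": "NarrativeText", "text": "Costs were flat."},
]


def count_types(records):
    """Count how many elements of each type the document produced."""
    return Counter(record["type"] for record in records)


def select_text(records, wanted_types):
    """Return the text of every element whose type is in wanted_types."""
    return [r["text"] for r in records if r["type"] in wanted_types]


type_counts = count_types(records)
prose = select_text(records, {"NarrativeText"})

print(dict(type_counts))
print(prose)
```

Filtering on the type field like this is one simple way to send tables and narrative text down different processing paths before embedding.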
---
Intro
Unstructured is an open-source document ETL library (14,400+ GitHub stars) that converts complex documents into clean, structured data ready for LLM consumption. It handles PDFs, Word documents, PowerPoint, Excel, HTML, emails, images, and 20+ more formats, extracting text, tables, images, and metadata while preserving document structure. As the preprocessing step of a RAG pipeline, it bridges the gap between raw documents and AI-ready data, and it integrates with LangChain, LlamaIndex, and Haystack.
Works with: LangChain, LlamaIndex, Haystack, any RAG framework, any vector database. Best for teams building document-heavy AI applications. Setup time: under 3 minutes.
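To make "preserving document structure" concrete for RAG ingestion, here is a minimal sketch that groups elements into sections under their nearest preceding title and joins each section into one chunk of text. It assumes records shaped like `element.to_dict()` output. The sample data and the helpers `group_by_title` and `to_chunks` are hypothetical, and the library ships its own chunking utilities (for example `chunk_by_title`), which this sketch does not reproduce.

```python
# Hypothetical records shaped like element.to_dict() output.
records = [
    {"type": "Title", "text": "Annual Report 2025"},
    {"type": "NarrativeText", "text": "Revenue grew 15% year-over-year."},
    {"type": "Title", "text": "Outlook"},
    {"type": "NarrativeText", "text": "Growth is expected to continue."},
    {"type": "Table", "text": "Quarter Revenue Growth"},
]


def group_by_title(records):
    """Start a new section at every Title; attach other elements to it."""
    sections = []
    current = {"title": None, "body": []}
    for record in records:
        if record["type"] == "Title":
            # Close the previous section if it holds anything.
            if current["title"] is not None or current["body"]:
                sections.append(current)
            current = {"title": record["text"], "body": []}
        else:
            current["body"].append(record["text"])
    if current["title"] is not None or current["body"]:
        sections.append(current)
    return sections


def to_chunks(sections):
    """Join each section's title and body into one newline-separated chunk."""
    chunks = []
    for section in sections:
        parts = ([section["title"]] if section["title"] else []) + section["body"]
        chunks.append("\n".join(parts))
    return chunks


chunks = to_chunks(group_by_title(records))
for chunk in chunks:
    print(chunk, end="\n---\n")
```

Each resulting chunk keeps a heading together with the text beneath it, which tends to give an embedding model more context than isolated paragraphs would.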
---
Source & Thanks
> Created by [Unstructured-IO](https://github.com/Unstructured-IO). Licensed under Apache-2.0.
>
> [unstructured](https://github.com/Unstructured-IO/unstructured) — ⭐ 14,400+