What is Unstructured — Document ETL for LLM Pipelines?

Extract clean data from PDFs, DOCX, HTML, images, and emails for RAG and LLM ingestion. 14K+ GitHub stars.

Is Unstructured — Document ETL for LLM Pipelines free to use?

Yes. Unstructured — Document ETL for LLM Pipelines is freely available on TokRepo. Check the Source & Thanks section on the asset page for the specific open-source license.

How do I install Unstructured — Document ETL for LLM Pipelines?

Visit the asset page on TokRepo and click "Copy for agent" to get the installation instructions. Most assets can be installed with a single command.

Unstructured — Document ETL for LLM Pipelines

Unstructured is an open-source document ETL library with 14,400+ GitHub stars that converts complex documents into clean, structured data ready for LLM consumption. It handles PDFs, Word docs, PowerPoint, Excel, HTML, emails, images, and other formats (20+ in total), extracting text, tables, images, and metadata while preserving document structure. Used as the preprocessing backbone for RAG pipelines, it bridges the gap between raw documents and AI-ready data.

Works with: LangChain, LlamaIndex, Haystack, and other RAG frameworks and vector databases.

Best for: teams building document-heavy AI applications.

Setup time: under 3 minutes.

---
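To make "structured data" concrete, here is a minimal sketch of the kind of records a downstream RAG step might consume: each parsed element reduced to a type, its text, and its metadata. `StubElement`, `to_records`, and `group_by_type` are hypothetical names used for illustration only; the stub is a stand-in and not the library's own element class, which is richer.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class StubElement:
    """Hypothetical stand-in for a parsed element (not the library's class)."""
    category: str
    text: str
    metadata: dict = field(default_factory=dict)


def to_records(elements):
    """Flatten elements into plain dicts that are easy to serialize or index."""
    return [
        {"type": el.category, "text": el.text, "metadata": dict(el.metadata)}
        for el in elements
    ]


def group_by_type(records):
    """Collect element text per type, e.g. to route tables separately from prose."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["type"]].append(record["text"])
    return dict(grouped)


# Sample data invented for this sketch.
sample = [
    StubElement("Title", "Quarterly Report", {"page_number": 1}),
    StubElement("NarrativeText", "Revenue grew in Q3.", {"page_number": 1}),
    StubElement("Table", "Region | Revenue", {"page_number": 2}),
]

records = to_records(sample)
grouped = group_by_type(records)
print(grouped)
```

Keeping the element type next to the text is what lets a pipeline treat headings, body paragraphs, and tables differently when chunking or indexing.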

## Supported Formats

| Format | Extension | Features |
|--------|-----------|----------|
| **PDF** | .pdf | OCR, table extraction, image extraction |
| **Word** | .docx | Full formatting, tables, images |
| **PowerPoint** | .pptx | Slides, notes, images |
| **Excel** | .xlsx | Sheets, formulas, charts |
| **HTML** | .html | Clean text extraction, link preservation |
| **Email** | .eml, .msg | Body, attachments, metadata |
| **Markdown** | .md | Headers, code blocks, links |
| **Images** | .png, .jpg | OCR text extraction |
| **EPUB** | .epub | Chapters, metadata |
| **RST** | .rst | reStructuredText |
| **CSV/TSV** | .csv, .tsv | Tabular data |

### Element Types

```python
from unstructured.partition.auto import partition

elements = partition("complex_report.pdf")

# Elements are typed:
# Title          - Section headers
# NarrativeText  - Body paragraphs
# ListItem       - Bullet points
# Table          - Tabular data (as HTML or text)
# Image          - Extracted images with descriptions
# FigureCaption  - Image captions
# Header/Footer  - Page headers/footers
# PageBreak      - Page boundaries
```

### Chunking for RAG

```python
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title

elements = partition("document.pdf")

# Chunk by section headers (ideal for RAG)
chunks = chunk_by_title(
    elements,
    max_characters=1500,
    combine_text_under_n_chars=200,
)

for chunk in chunks:
    print(f"Chunk ({len(str(chunk))} chars): {str(chunk)[:80]}...")
```

### LangChain Integration

```python
from langchain_community.document_loaders import UnstructuredFileLoader

loader = UnstructuredFileLoader("report.pdf", mode="elements")
docs = loader.load()

# Each element becomes a LangChain Document
for doc in docs:
    print(doc.page_content[:100])
    print(doc.metadata)  # e.g. {"source": "report.pdf", "category": "NarrativeText", ...}
```

### Batch Processing

```python
import os

from unstructured.partition.auto import partition

os.makedirs("output", exist_ok=True)  # the output directory must exist before writing

for filename in os.listdir("documents/"):
    path = os.path.join("documents", filename)
    if not os.path.isfile(path):
        continue  # skip subdirectories
    elements = partition(path)
    text = "\n\n".join(str(e) for e in elements)
    with open(os.path.join("output", f"{filename}.txt"), "w", encoding="utf-8") as f:
        f.write(text)
```

---

## FAQ

**Q: What is Unstructured?**
A: Unstructured is an open-source document ETL library with 14,400+ GitHub stars that extracts structured data from 20+ document formats (PDF, DOCX, HTML, images, and more) for LLM and RAG pipelines.

**Q: How is Unstructured different from MinerU or Docling?**
A: Unstructured supports the widest range of formats (20+, whereas MinerU focuses on PDF). MinerU has better layout detection for complex PDFs. Docling (IBM) excels at table extraction. Unstructured is the best all-rounder for heterogeneous document collections.

**Q: Is Unstructured free?**
A: Yes, the open-source library is free under Apache-2.0. Unstructured also offers a hosted API service with a free tier.

---
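Returning to the batch-processing loop above: in a mixed folder, one unreadable file would stop the whole run. The following is a minimal sketch of a more defensive version, assuming the partition function is passed in as a parameter. `process_directory` and `fake_partition` are hypothetical names for illustration; `fake_partition` is a stand-in so the sketch runs without the library installed, and in real use you would pass Unstructured's `partition` instead.

```python
import tempfile
from pathlib import Path


def process_directory(src_dir, out_dir, partition_fn):
    """Run partition_fn on every file in src_dir and write joined text to out_dir.

    Returns (written, failed): output file names and (input name, error) pairs.
    """
    src, out = Path(src_dir), Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written, failed = [], []
    for path in sorted(src.iterdir()):
        if not path.is_file():
            continue  # skip subdirectories
        try:
            elements = partition_fn(str(path))
        except Exception as exc:  # record the failure and keep going
            failed.append((path.name, str(exc)))
            continue
        target = out / f"{path.name}.txt"
        target.write_text("\n\n".join(str(e) for e in elements), encoding="utf-8")
        written.append(target.name)
    return written, failed


def fake_partition(filename):
    """Hypothetical stand-in: split on blank lines and reject .bin files."""
    if filename.endswith(".bin"):
        raise ValueError("unsupported file type")
    text = Path(filename).read_text(encoding="utf-8")
    return [part for part in text.split("\n\n") if part]


# Demo on a throwaway directory with one good file and one bad file.
with tempfile.TemporaryDirectory() as tmp:
    docs = Path(tmp) / "documents"
    docs.mkdir()
    (docs / "a.txt").write_text("Title\n\nBody text", encoding="utf-8")
    (docs / "b.bin").write_text("junk", encoding="utf-8")
    written, failed = process_directory(docs, Path(tmp) / "output", fake_partition)
    print(written, failed)
```

Collecting failures instead of raising lets a long run finish and leaves a list of files to inspect or retry afterwards.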
