Unstructured — Document ETL for LLM Pipelines
Extract clean data from PDFs, DOCX, HTML, images, and emails for RAG and LLM ingestion. 14K+ GitHub stars.
What it is
Unstructured is an open-source library that extracts and transforms data from unstructured documents into clean, structured formats suitable for LLM processing. It handles PDFs, Word documents, HTML pages, images (via OCR), emails, and many other file types. The library auto-detects document types and applies the appropriate parsing strategy.
It targets developers building RAG (Retrieval-Augmented Generation) pipelines, knowledge bases, and any AI application that needs to ingest real-world documents.
How it saves time or tokens
Unstructured handles the messy work of document parsing that would otherwise require multiple specialized libraries. Instead of writing separate code for PDFs, Word docs, and HTML, you call one function. The output is chunked and cleaned for LLM consumption, reducing token waste from formatting artifacts, headers, and footers. For RAG pipelines, properly chunked documents mean better retrieval accuracy and lower token costs.
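As a rough illustration of the token savings, here is a minimal sketch in plain Python (no library calls) of dropping boilerplate element categories before embedding. The category names mirror those Unstructured assigns to elements, but the element list itself is invented for this example:

```python
# Hypothetical parsed elements as (category, text) pairs, mimicking
# the categories Unstructured assigns (Title, NarrativeText, Header, Footer).
elements = [
    ("Header", "ACME Corp Confidential"),
    ("Title", "Q3 Results"),
    ("NarrativeText", "Revenue grew 12% year over year."),
    ("Footer", "Page 1 of 12"),
]

# Drop boilerplate categories that waste context-window tokens.
BOILERPLATE = {"Header", "Footer", "PageNumber"}
kept = [text for category, text in elements if category not in BOILERPLATE]

print(kept)  # Only the title and body text survive.
```

Every header or footer filtered out here is text you never pay to embed, retrieve, or stuff into a prompt.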
How to use
- Install the library:

```shell
pip install unstructured
```

- Parse any document type:

```python
from unstructured.partition.auto import partition

# Auto-detect and parse any document
elements = partition(filename='report.pdf')
for element in elements:
    print(f'{type(element).__name__}: {str(element)[:100]}')
```
- Use a type-specific partition function:

```python
from unstructured.partition.pdf import partition_pdf
from unstructured.partition.html import partition_html

# PDF with OCR for scanned documents
elements = partition_pdf(filename='scanned_report.pdf', strategy='ocr_only')

# HTML page fetched from a URL
elements = partition_html(url='https://example.com/article')
```
Example

```python
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title

# Parse a PDF report
elements = partition(filename='quarterly_report.pdf')

# Chunk for RAG ingestion
chunks = chunk_by_title(
    elements,
    max_characters=1000,
    combine_text_under_n_chars=200,
)

# Each chunk is ready for embedding and vector storage
for chunk in chunks:
    print(f'Type: {chunk.category}')
    print(f'Text: {chunk.text[:200]}')
    print(f'Metadata: {chunk.metadata.to_dict()}')
    print('---')
```
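To show where the chunks go next, here is a toy sketch of the embed-and-store step. The hash-based "embedding" and the dict-backed store are stand-ins for a real embedding model and vector database, and the chunk dicts are invented for illustration:

```python
import hashlib

# Stand-in chunks; in a real pipeline these come from chunk_by_title().
chunks = [
    {"text": "Revenue grew 12% year over year.", "page_number": 3},
    {"text": "Operating costs fell slightly.", "page_number": 4},
]

def toy_embed(text: str) -> list[float]:
    # Deterministic fake embedding: not semantically meaningful,
    # just shaped like a small dense vector in [0, 1].
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:8]]

# Dict-backed stand-in for a vector database upsert.
store = {}
for i, chunk in enumerate(chunks):
    store[f"chunk-{i}"] = {
        "vector": toy_embed(chunk["text"]),
        "text": chunk["text"],
        "metadata": {"page_number": chunk["page_number"]},
    }

print(len(store))  # 2 records ready for similarity search
```

Keeping the chunk text and metadata alongside each vector is what lets a RAG retriever return citable passages rather than bare embeddings.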
Related on TokRepo
- AI tools for RAG -- RAG pipeline tools and frameworks
- AI tools for documents -- Document processing and analysis tools
Common pitfalls
- Some document types require extra system dependencies. PDF parsing needs poppler-utils and tesseract for OCR. Install them via your system package manager before using those features.
- Automatic file-type detection via partition() may not always choose the best parser. For production pipelines, call the type-specific function directly (partition_pdf, partition_html) and set the strategy parameter explicitly.
- Large documents can produce thousands of elements. Use the chunking utilities to combine small elements and split large ones before sending to your LLM or vector database.
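For the first pitfall above, the system dependencies can be installed like this on Debian/Ubuntu (package names vary by distro; on macOS, brew install poppler tesseract is the usual equivalent, and the pip extras names may differ across library versions):

```shell
# Debian/Ubuntu: PDF rendering and OCR dependencies
sudo apt-get install -y poppler-utils tesseract-ocr

# Optional Python extras for PDF support (check your version's docs)
pip install "unstructured[pdf]"
```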
Frequently Asked Questions

What file types does Unstructured support?
Unstructured supports PDF, DOCX, PPTX, XLSX, HTML, XML, EML (emails), MSG, RTF, TXT, CSV, TSV, images (PNG, JPG via OCR), and Markdown. The partition() function auto-detects the file type and applies the appropriate parser. Each file type also has a dedicated partition function for fine-grained control.

Can it extract text from scanned PDFs and images?
Yes. Unstructured uses OCR (Tesseract) to extract text from scanned PDFs and images. Set the strategy parameter to 'ocr_only' for fully scanned documents or 'hi_res' for mixed documents with both digital text and scanned sections. Tesseract must be installed on your system.

How does it chunk documents for RAG?
Unstructured provides chunking utilities like chunk_by_title that group elements by document structure (headings, sections). You set max_characters for chunk size limits and combine_text_under_n_chars to merge small elements. This produces chunks that are semantically coherent and sized for LLM context windows.

Does it integrate with vector databases?
Yes. Unstructured provides destination connectors for Pinecone, Weaviate, Chroma, Qdrant, Elasticsearch, and others. The ingest pipeline processes documents, chunks them, and writes directly to your vector database, creating an end-to-end document ETL pipeline.

Is there a hosted option?
Yes. Unstructured offers a hosted API service that handles parsing without requiring you to manage dependencies and infrastructure. The API accepts documents via HTTP and returns structured elements. The open-source library is free for self-hosted use, while the API has usage-based pricing.
Citations (3)
- Unstructured GitHub Repository -- Unstructured extracts data from PDFs, DOCX, HTML, images, and emails
- Unstructured Documentation -- Unstructured provides chunking strategies optimized for RAG pipelines
- RAG Survey Paper -- Retrieval-Augmented Generation benefits from properly chunked document data
Source & Thanks
Created by Unstructured-IO. Licensed under Apache-2.0.
unstructured — ⭐ 14,400+