Supported Formats
| Format | Extension | Features |
|---|---|---|
| PDF | .pdf | OCR, table extraction, image extraction |
| Word | .docx | Full formatting, tables, images |
| PowerPoint | .pptx | Slides, notes, images |
| Excel | .xlsx | Sheets, formulas, charts |
| HTML | .html | Clean text extraction, link preservation |
| Email | .eml, .msg | Body, attachments, metadata |
| Markdown | .md | Headers, code blocks, links |
| Images | .png, .jpg | OCR text extraction |
| EPUB | .epub | Chapters, metadata |
| RST | .rst | reStructuredText |
| CSV/TSV | .csv, .tsv | Tabular data |
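
The extensions above can serve as a quick pre-flight check before handing files to a parser. The sketch below is only an illustration: `SUPPORTED_EXTENSIONS`, `is_supported`, and `split_supported` are hypothetical names and not part of the Unstructured API, and the set simply mirrors the extensions named in this document.

```python
from pathlib import Path

# Extensions named in this document. Hypothetical constant for illustration,
# not something exported by the Unstructured library.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".pptx", ".xlsx", ".html", ".eml", ".msg",
    ".md", ".png", ".jpg", ".epub", ".rst", ".csv", ".tsv",
}


def is_supported(path):
    """Return True if the file extension (case-insensitive) is in the set above."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


def split_supported(paths):
    """Split an iterable of paths into (supported, skipped) lists, keeping order."""
    supported, skipped = [], []
    for p in paths:
        (supported if is_supported(p) else skipped).append(p)
    return supported, skipped
```

Matching on the lowercased suffix keeps files such as `REPORT.PDF` from being skipped by accident. Files without an extension are treated as unsupported here, which is a simplification.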

Element Types

    from unstructured.partition.auto import partition

    elements = partition("complex_report.pdf")

    # Elements are typed:
    #   Title          - Section headers
    #   NarrativeText  - Body paragraphs
    #   ListItem       - Bullet points
    #   Table          - Tabular data (as HTML or text)
    #   Image          - Extracted images with descriptions
    #   FigureCaption  - Image captions
    #   Header/Footer  - Page headers/footers
    #   PageBreak      - Page boundaries

Chunking for RAG

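
As a rough mental model of chunking by title: walk the elements in order, start a new chunk at every `Title`, and also start a new one when the next element would push the chunk past a character limit. The sketch below does this over plain `(category, text)` tuples. It is a simplified stand-in and not Unstructured's implementation; `chunk_on_titles` and its `max_characters` parameter are hypothetical names used only for illustration, and the library's actual behavior (for example, merging small sections) is richer.

```python
def chunk_on_titles(elements, max_characters=1500):
    """Group (category, text) pairs into chunks that break at each Title.

    A chunk is also closed early when adding the next text would exceed
    max_characters (counting text lengths only, not the separators).
    A single oversized text is kept whole in its own chunk.
    """
    chunks, current, size = [], [], 0
    for category, text in elements:
        starts_section = category == "Title"
        too_long = size + len(text) > max_characters
        # Close the running chunk before starting a new section or overflowing.
        if current and (starts_section or too_long):
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(text)
        size += len(text)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Hypothetical sample data standing in for partitioned elements.
sample = [
    ("Title", "Introduction"),
    ("NarrativeText", "Unstructured parses many formats."),
    ("Title", "Usage"),
    ("ListItem", "Partition the file."),
    ("ListItem", "Chunk the elements."),
]
for chunk in chunk_on_titles(sample):
    print(repr(chunk))
```

Breaking at titles keeps each chunk inside one section, which is why this style of chunking tends to suit retrieval: a retrieved chunk carries its own heading as context.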
    from unstructured.partition.auto import partition
    from unstructured.chunking.title import chunk_by_title

    elements = partition("document.pdf")

    # Chunk by section headers (ideal for RAG)
    chunks = chunk_by_title(
        elements,
        max_characters=1500,
        combine_text_under_n_chars=200,
    )

    for chunk in chunks:
        print(f"Chunk ({len(str(chunk))} chars): {str(chunk)[:80]}...")

LangChain Integration

    from langchain_community.document_loaders import UnstructuredFileLoader

    loader = UnstructuredFileLoader("report.pdf", mode="elements")
    docs = loader.load()

    # Each element becomes a LangChain Document
    for doc in docs:
        print(doc.page_content[:100])
        print(doc.metadata)  # e.g. {"source": "report.pdf", "category": "NarrativeText", ...}

Batch Processing

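    import os

    from unstructured.partition.auto import partition

    # Create the output directory up front so the writes below do not fail
    os.makedirs("output", exist_ok=True)

    for filename in os.listdir("documents/"):
        # Partition each file, then join its elements with blank lines between them
        elements = partition(f"documents/{filename}")
        text = "\n\n".join(str(e) for e in elements)
        with open(f"output/{filename}.txt", "w", encoding="utf-8") as f:
            f.write(text)

FAQ
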
Q: What is Unstructured?
A: Unstructured is an open-source document ETL library with 14,400+ GitHub stars. It extracts structured data from 20+ document formats (including PDF, DOCX, HTML, and images) for LLM and RAG pipelines.

Q: How is Unstructured different from MinerU or Docling?
A: Unstructured supports the widest range of formats (20+, versus MinerU's focus on PDF). MinerU has better layout detection for complex PDFs, and Docling (IBM) excels at table extraction. Unstructured is the best all-rounder for heterogeneous document collections.

Q: Is Unstructured free?
A: Yes, the open-source library is free under the Apache-2.0 license. Unstructured also offers a hosted API service with a free tier.