Unstructured — Document ETL for LLM Pipelines
Extract clean data from PDFs, DOCX, HTML, images, and emails for RAG and LLM ingestion. 14K+ GitHub stars.
What it is
Unstructured is an open-source library that extracts and transforms data from unstructured documents into clean, structured formats suitable for LLM processing. It handles PDFs, Word documents, HTML pages, images (via OCR), emails, and many other file types. The library auto-detects document types and applies the appropriate parsing strategy.
It targets developers building RAG (Retrieval-Augmented Generation) pipelines, knowledge bases, and any AI application that needs to ingest real-world documents.
How it saves time or tokens
Unstructured handles the messy work of document parsing that would otherwise require multiple specialized libraries. Instead of writing separate code for PDFs, Word docs, and HTML, you call one function. The output is chunked and cleaned for LLM consumption, reducing token waste from formatting artifacts, headers, and footers. For RAG pipelines, properly chunked documents mean better retrieval accuracy and lower token costs.
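As a rough illustration of the token savings, here is a minimal sketch in plain Python (no library calls) of dropping boilerplate element categories before embedding. The category names mirror those Unstructured assigns to elements, but the element list itself is invented for this example:

```python
# Hypothetical parsed elements as (category, text) pairs, mimicking
# the categories Unstructured assigns (Title, NarrativeText, Header, Footer).
elements = [
    ("Header", "ACME Corp Confidential"),
    ("Title", "Q3 Results"),
    ("NarrativeText", "Revenue grew 12% year over year."),
    ("Footer", "Page 1 of 12"),
]

# Drop boilerplate categories that waste context-window tokens.
BOILERPLATE = {"Header", "Footer", "PageNumber"}
kept = [text for category, text in elements if category not in BOILERPLATE]

print(kept)  # Only the title and body text survive.
```

Every header or footer filtered out here is text you never pay to embed, retrieve, or stuff into a prompt.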
How to use
- Install the library:

```shell
pip install unstructured
```

- Parse any document type:

```python
from unstructured.partition.auto import partition

# Auto-detect and parse any document
elements = partition(filename='report.pdf')
for element in elements:
    print(f'{type(element).__name__}: {str(element)[:100]}')
```
- Use a type-specific partition function:

```python
from unstructured.partition.pdf import partition_pdf
from unstructured.partition.html import partition_html

# PDF with OCR for scanned documents
elements = partition_pdf(filename='scanned_report.pdf', strategy='ocr_only')

# HTML page fetched from a URL
elements = partition_html(url='https://example.com/article')
```
Example

```python
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title

# Parse a PDF report
elements = partition(filename='quarterly_report.pdf')

# Chunk for RAG ingestion
chunks = chunk_by_title(
    elements,
    max_characters=1000,
    combine_text_under_n_chars=200,
)

# Each chunk is ready for embedding and vector storage
for chunk in chunks:
    print(f'Type: {chunk.category}')
    print(f'Text: {chunk.text[:200]}')
    print(f'Metadata: {chunk.metadata.to_dict()}')
    print('---')
```
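To show where the chunks go next, here is a toy sketch of the embed-and-store step. The hash-based "embedding" and the dict-backed store are stand-ins for a real embedding model and vector database, and the chunk dicts are invented for illustration:

```python
import hashlib

# Stand-in chunks; in a real pipeline these come from chunk_by_title().
chunks = [
    {"text": "Revenue grew 12% year over year.", "page_number": 3},
    {"text": "Operating costs fell slightly.", "page_number": 4},
]

def toy_embed(text: str) -> list[float]:
    # Deterministic fake embedding: not semantically meaningful,
    # just shaped like a small dense vector in [0, 1].
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:8]]

# Dict-backed stand-in for a vector database upsert.
store = {}
for i, chunk in enumerate(chunks):
    store[f"chunk-{i}"] = {
        "vector": toy_embed(chunk["text"]),
        "text": chunk["text"],
        "metadata": {"page_number": chunk["page_number"]},
    }

print(len(store))  # 2 records ready for similarity search
```

Keeping the chunk text and metadata alongside each vector is what lets a RAG retriever return citable passages rather than bare embeddings.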
Related on TokRepo
- AI tools for RAG -- RAG pipeline tools and frameworks
- AI tools for documents -- Document processing and analysis tools
Common pitfalls
- Some document types require extra system dependencies. PDF parsing needs poppler-utils and tesseract for OCR. Install them via your system package manager before using those features.
- Automatic file-type detection via partition() may not always choose the best parser. For production pipelines, call the type-specific function directly (partition_pdf, partition_html) and set the strategy parameter explicitly.
- Large documents can produce thousands of elements. Use the chunking utilities to combine small elements and split large ones before sending to your LLM or vector database.
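For the first pitfall above, the system dependencies can be installed like this on Debian/Ubuntu (package names vary by distro; on macOS, brew install poppler tesseract is the usual equivalent, and the pip extras names may differ across library versions):

```shell
# Debian/Ubuntu: PDF rendering and OCR dependencies
sudo apt-get install -y poppler-utils tesseract-ocr

# Optional Python extras for PDF support (check your version's docs)
pip install "unstructured[pdf]"
```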
Frequently Asked Questions

What file types does Unstructured support?
Unstructured supports PDF, DOCX, PPTX, XLSX, HTML, XML, EML (emails), MSG, RTF, TXT, CSV, TSV, images (PNG, JPG via OCR), and Markdown. The partition() function auto-detects the file type and applies the appropriate parser. Each file type also has a dedicated partition function for fine-grained control.

Can it extract text from scanned PDFs and images?
Yes. Unstructured uses OCR (Tesseract) to extract text from scanned PDFs and images. Set the strategy parameter to 'ocr_only' for fully scanned documents or 'hi_res' for mixed documents with both digital text and scanned sections. Tesseract must be installed on your system.

How does it chunk documents for RAG?
Unstructured provides chunking utilities like chunk_by_title that group elements by document structure (headings, sections). You set max_characters for chunk size limits and combine_text_under_n_chars to merge small elements. This produces chunks that are semantically coherent and sized for LLM context windows.

Does it integrate with vector databases?
Yes. Unstructured provides destination connectors for Pinecone, Weaviate, Chroma, Qdrant, Elasticsearch, and others. The ingest pipeline processes documents, chunks them, and writes directly to your vector database, creating an end-to-end document ETL pipeline.

Is there a hosted option?
Yes. Unstructured offers a hosted API service that handles parsing without requiring you to manage dependencies and infrastructure. The API accepts documents via HTTP and returns structured elements. The open-source library is free for self-hosted use, while the API has usage-based pricing.
Citations (3)
- Unstructured GitHub Repository -- Unstructured extracts data from PDFs, DOCX, HTML, images, and emails
- Unstructured Documentation -- Unstructured provides chunking strategies optimized for RAG pipelines
- RAG Survey Paper -- Retrieval-Augmented Generation benefits from properly chunked document data
Source & Thanks
Created by Unstructured-IO. Licensed under Apache-2.0.
unstructured — ⭐ 14,400+