MCP Configs · Apr 2, 2026 · 2 min read

Unstructured — Document ETL for LLM Pipelines

Extract clean data from PDFs, DOCX, HTML, images, and emails for RAG and LLM ingestion. 14K+ GitHub stars.

Agent ready

Safe staging for this asset

This asset is staged first. The copied prompt asks the agent to inspect the staged files before activating scripts, MCP config, or global config.

Stage only · 17/100
Policy: staging

Agent surface: Any MCP/CLI agent
Type: MCP Config
Installation: Stage only
Trust: Established
Entry: unstructured.md

Safe staging command

npx -y tokrepo@latest install c2ba9909-f624-414f-8aeb-fbd95c50766e --target codex

Files are staged first; activation requires reviewing the README and the staged plan.

TL;DR

Unstructured parses PDFs, DOCX, HTML, and images into clean data for LLM pipelines.

§01

What it is

Unstructured is an open-source library that extracts and transforms data from unstructured documents into clean, structured formats suitable for LLM processing. It handles PDFs, Word documents, HTML pages, images (via OCR), emails, and many other file types. The library auto-detects document types and applies the appropriate parsing strategy.

It targets developers building RAG (Retrieval-Augmented Generation) pipelines, knowledge bases, and any AI application that needs to ingest real-world documents.

§02

How it saves time or tokens

Unstructured handles the messy work of document parsing that would otherwise require multiple specialized libraries. Instead of writing separate code for PDFs, Word docs, and HTML, you call one function. The resulting elements can then be cleaned and chunked for LLM consumption, which reduces token waste from formatting artifacts, headers, and footers. For RAG pipelines, properly chunked documents mean better retrieval accuracy and lower token costs.
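As a rough illustration of where the token savings come from, here is a minimal sketch that drops header, footer, and page-break elements before any text is passed on. The helper name `drop_noise` and the set of skipped categories are assumptions made for this example. It relies only on the `category` and `text` attributes that Unstructured elements expose, and it uses stand-in objects so it runs without any parsing dependencies.

```python
from types import SimpleNamespace

# Categories treated as noise here; this set is an assumption, adjust per corpus.
NOISE_CATEGORIES = {"Header", "Footer", "PageBreak"}


def drop_noise(elements, skip=NOISE_CATEGORIES):
    """Return elements whose category is not in `skip` and whose text is non-empty."""
    return [
        el for el in elements
        if el.category not in skip and str(el.text).strip()
    ]


# Stand-in elements; in practice these would come from partition().
sample = [
    SimpleNamespace(category="Header", text="ACME Corp - Confidential"),
    SimpleNamespace(category="Title", text="Q3 Results"),
    SimpleNamespace(category="NarrativeText", text="Revenue grew year over year."),
    SimpleNamespace(category="Footer", text="Page 1 of 12"),
]

kept = drop_noise(sample)
print([el.category for el in kept])  # ['Title', 'NarrativeText']
```

Filtering before chunking keeps repeated page furniture out of every chunk, so it is neither embedded nor sent to the model.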

§03

How to use

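Install the library. The base package covers plain text, HTML, XML, JSON, and emails; other formats need extras such as unstructured[pdf] or unstructured[all-docs]: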

pip install unstructured

Parse any document type:

from unstructured.partition.auto import partition

# Auto-detect and parse any document
elements = partition(filename='report.pdf')

for element in elements:
    print(f'{type(element).__name__}: {str(element)[:100]}')
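To get a quick feel for what a parse produced, a small sketch that tallies elements by class name, following the `type(element).__name__` idiom above. `summarize` is a hypothetical helper, and the stand-in classes only mimic element types so the snippet runs on its own.

```python
from collections import Counter


def summarize(elements):
    """Count elements by class name, most common first."""
    return Counter(type(el).__name__ for el in elements).most_common()


# Stand-in classes that mimic element types; real ones come from partition().
class Title:
    pass


class NarrativeText:
    pass


sample = [Title(), NarrativeText(), NarrativeText()]
print(summarize(sample))  # [('NarrativeText', 2), ('Title', 1)]
```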

Use with specific document types:

from unstructured.partition.pdf import partition_pdf
from unstructured.partition.html import partition_html

# PDF with OCR for scanned documents
elements = partition_pdf(filename='scanned_report.pdf', strategy='ocr_only')

# HTML page
elements = partition_html(url='https://example.com/article')

§04

Example

from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title

# Parse a PDF report
elements = partition(filename='quarterly_report.pdf')

# Chunk for RAG ingestion
chunks = chunk_by_title(
    elements,
    max_characters=1000,
    combine_text_under_n_chars=200
)

# Each chunk is ready for embedding and vector storage
for chunk in chunks:
    print(f'Type: {chunk.category}')
    print(f'Text: {chunk.text[:200]}')
    print(f'Metadata: {chunk.metadata.to_dict()}')
    print('---')
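One way to hand these chunks to an embedding step is to flatten each into a plain record. The sketch below assumes a record shape (`id`, `text`, `metadata`) and a hash-based ID scheme, neither of which is part of Unstructured. It touches only `chunk.text` and `chunk.metadata.to_dict()`, the same attributes used above, and runs against a stand-in object.

```python
import hashlib
from types import SimpleNamespace


def to_records(chunks, source):
    """Turn chunks into dicts with a stable ID derived from source, position, and text."""
    records = []
    for i, chunk in enumerate(chunks):
        key = f"{source}:{i}:{chunk.text}".encode("utf-8")
        records.append({
            "id": hashlib.sha256(key).hexdigest()[:16],
            "text": chunk.text,
            "metadata": chunk.metadata.to_dict(),
        })
    return records


# Stand-in chunk; a real one would come from chunk_by_title().
fake = SimpleNamespace(
    text="Revenue grew year over year.",
    metadata=SimpleNamespace(to_dict=lambda: {"page_number": 1}),
)
records = to_records([fake], source="quarterly_report.pdf")
print(records[0]["metadata"])  # {'page_number': 1}
```

Deriving the ID from the content means re-running the pipeline on an unchanged document yields the same IDs, which makes upserts into a vector store idempotent.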

§05

Related on TokRepo

AI tools for RAG -- RAG pipeline tools and frameworks
AI tools for documents -- Document processing and analysis tools

§06

Common pitfalls

Some document types require extra system dependencies: PDF parsing needs poppler-utils, and OCR needs tesseract. Install them with your system package manager before using those features.
The auto-detecting partition() call and the 'auto' strategy may not choose the best parser for every file. For production pipelines, call the specific partition function (partition_pdf, partition_html) and set the strategy parameter explicitly.
Large documents can produce thousands of elements. Use the chunking utilities to combine small elements and split large ones before sending to your LLM or vector database.
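The first pitfall can be caught early with a preflight check. This is a minimal sketch using only the standard library. It assumes `pdftoppm` (shipped with poppler-utils) and `tesseract` are the executables to look for, and `missing_tools` is a hypothetical helper name.

```python
import shutil

# Executables assumed to signal each dependency: pdftoppm ships with
# poppler-utils, and tesseract is the OCR engine's command-line binary.
REQUIRED_TOOLS = {"poppler-utils": "pdftoppm", "tesseract": "tesseract"}


def missing_tools(required=REQUIRED_TOOLS, which=shutil.which):
    """Return the names of dependencies whose executable is not on PATH."""
    return [name for name, exe in required.items() if which(exe) is None]


missing = missing_tools()
if missing:
    print("Install before parsing PDFs or running OCR:", ", ".join(missing))
else:
    print("System dependencies found.")
```

Running this at startup turns a mid-pipeline parsing failure into an immediate, readable message.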

Frequently asked questions

What file types does Unstructured support?

Unstructured supports PDF, DOCX, PPTX, XLSX, HTML, XML, EML (emails), MSG, RTF, TXT, CSV, TSV, images (PNG, JPG via OCR), and Markdown. The partition() function auto-detects the file type and applies the appropriate parser. Each file type has a dedicated partition function for fine-grained control.

Does Unstructured work with scanned PDFs?

Yes. Unstructured uses OCR (Tesseract) to extract text from scanned PDFs and images. Set the strategy parameter to 'ocr_only' for fully scanned documents or 'hi_res' for mixed documents with both digital text and scanned sections. Tesseract must be installed on your system.
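The rule above can be written down as a small decision helper. `pick_pdf_strategy` is a hypothetical function, and how the two flags are determined is left to the caller. Returning `'fast'` for purely digital PDFs is an assumption added here, not something stated above.

```python
def pick_pdf_strategy(has_text_layer, has_scanned_pages):
    """Map two facts about a PDF to a strategy name, following the rule above."""
    if has_scanned_pages and not has_text_layer:
        return "ocr_only"  # fully scanned: no extractable text
    if has_scanned_pages:
        return "hi_res"    # mixed digital text and scanned sections
    return "fast"          # digital text only (assumption, see lead-in)


print(pick_pdf_strategy(has_text_layer=False, has_scanned_pages=True))  # ocr_only
```

The returned string would be passed as the `strategy` argument of `partition_pdf`.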

How does chunking work in Unstructured?

Unstructured provides chunking utilities like chunk_by_title that group elements by document structure (headings, sections). You set max_characters for chunk size limits and combine_text_under_n_chars to merge small elements. This produces chunks that are semantically coherent and sized for LLM context windows.
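To make the two parameters concrete, here is a simplified illustration of heading-based chunking over plain `(category, text)` pairs. It is not the library's algorithm. `chunk_by_heading` is a hypothetical function that only mimics the roles of `max_characters` and `combine_text_under_n_chars` described above.

```python
def chunk_by_heading(elements, max_characters=1000, combine_text_under_n_chars=200):
    """Group (category, text) pairs into chunks that break at titles.

    A simplified illustration, not the library's implementation.
    """
    chunks, current = [], ""
    for category, text in elements:
        # Hard-split any single element longer than the limit.
        pieces = [text[i:i + max_characters] for i in range(0, len(text), max_characters)]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            # A title opens a new chunk unless the current one is still small.
            starts_section = category == "Title" and len(current) >= combine_text_under_n_chars
            if current and (starts_section or len(candidate) > max_characters):
                chunks.append(current)
                current = piece
            else:
                current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = [
    ("Title", "Intro"),
    ("NarrativeText", "a" * 30),
    ("Title", "Methods"),
    ("NarrativeText", "b" * 30),
]
chunks = chunk_by_heading(sample, max_characters=60, combine_text_under_n_chars=10)
print(len(chunks))  # 2
```

With a larger `combine_text_under_n_chars`, the short "Intro" section is no longer closed at the next title, so "Methods" is merged into it until the size limit forces a break.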

Can Unstructured connect to vector databases?

Yes. Unstructured provides connectors (called 'ingest destinations') for Pinecone, Weaviate, Chroma, Qdrant, Elasticsearch, and others. The pipeline processes documents, chunks them, and writes directly to your vector database. This creates an end-to-end document ETL pipeline.

Is there a hosted version of Unstructured?

Yes. Unstructured offers a hosted API service that handles parsing without requiring you to manage dependencies and infrastructure. The API accepts documents via HTTP and returns structured elements. The open-source library is free for self-hosted use, while the API has usage-based pricing.

Source and acknowledgments

Created by Unstructured-IO. Licensed under Apache-2.0.

unstructured — ⭐ 14,400+
