MCP Configs · Apr 2, 2026 · 2 min read

Unstructured — Document ETL for LLM Pipelines

Extract clean data from PDFs, DOCX, HTML, images, and emails for RAG and LLM ingestion. 14K+ GitHub stars.

Agent-ready

Safe staging for this asset

This asset is staged first. The copied prompt asks the agent to inspect the staged files before activating any scripts, MCP config, or global config.

Stage only · 17/100 · Policy: staging
Agent surface: any MCP/CLI agent
Type: MCP config
Installation: stage only
Trust: Established
Entry: unstructured.md

Safe staging command:
npx -y tokrepo@latest install c2ba9909-f624-414f-8aeb-fbd95c50766e --target codex

The command places files in staging first; activation requires reviewing the README and the staged plan.

TL;DR
Unstructured parses PDFs, DOCX, HTML, and images into clean data for LLM pipelines.
§01

What it is

Unstructured is an open-source library that extracts and transforms data from unstructured documents into clean, structured formats suitable for LLM processing. It handles PDFs, Word documents, HTML pages, images (via OCR), emails, and many other file types. The library auto-detects document types and applies the appropriate parsing strategy.

It targets developers building RAG (Retrieval-Augmented Generation) pipelines, knowledge bases, and any AI application that needs to ingest real-world documents.

§02

How it saves time or tokens

Unstructured handles the messy document parsing that would otherwise require several specialized libraries. Instead of writing separate code for PDFs, Word docs, and HTML, you call one function. The output is a list of typed elements (titles, narrative text, tables, headers, footers) that can be cleaned and chunked for LLM consumption, which cuts token waste from formatting artifacts, headers, and footers. For RAG pipelines, well-chunked documents tend to improve retrieval accuracy and lower token costs.
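
As a rough illustration of the token savings, the sketch below drops page furniture before any text reaches the model. It assumes each element exposes a `category` string such as `Header` or `Footer`, as the elements returned by `partition` do. The helper name `drop_page_furniture` and the chosen category set are illustrative and not part of the library; the sample objects are stand-ins so the snippet runs without parsing a real file.

```python
from types import SimpleNamespace

# Categories treated as page furniture in this sketch (an assumption; adjust per corpus).
FURNITURE = {"Header", "Footer", "PageBreak"}


def drop_page_furniture(elements):
    """Return only the elements whose category is not page furniture."""
    return [el for el in elements if getattr(el, "category", None) not in FURNITURE]


# Stand-in objects that mimic parsed elements; real input would come from partition().
sample = [
    SimpleNamespace(category="Header", text="ACME Corp - Confidential"),
    SimpleNamespace(category="Title", text="Q3 Results"),
    SimpleNamespace(category="NarrativeText", text="Revenue grew in the third quarter."),
    SimpleNamespace(category="Footer", text="Page 1 of 12"),
]

kept = drop_page_furniture(sample)
print([el.category for el in kept])
```

With real documents, the same function can be applied to the list returned by `partition` before chunking.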

§03

How to use

  1. Install the library. The base package covers plain text, HTML, and similar formats; PDFs, images, and Office files need extras such as unstructured[pdf] or unstructured[all-docs]:
pip install unstructured
  2. Parse any document type:
from unstructured.partition.auto import partition

# Auto-detect and parse any document
elements = partition(filename='report.pdf')

for element in elements:
    print(f'{type(element).__name__}: {str(element)[:100]}')
  3. Use with specific document types:
from unstructured.partition.pdf import partition_pdf
from unstructured.partition.html import partition_html

# PDF with OCR for scanned documents
elements = partition_pdf('scanned_report.pdf', strategy='ocr_only')

# HTML page
elements = partition_html(url='https://example.com/article')
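
One possible way to combine the steps above is a small dispatcher that picks the type-specific function and an explicit strategy from the file extension. This is a sketch under assumptions: the names `partition_kwargs` and `parse` are hypothetical, the `scanned` flag must be supplied by the caller, and the `fast` strategy for digital PDFs is a choice made here rather than a recommendation from this page. The library imports are deferred so the selection logic can be tried without Unstructured installed.

```python
from pathlib import Path


def partition_kwargs(path, scanned=False):
    """Pick a partitioner name and keyword arguments from the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        # 'ocr_only' for scanned files; 'fast' for PDFs with embedded text (assumed default).
        return "pdf", {"strategy": "ocr_only" if scanned else "fast"}
    if suffix in {".html", ".htm"}:
        return "html", {}
    # Anything else falls back to auto-detection.
    return "auto", {}


def parse(path, scanned=False):
    """Call the matching Unstructured partition function for a local file."""
    name, kwargs = partition_kwargs(path, scanned)
    if name == "pdf":
        from unstructured.partition.pdf import partition_pdf
        return partition_pdf(filename=str(path), **kwargs)
    if name == "html":
        from unstructured.partition.html import partition_html
        return partition_html(filename=str(path))
    from unstructured.partition.auto import partition
    return partition(filename=str(path))


# Only the selection step runs here; parse() needs the library and a real file.
print(partition_kwargs("scanned_report.pdf", scanned=True))
```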
§04

Example

from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title

# Parse a PDF report
elements = partition(filename='quarterly_report.pdf')

# Chunk for RAG ingestion
chunks = chunk_by_title(
    elements,
    max_characters=1000,
    combine_text_under_n_chars=200
)

# Each chunk is ready for embedding and vector storage
for chunk in chunks:
    print(f'Type: {chunk.category}')
    print(f'Text: {chunk.text[:200]}')
    print(f'Metadata: {chunk.metadata.to_dict()}')
    print('---')
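
To hand the chunks from the example above to an embedding step, each one can be flattened into a plain dictionary. The sketch below is one way to do that, assuming only the `text` attribute and `metadata.to_dict()` used in the example. The function name `chunk_to_record`, the record layout, and the hash-based ID are hypothetical choices, and the sample chunk is a stand-in object rather than real parser output.

```python
import hashlib
from types import SimpleNamespace


def chunk_to_record(chunk, source):
    """Build a plain dict from a chunk, ready to pass to an embedding step."""
    text = chunk.text.strip()
    # Hashing source and text gives a stable ID, so a store that supports
    # upserts can overwrite a chunk on re-ingestion instead of duplicating it.
    record_id = hashlib.sha256(f"{source}:{text}".encode("utf-8")).hexdigest()[:16]
    return {"id": record_id, "text": text, "metadata": chunk.metadata.to_dict()}


# Stand-in chunk with the two attributes the function relies on.
chunk = SimpleNamespace(
    text="  Placeholder chunk text.  ",
    metadata=SimpleNamespace(to_dict=lambda: {"page_number": 3}),
)

record = chunk_to_record(chunk, source="quarterly_report.pdf")
print(record["id"], record["text"])
```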
§05

Common pitfalls

  • Some document types require extra system dependencies. PDF parsing needs poppler-utils, and OCR needs tesseract. Install them via your system package manager before using those features.
  • The 'auto' strategy may not always choose the best parser. For production pipelines, specify the partition function explicitly (partition_pdf, partition_html) and set the strategy parameter.
  • Large documents can produce thousands of elements. Use the chunking utilities to combine small elements and split large ones before sending to your LLM or vector database.
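
The first pitfall can be caught early with a preflight check. The sketch below looks for the `pdftoppm` executable (shipped with poppler-utils) and the `tesseract` executable on the PATH using only the standard library. The function name `missing_system_packages` is hypothetical, and the package names in the mapping are labels for the message; the exact names vary by package manager.

```python
import shutil

# Executable -> package label. pdftoppm ships with poppler-utils; tesseract with Tesseract OCR.
REQUIRED_TOOLS = {"pdftoppm": "poppler-utils", "tesseract": "tesseract"}


def missing_system_packages(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the sorted package labels whose executable is not on PATH."""
    return sorted({pkg for exe, pkg in tools.items() if which(exe) is None})


missing = missing_system_packages()
if missing:
    print("Install before parsing PDFs or running OCR:", ", ".join(missing))
else:
    print("System dependencies found.")
```

Running this at pipeline startup surfaces a missing dependency as one clear message before any document is parsed.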

Frequently asked questions

What file types does Unstructured support?

Unstructured supports PDF, DOCX, PPTX, XLSX, HTML, XML, EML (emails), MSG, RTF, TXT, CSV, TSV, images (PNG, JPG via OCR), and Markdown. The partition() function auto-detects the file type and applies the appropriate parser. Each file type has a dedicated partition function for fine-grained control.

Does Unstructured work with scanned PDFs?

Yes. Unstructured uses OCR (Tesseract) to extract text from scanned PDFs and images. Set the strategy parameter to 'ocr_only' for fully scanned documents or 'hi_res' for mixed documents with both digital text and scanned sections. Tesseract must be installed on your system.

How does chunking work in Unstructured?

Unstructured provides chunking utilities like chunk_by_title that group elements by document structure (headings, sections). You set max_characters for chunk size limits and combine_text_under_n_chars to merge small elements. This produces chunks that are semantically coherent and sized for LLM context windows.
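
To make the interaction between the two parameters concrete, here is a deliberately simplified toy version of title-based chunking. It is not the library's implementation: the function `toy_chunk_by_title` is hypothetical, it works on plain `(category, text)` tuples, and it only mimics the three ideas described above (split at titles, merge short sections, enforce a hard size limit).

```python
def toy_chunk_by_title(elements, max_characters=1000, combine_text_under_n_chars=200):
    """Group (category, text) pairs into size-bounded chunks, splitting at titles."""
    # 1. Start a new section at every Title element.
    sections = []
    for category, text in elements:
        if category == "Title" or not sections:
            sections.append([text])
        else:
            sections[-1].append(text)
    texts = ["\n".join(parts) for parts in sections]

    # 2. Merge the next section into a short chunk while the result still fits.
    merged = []
    for text in texts:
        fits = bool(merged) and len(merged[-1]) + 1 + len(text) <= max_characters
        if fits and len(merged[-1]) < combine_text_under_n_chars:
            merged[-1] = merged[-1] + "\n" + text
        else:
            merged.append(text)

    # 3. Hard-split anything that is still over the limit.
    chunks = []
    for text in merged:
        for start in range(0, len(text), max_characters):
            chunks.append(text[start:start + max_characters])
    return chunks


# Placeholder content: one short section, one medium section, one oversized section.
elements = [
    ("Title", "Intro"),
    ("NarrativeText", "Short note."),
    ("Title", "Details"),
    ("NarrativeText", "x" * 50),
    ("Title", "Appendix"),
    ("NarrativeText", "y" * 120),
]

chunks = toy_chunk_by_title(elements, max_characters=100, combine_text_under_n_chars=30)
print([len(c) for c in chunks])
```

In this run the short "Intro" section is merged into "Details", while the oversized "Appendix" section is split in two, so no chunk exceeds `max_characters`.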

Can Unstructured connect to vector databases?

Yes. Unstructured provides connectors (called 'ingest destinations') for Pinecone, Weaviate, Chroma, Qdrant, Elasticsearch, and others. The pipeline processes documents, chunks them, and writes directly to your vector database. This creates an end-to-end document ETL pipeline.

Is there a hosted version of Unstructured?

Yes. Unstructured offers a hosted API service that handles parsing without requiring you to manage dependencies and infrastructure. The API accepts documents via HTTP and returns structured elements. The open-source library is free for self-hosted use, while the API has usage-based pricing.

Source and acknowledgments

Created by Unstructured-IO. Licensed under Apache-2.0.

unstructured — ⭐ 14,400+
