MCP Configs · April 2, 2026 · 1 min read

Unstructured — Document ETL for LLM Pipelines

Extract clean data from PDFs, DOCX, HTML, images, and emails for RAG and LLM ingestion. 14K+ GitHub stars.

MCP Hub · Community

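Agent ready

This asset is staged safely

The asset is staged first. The copied instructions tell the agent to read the staged files and ask for confirmation before activating any scripts, MCP configs, or global configs.

Stage only · 17/100 · Policy: staging required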

Agent entry point: any MCP/CLI agent
Type: MCP config
Install: stage only
Trust level: Established
Entry: unstructured.md

Safe staging command

npx -y tokrepo@latest install c2ba9909-f624-414f-8aeb-fbd95c50766e --target codex

Stage the files first; read the staged README and install plan before activating.

TL;DR

Unstructured parses PDFs, DOCX, HTML, and images into clean data for LLM pipelines.

§01

What it is

Unstructured is an open-source library that extracts and transforms data from unstructured documents into clean, structured formats suitable for LLM processing. It handles PDFs, Word documents, HTML pages, images (via OCR), emails, and many other file types. The library auto-detects document types and applies the appropriate parsing strategy.

It targets developers building RAG (Retrieval-Augmented Generation) pipelines, knowledge bases, and any AI application that needs to ingest real-world documents.

§02

How it saves time or tokens

Unstructured handles the messy document parsing that would otherwise require several specialized libraries. Instead of writing separate code for PDFs, Word docs, and HTML, you call one function. The output is a list of typed elements (titles, narrative text, tables, headers, footers) that you can filter and chunk before LLM consumption, which cuts the tokens wasted on formatting artifacts, headers, and footers. For RAG pipelines, well-chunked documents tend to improve retrieval accuracy and lower token costs.
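To make the token-saving claim concrete, here is a minimal sketch of filtering out page furniture before anything reaches the LLM. It works on plain dictionaries shaped like a trimmed-down version of what an element's to_dict() returns; the records, the drop_page_furniture helper, and the choice of which categories count as noise are all assumptions made for illustration, not part of the library.

```python
# Categories treated as page furniture in this sketch (an assumption;
# adjust the set for your own documents).
NOISE_TYPES = {"Header", "Footer", "PageBreak"}


def drop_page_furniture(records):
    """Keep records that are not noise categories and carry non-blank text."""
    return [
        r for r in records
        if r.get("type") not in NOISE_TYPES and r.get("text", "").strip()
    ]


# Hand-written stand-ins for parsed elements, reduced to two keys.
records = [
    {"type": "Header", "text": "ACME Corp Confidential"},
    {"type": "Title", "text": "Q3 Results"},
    {"type": "NarrativeText", "text": "Revenue grew 12% year over year."},
    {"type": "Footer", "text": "Page 1 of 40"},
]

kept = drop_page_furniture(records)
chars_before = sum(len(r["text"]) for r in records)
chars_after = sum(len(r["text"]) for r in kept)
print([r["type"] for r in kept])
print(f"characters: {chars_before} -> {chars_after}")
```

Character counts are only a rough proxy for tokens, but the same filter applies unchanged to real parsed output once each element is converted to a dictionary.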

§03

How to use

Install the library:

pip install unstructured

Parse any document type:

from unstructured.partition.auto import partition

# Auto-detect and parse any document
elements = partition(filename='report.pdf')

for element in elements:
    print(f'{type(element).__name__}: {str(element)[:100]}')

Use with specific document types:

from unstructured.partition.pdf import partition_pdf
from unstructured.partition.html import partition_html

# PDF with OCR for scanned documents (requires tesseract)
pdf_elements = partition_pdf(filename='scanned_report.pdf', strategy='ocr_only')

# HTML page fetched from a URL
html_elements = partition_html(url='https://example.com/article')

§04

Example

from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title

# Parse a PDF report
elements = partition(filename='quarterly_report.pdf')

# Chunk for RAG ingestion
chunks = chunk_by_title(
    elements,
    max_characters=1000,
    combine_text_under_n_chars=200
)

# Each chunk is ready for embedding and vector storage
for chunk in chunks:
    print(f'Type: {chunk.category}')
    print(f'Text: {chunk.text[:200]}')
    print(f'Metadata: {chunk.metadata.to_dict()}')
    print('---')
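The loop above only prints the chunks. As a hedged sketch of the next step, the helper below turns chunk-like objects into plain records with a deterministic ID, which is the shape most embedding and indexing code expects. The to_records function, the record layout, and the FakeMetadata stand-in are hypothetical; the sketch relies only on each chunk exposing .text and .metadata.to_dict(), as the example above does.

```python
import hashlib
from types import SimpleNamespace


def to_records(chunks, source):
    """Turn chunk-like objects into plain dicts for an embedding or indexing step."""
    records = []
    for position, chunk in enumerate(chunks):
        text = chunk.text.strip()
        if not text:
            continue  # skip blank chunks rather than embedding whitespace
        # Deterministic ID: the same source, position, and text give the same ID.
        digest = hashlib.sha256(f"{source}:{position}:{text}".encode("utf-8")).hexdigest()
        records.append({
            "id": digest[:16],
            "text": text,
            "metadata": chunk.metadata.to_dict(),
        })
    return records


class FakeMetadata:
    """Stand-in for chunk metadata so the sketch runs without the library."""

    def __init__(self, **fields):
        self._fields = fields

    def to_dict(self):
        return dict(self._fields)


demo_chunks = [
    SimpleNamespace(text="Q3 Results\n\nRevenue grew.", metadata=FakeMetadata(page_number=1)),
    SimpleNamespace(text="   ", metadata=FakeMetadata(page_number=2)),
]
demo_records = to_records(demo_chunks, source="quarterly_report.pdf")
print(len(demo_records), demo_records[0]["metadata"])
```

Deterministic IDs make re-ingestion idempotent: running the pipeline again on an unchanged file overwrites the same records instead of duplicating them.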

§05

Related on TokRepo

AI tools for RAG -- RAG pipeline tools and frameworks
AI tools for documents -- Document processing and analysis tools

§06

Common pitfalls

Some document types require extra dependencies. PDF parsing needs poppler-utils, and OCR needs tesseract; install both with your system package manager before using those features. The Python package also has format-specific extras, for example pip install "unstructured[pdf]".
Auto-detection may not always choose the best parser. This applies both to the generic partition() function and to the 'auto' strategy value. For production pipelines, call the specific partition function (partition_pdf, partition_html) and set the strategy parameter explicitly.
Large documents can produce thousands of elements. Use the chunking utilities to combine small elements and split large ones before sending to your LLM or vector database.
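One way to follow the advice on explicit parsers is a small dispatch table keyed by file extension. This is a sketch under stated assumptions: the PARTITIONERS table, the pick_partitioner and partition_explicit helpers, and the choice of 'hi_res' for PDFs are illustrative, not recommendations from the library. The partition modules are imported lazily, so the selection logic can be exercised without the library installed.

```python
import importlib
from pathlib import Path

# extension -> (module path, function name, extra keyword arguments)
PARTITIONERS = {
    ".pdf": ("unstructured.partition.pdf", "partition_pdf", {"strategy": "hi_res"}),
    ".html": ("unstructured.partition.html", "partition_html", {}),
    ".htm": ("unstructured.partition.html", "partition_html", {}),
    ".docx": ("unstructured.partition.docx", "partition_docx", {}),
}


def pick_partitioner(path):
    """Look up the configured partitioner for a path, failing loudly if none exists."""
    suffix = Path(path).suffix.lower()
    try:
        return PARTITIONERS[suffix]
    except KeyError:
        raise ValueError(f"no explicit partitioner configured for {suffix!r}") from None


def partition_explicit(path):
    """Import the chosen partition function on demand and call it on the file."""
    module_name, func_name, kwargs = pick_partitioner(path)
    func = getattr(importlib.import_module(module_name), func_name)
    return func(filename=str(path), **kwargs)


print(pick_partitioner("Report.PDF"))
```

Raising on unknown extensions, instead of falling back to auto-detection, keeps parser choice visible in code review and in logs.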

FAQ

What file types does Unstructured support?

Unstructured supports PDF, DOCX, PPTX, XLSX, HTML, XML, EML (emails), MSG, RTF, TXT, CSV, TSV, images (PNG, JPG via OCR), and Markdown. The partition() function auto-detects the file type and applies the appropriate parser. Each file type has a dedicated partition function for fine-grained control.

Does Unstructured work with scanned PDFs?

Yes. Unstructured uses OCR (Tesseract) to extract text from scanned PDFs and images. Set the strategy parameter to 'ocr_only' for fully scanned documents or 'hi_res' for mixed documents with both digital text and scanned sections. Tesseract must be installed on your system.
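Because missing system tools tend to surface as confusing errors deep inside a parse, a preflight check can help. The sketch below uses only the standard library to look for the tesseract binary and for pdftoppm, a rasterizer that ships with poppler-utils. The check_ocr_tools helper and the choice of these two binaries are assumptions; your installation may depend on others.

```python
import shutil

# Binary name -> what it is needed for (an illustrative list, not exhaustive).
REQUIRED_TOOLS = {
    "tesseract": "OCR engine",
    "pdftoppm": "PDF rasterizer shipped with poppler-utils",
}


def check_ocr_tools(which=shutil.which):
    """Report which required binaries are discoverable on PATH."""
    return {name: which(name) is not None for name in REQUIRED_TOOLS}


status = check_ocr_tools()
for name, found in status.items():
    print(f"{name}: {'found' if found else 'MISSING'} ({REQUIRED_TOOLS[name]})")
```

The which parameter exists only so the lookup can be swapped out in tests; in normal use, call check_ocr_tools() with no arguments at startup.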

How does chunking work in Unstructured?

Unstructured provides chunking utilities like chunk_by_title that group elements by document structure (headings, sections). You set max_characters for chunk size limits and combine_text_under_n_chars to merge small elements. This produces chunks that are semantically coherent and sized for LLM context windows.
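To show how the two parameters interact, here is a deliberately simplified, pure-Python imitation of title-based chunking. It is not the library's algorithm: the chunk_by_heading and split_long functions are hypothetical, elements are reduced to (category, text) tuples, and oversized sections are split at fixed character offsets rather than at sensible boundaries.

```python
def split_long(text, limit):
    """Hard-split text into pieces of at most `limit` characters."""
    return [text[i:i + limit] for i in range(0, len(text), limit)]


def chunk_by_heading(elements, max_characters=1000, combine_text_under_n_chars=200):
    # 1. Start a new section at every Title element.
    sections = []
    for category, text in elements:
        if category == "Title" or not sections:
            sections.append([text])
        else:
            sections[-1].append(text)

    # 2. Merge a section into the previous chunk when that chunk is still small
    #    and the result fits; otherwise emit it, splitting if it is too long.
    chunks = []
    for parts in sections:
        section_text = "\n\n".join(parts)
        can_merge = (
            bool(chunks)
            and len(chunks[-1]) < combine_text_under_n_chars
            and len(chunks[-1]) + 2 + len(section_text) <= max_characters
        )
        if can_merge:
            chunks[-1] = chunks[-1] + "\n\n" + section_text
        else:
            chunks.extend(split_long(section_text, max_characters))
    return chunks


elements = [
    ("Title", "Intro"),
    ("NarrativeText", "Short note."),
    ("Title", "Details"),
    ("NarrativeText", "x" * 50),
    ("Title", "Appendix"),
    ("NarrativeText", "y" * 120),
]
chunks = chunk_by_heading(elements, max_characters=100, combine_text_under_n_chars=30)
print([len(c) for c in chunks])  # [79, 100, 30]
```

The tiny "Intro" section is folded into "Details" because it falls under the combine threshold, while "Appendix" exceeds the size limit and is split in two.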

Can Unstructured connect to vector databases?

Yes. Unstructured's ingest tooling provides destination connectors for Pinecone, Weaviate, Chroma, Qdrant, Elasticsearch, and others. The pipeline processes documents, chunks them, and writes directly to your vector database, giving you an end-to-end document ETL pipeline.

Is there a hosted version of Unstructured?

Yes. Unstructured offers a hosted API service that handles parsing without requiring you to manage dependencies and infrastructure. The API accepts documents via HTTP and returns structured elements. The open-source library is free for self-hosted use, while the API has usage-based pricing.


Source and acknowledgements

Created by Unstructured-IO. Licensed under Apache-2.0.

unstructured: 14,400+ GitHub stars


Related assets

TUUI — Desktop MCP Client for Local Toolchains

sp500-mcp-server — Query S&P 500 Company Data via MCP

PageIndex — Document Index for Reasoning-Based RAG

Kreuzberg — Polyglot Document Intelligence Framework with a Rust Core