MCP Configs · April 2, 2026 · 1 min read

MarkItDown — Convert Any Document to Markdown

Microsoft's Python tool to convert Office docs, PDFs, images, audio, and HTML to clean Markdown for LLM pipelines. Also available as an MCP server.

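Agent ready

This asset is staged safely first. The copied instructions tell the agent to read the staged files and ask for confirmation before activating any script, MCP config, or global config.

Stage only · 17/100 · Policy: staging required
Agent entry: any MCP/CLI agent
Type: MCP config
Install: stage only
Trust level: Community
Entry point: markitdown.md
Safe staging command:
npx -y tokrepo@latest install 55fe10f5-e743-40da-9da0-49fadf6e73c7 --target codex

The files are staged first; read the staged README and install plan before activating.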

TL;DR
Microsoft's Python tool that converts many document formats into clean Markdown for LLM ingestion.
§01

What it is

MarkItDown is an open-source Python tool by Microsoft that converts a wide range of file formats into clean Markdown text. Supported inputs include Word documents, PowerPoint presentations, Excel spreadsheets, PDFs, images (with OCR), audio files (with transcription), and HTML pages. The tool also ships as an MCP server, making it directly usable by AI assistants like Claude Code.

The primary audience is developers building LLM pipelines who need to ingest structured documents without losing formatting context. Instead of writing format-specific parsers, you feed any file to MarkItDown and get consistent Markdown output.

§02

How it saves time or tokens

Raw document formats like .docx or .pptx contain XML markup, styling metadata, and binary blobs that waste tokens when passed to an LLM. MarkItDown strips the non-content data and outputs compact Markdown, keeping headings, lists, tables, and emphasis intact. The listed token estimate for this workflow is 979 tokens, a fraction of what raw document XML would consume. In MCP server mode the assistant calls the converter itself, so there is no separate manual conversion step.
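To make the saving concrete, here is a rough sketch that compares a hand-written fragment of Word-style XML with the same content as Markdown. The `estimate_tokens` helper and its four-characters-per-token ratio are assumptions for illustration only; they are not part of MarkItDown, and a real tokenizer will report different counts.

```python
# Rough illustration only: estimate_tokens is a hypothetical helper, and the
# 4-characters-per-token ratio is a rule of thumb, not a real tokenizer.

def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Approximate a token count from character length (ceiling division)."""
    return -(-len(text) // chars_per_token)

# Hand-written, simplified fragment in the style of .docx body XML.
RAW_XML = (
    '<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr>'
    '<w:r><w:rPr><w:b/></w:rPr><w:t>Quarterly Report</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>Revenue grew in Q3.</w:t></w:r></w:p>'
)

# The same heading and sentence expressed as Markdown.
MARKDOWN = "# Quarterly Report\n\nRevenue grew in Q3.\n"

xml_tokens = estimate_tokens(RAW_XML)
md_tokens = estimate_tokens(MARKDOWN)
print(f"raw XML: ~{xml_tokens} tokens, Markdown: ~{md_tokens} tokens")
```

The gap widens on real files, where styling and relationship parts add markup that carries no readable content.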

§03

How to use

  1. Install the package:
     pip install 'markitdown[all]'
  2. Convert a file from the command line:
     markitdown path/to/file.pptx > output.md
     markitdown https://example.com/document.pdf > output.md
  3. Use the Python API for programmatic conversion:
     from markitdown import MarkItDown
     md = MarkItDown()
     # convert() returns a result object; the Markdown is in text_content
     result = md.convert('report.docx')
     print(result.text_content)
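The same Python API can be wrapped in a loop to convert a whole folder. The sketch below is one possible shape, not something MarkItDown ships: `convert_folder` and its `convert` parameter are hypothetical names, and the only MarkItDown calls used are `MarkItDown()`, `convert()`, and `text_content` from the snippet above.

```python
from pathlib import Path
from typing import Callable, Optional


def convert_folder(
    src_dir: str,
    out_dir: str,
    convert: Optional[Callable[[str], str]] = None,
) -> list[Path]:
    """Convert each file directly inside src_dir and write <name>.md to out_dir.

    convert_folder and its parameters are illustrative, not MarkItDown API.
    """
    if convert is None:
        # Imported lazily so the helper can be exercised with a stub converter.
        from markitdown import MarkItDown

        md = MarkItDown()
        convert = lambda path: md.convert(path).text_content

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(Path(src_dir).iterdir()):
        if not src.is_file():
            continue  # subfolders are not descended into
        try:
            text = convert(str(src))
        except Exception as exc:
            # Unsupported or corrupt input: report it and keep going.
            print(f"skipped {src.name}: {exc}")
            continue
        # Keep the original extension so report.docx and report.pdf do not collide.
        target = out / (src.name + ".md")
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

Calling `convert_folder("docs", "markdown")` would use MarkItDown itself; passing a custom `convert` callable makes it easy to test the loop or to swap in post-processing.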
§04

Example

Using MarkItDown as an MCP server for Claude Code:

{
  "mcpServers": {
    "markitdown": {
      "command": "markitdown-mcp"
    }
  }
}

Once configured, Claude Code can call MarkItDown to convert documents inline during a conversation, extracting text from uploaded files without manual preprocessing.
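If a configuration file already lists other servers, the entry above has to be merged in rather than pasted over them. A minimal sketch of that merge follows; `add_mcp_server` is a hypothetical helper written for this page, and the file path is whatever your client reads (a project-level `.mcp.json` is assumed in the usage comment).

```python
import json
from pathlib import Path


def add_mcp_server(config_path: str, name: str, command: str) -> dict:
    """Add or update one entry under "mcpServers", keeping existing entries."""
    path = Path(config_path)
    # Start from the existing config if there is one, else from an empty object.
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    config.setdefault("mcpServers", {})[name] = {"command": command}
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


# Usage (assumes a project-level .mcp.json in the current directory):
# add_mcp_server(".mcp.json", "markitdown", "markitdown-mcp")
```

The helper rewrites the file with two-space indentation, so any custom formatting in an existing file is not preserved.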

§05

Common pitfalls

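  • Scanned or image-heavy PDFs have no text layer, so they can convert to little or no text without an OCR step; pip install 'markitdown[all]' adds the optional Python dependencies, but a system-level OCR engine such as Tesseract, if your setup relies on one, has to be installed separately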
  • Audio transcription depends on external speech-to-text APIs; configure API keys before attempting audio conversion
  • Complex table layouts in Excel with merged cells may not convert perfectly; verify output for critical data extraction tasks
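For the table pitfall, one cheap check is to confirm that every row of a converted table has as many cells as its header, since merged cells tend to show up as short or long rows. The sketch below is a hypothetical helper, not part of MarkItDown, and it assumes each table row starts with a pipe character.

```python
def ragged_table_rows(markdown: str) -> list[int]:
    """Return 1-based line numbers of table rows whose cell count differs
    from the first row of the table they belong to.

    Assumes every table row starts with "|" (an assumption about the output).
    """
    bad = []
    expected = None  # cell count of the current table's first row
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        stripped = line.strip()
        if not stripped.startswith("|"):
            expected = None  # a non-table line ends the current table
            continue
        # Escaped pipes (\|) are cell content, not separators.
        row = stripped.replace("\\|", "")
        # Drop exactly one leading pipe and, if present, one trailing pipe.
        inner = row[1:-1] if row.endswith("|") and len(row) > 1 else row[1:]
        cells = inner.split("|")
        if expected is None:
            expected = len(cells)
        elif len(cells) != expected:
            bad.append(lineno)
    return bad
```

An empty result does not prove the data is right, only that the grid is rectangular; values from merged cells still deserve a manual look for critical extractions.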

FAQ

What file formats does MarkItDown support?

MarkItDown handles Word (.docx), PowerPoint (.pptx), Excel (.xlsx), PDF, HTML, images (JPEG, PNG with OCR), and audio files (MP3, WAV with transcription). The 'markitdown[all]' install includes all optional dependencies for the full format range.

How does MarkItDown compare to raw document parsing?

Raw parsing libraries like python-docx or PyPDF give you low-level access to document internals. MarkItDown abstracts that away and outputs clean Markdown directly. You trade fine-grained control for simplicity and consistency across formats.

Can I use MarkItDown as an MCP server?

Yes. Install the MCP variant with pip install markitdown-mcp, then add it to your .mcp.json configuration. AI assistants like Claude Code can then call MarkItDown tools to convert documents inline during sessions.

Does MarkItDown preserve tables and formatting?

Yes, tables are converted to Markdown table syntax, headings maintain their hierarchy, and emphasis (bold, italic) is preserved. Complex nested structures may simplify slightly, but the semantic content remains intact.

Is MarkItDown suitable for production LLM pipelines?

MarkItDown is designed for exactly this use case. Its compact Markdown output minimizes token usage compared to raw document formats. Microsoft maintains the project actively, and it handles batch processing through its Python API.

Sources and acknowledgments

  • GitHub: microsoft/markitdown — 93,000+ stars, MIT License
  • PyPI: markitdown (CLI/library), markitdown-mcp (MCP server)
  • By Microsoft
