MCP Configs · Apr 2, 2026 · 2 min read

MarkItDown — Convert Any Document to Markdown

Microsoft's Python tool for converting Office documents, PDFs, images, audio, and HTML to clean Markdown for LLM pipelines. Also available as an MCP server.

Agent ready

Safe staging for this asset

This asset is staged first. The copied prompt asks the agent to inspect the staged files before enabling scripts, MCP config, or global config.

Stage only · 17/100 · Policy: staging
Agent surface: Any MCP/CLI agent
Type: MCP Config
Installation: Stage only
Trust: Community
Entry point: markitdown.md
Safe staging command:
npx -y tokrepo@latest install 55fe10f5-e743-40da-9da0-49fadf6e73c7 --target codex

Files are staged first; activation requires reviewing the README and the staged plan.

TL;DR
Microsoft's Python tool that converts documents of any format into clean Markdown for LLM ingestion.
§01

What it is

MarkItDown is an open-source Python tool by Microsoft that converts a wide range of file formats into clean Markdown text. Supported inputs include Word documents, PowerPoint presentations, Excel spreadsheets, PDFs, images (with OCR), audio files (with transcription), and HTML pages. The tool also ships as an MCP server, making it directly usable by AI assistants like Claude Code.

The primary audience is developers building LLM pipelines who need to ingest structured documents without losing formatting context. Instead of writing format-specific parsers, you feed any file to MarkItDown and get consistent Markdown output.

§02

How it saves time or tokens

Raw document formats like .docx or .pptx contain XML markup, styling metadata, and binary blobs that waste tokens when passed to an LLM. MarkItDown strips all non-content data and outputs compact Markdown, keeping headings, lists, tables, and emphasis intact. The token estimate for this workflow is 979 tokens, a fraction of what raw document XML would consume. The MCP server mode eliminates the need for manual conversion steps entirely.
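
To make the token argument concrete, here is a rough sketch that compares a hand-written WordprocessingML fragment with its Markdown equivalent. Both sample strings and the four-characters-per-token heuristic are illustrative assumptions; they are not output from MarkItDown or from a real tokenizer, and they say nothing about the 979-token figure quoted above.

```python
# Illustrative comparison only: the XML fragment and the chars/4 heuristic are
# assumptions for demonstration, not MarkItDown output or real tokenizer counts.

RAW_DOCX_XML = (
    '<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr>'
    '<w:r><w:rPr><w:b/></w:rPr><w:t>Quarterly Report</w:t></w:r></w:p>'
    '<w:p><w:r><w:t xml:space="preserve">Revenue grew </w:t></w:r>'
    '<w:r><w:rPr><w:b/></w:rPr><w:t>12%</w:t></w:r></w:p>'
)

MARKDOWN = "# Quarterly Report\n\nRevenue grew **12%**\n"


def rough_token_estimate(text: str) -> int:
    """Approximate a token count with a crude 4-characters-per-token rule."""
    return max(1, len(text) // 4)


xml_tokens = rough_token_estimate(RAW_DOCX_XML)
md_tokens = rough_token_estimate(MARKDOWN)
print(f"raw XML: ~{xml_tokens} tokens, Markdown: ~{md_tokens} tokens")
```

The same heading and sentence survive in both forms; only the markup overhead differs, which is the saving the paragraph describes.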

§03

How to use

  1. Install the package:
pip install 'markitdown[all]'
  2. Convert a file from the command line:
markitdown path/to/file.pptx > output.md
markitdown https://example.com/document.pdf > output.md
  3. Use the Python API for programmatic conversion:
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert('report.docx')
print(result.text_content)
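
The Python API step can be wrapped in a small helper that writes the Markdown next to the source file. This is a minimal sketch: the function name `convert_to_markdown_file` and the optional `converter` parameter are hypothetical additions for illustration; only `MarkItDown().convert(...)` and `.text_content` come from the steps above.

```python
from pathlib import Path


def convert_to_markdown_file(source, converter=None):
    """Convert `source` and write the Markdown beside it as <name>.md.

    `converter` is any object with a .convert(path) method whose result has a
    .text_content attribute. When omitted, a MarkItDown instance is created.
    """
    if converter is None:
        # Imported lazily so the helper can be exercised without the package.
        from markitdown import MarkItDown

        converter = MarkItDown()

    source = Path(source)
    result = converter.convert(str(source))
    target = source.with_suffix(".md")
    target.write_text(result.text_content, encoding="utf-8")
    return target
```

Calling `convert_to_markdown_file("report.docx")` would then write `report.md` in the same folder, assuming the package is installed.
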
§04

Example

Using MarkItDown as an MCP server for Claude Code:

{
  "mcpServers": {
    "markitdown": {
      "command": "markitdown-mcp"
    }
  }
}

Once configured, Claude Code can call MarkItDown to convert documents inline during a conversation, extracting text from uploaded files without manual preprocessing.
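
If a config file already lists other servers, the entry above has to be merged in rather than pasted over the file. The sketch below shows one way to do that with the standard library; the helper name `add_mcp_server` is hypothetical, and the path you pass in (for example a project-level `.mcp.json`) depends on your client, so treat it as an assumption.

```python
import json
from pathlib import Path


def add_mcp_server(config_path, name="markitdown", command="markitdown-mcp"):
    """Merge one server entry into an MCP JSON config, keeping other entries."""
    path = Path(config_path)
    # Start from the existing config if there is one, else from an empty object.
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    servers = config.setdefault("mcpServers", {})
    servers[name] = {"command": command}
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Existing entries under `mcpServers` are left untouched; only the `markitdown` key is added or overwritten.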

§06

Common pitfalls

  • Image-heavy PDFs require OCR dependencies (Tesseract); install them separately or use pip install 'markitdown[all]' to include all optional features
  • Audio transcription depends on external speech-to-text APIs; configure API keys before attempting audio conversion
  • Complex table layouts in Excel with merged cells may not convert perfectly; verify output for critical data extraction tasks
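
For the table pitfall, one cheap verification is to flag rows whose cell count differs from their table's header. The checker below is a sketch under assumptions: it expects pipe-delimited rows with a leading and trailing `|`, ignores escaped pipes inside cells, and the function name `find_ragged_table_rows` is hypothetical.

```python
def find_ragged_table_rows(markdown: str):
    """Return (line_number, cell_count) for table rows whose cell count
    differs from the first row of the same table."""
    problems = []
    expected = None  # cell count of the current table's first row, if inside one
    for number, line in enumerate(markdown.splitlines(), start=1):
        stripped = line.strip()
        is_row = len(stripped) >= 2 and stripped.startswith("|") and stripped.endswith("|")
        if not is_row:
            expected = None  # any non-table line ends the current table
            continue
        # Drop only the outer pipes so empty edge cells are still counted.
        cells = stripped[1:-1].split("|")
        if expected is None:
            expected = len(cells)
        elif len(cells) != expected:
            problems.append((number, len(cells)))
    return problems
```

An empty result does not prove the data is right, but a non-empty one is a quick signal that merged cells were flattened unevenly.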

Frequently asked questions

What file formats does MarkItDown support?

MarkItDown handles Word (.docx), PowerPoint (.pptx), Excel (.xlsx), PDF, HTML, images (JPEG, PNG with OCR), and audio files (MP3, WAV with transcription). The 'markitdown[all]' install includes all optional dependencies for the full format range.

How does MarkItDown compare to raw document parsing?

Raw parsing libraries like python-docx or PyPDF give you low-level access to document internals. MarkItDown abstracts that away and outputs clean Markdown directly. You trade fine-grained control for simplicity and consistency across formats.

Can I use MarkItDown as an MCP server?

Yes. Install the MCP variant with pip install markitdown-mcp, then add it to your .mcp.json configuration. AI assistants like Claude Code can then call MarkItDown tools to convert documents inline during sessions.

Does MarkItDown preserve tables and formatting?

Yes, tables are converted to Markdown table syntax, headings maintain their hierarchy, and emphasis (bold, italic) is preserved. Complex nested structures may simplify slightly, but the semantic content remains intact.

Is MarkItDown suitable for production LLM pipelines?

MarkItDown is designed for exactly this use case. Its compact Markdown output minimizes token usage compared to raw document formats. Microsoft maintains the project actively, and it handles batch processing through its Python API.
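
Batch processing through the Python API amounts to looping over files and isolating failures so one bad document does not stop the run. The sketch below assumes a converter object with the `.convert(path)` / `.text_content` interface shown in the usage section; the function name `convert_folder` and the default suffix list are hypothetical choices.

```python
from pathlib import Path


def convert_folder(folder, converter, suffixes=(".docx", ".pptx", ".xlsx", ".pdf")):
    """Convert every matching file under `folder`.

    Returns (outputs, failures), both keyed by path relative to `folder`.
    """
    root = Path(folder)
    outputs, failures = {}, {}
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        key = str(path.relative_to(root))
        try:
            outputs[key] = converter.convert(str(path)).text_content
        except Exception as exc:  # record the error and continue with the batch
            failures[key] = str(exc)
    return outputs, failures
```

With the package installed, `convert_folder("docs", MarkItDown())` would return the Markdown for each supported file plus a map of files that failed.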

Source and acknowledgments

  • GitHub: microsoft/markitdown — 93,000+ stars, MIT License
  • PyPI: markitdown (CLI/library), markitdown-mcp (MCP server)
  • By Microsoft
