MCP Configs · Apr 2, 2026 · 2 min read

MarkItDown — Convert Any Document to Markdown

Microsoft's Python tool for converting Office docs, PDFs, images, audio, and HTML to clean Markdown for LLM pipelines. Also available as an MCP server.

TL;DR
A Microsoft Python tool that converts many common document formats into clean Markdown for LLM ingestion.
§01

What it is

MarkItDown is an open-source Python tool by Microsoft that converts a wide range of file formats into clean Markdown text. Supported inputs include Word documents, PowerPoint presentations, Excel spreadsheets, PDFs, images (with OCR), audio files (with transcription), and HTML pages. The tool also ships as an MCP server, making it directly usable by AI assistants like Claude Code.

The primary audience is developers building LLM pipelines who need to ingest structured documents without losing formatting context. Instead of writing format-specific parsers, you feed any file to MarkItDown and get consistent Markdown output.

§02

How it saves time or tokens

Raw document formats like .docx or .pptx contain XML markup, styling metadata, and binary blobs that waste tokens when passed to an LLM. MarkItDown strips all non-content data and outputs compact Markdown, keeping headings, lists, tables, and emphasis intact. The token estimate for this workflow is 979 tokens, a fraction of what raw document XML would consume. The MCP server mode eliminates the need for manual conversion steps entirely.
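
To check the savings on your own files, the sketch below compares the raw XML inside an Office archive with the Markdown produced from it. It is only a rough estimate under stated assumptions: the four-characters-per-token ratio is a crude average rather than a real tokenizer, and the helper names (`rough_tokens`, `raw_xml_text`, `compare`) are hypothetical, not part of MarkItDown.

```python
import zipfile

CHARS_PER_TOKEN = 4  # assumption: crude average, not a real tokenizer


def rough_tokens(text: str) -> int:
    """Estimate a token count from character length (ceiling division)."""
    return -(-len(text) // CHARS_PER_TOKEN)


def raw_xml_text(office_path: str) -> str:
    """Concatenate every XML part inside a .docx/.pptx/.xlsx archive."""
    parts = []
    with zipfile.ZipFile(office_path) as archive:
        for name in sorted(archive.namelist()):
            if name.endswith(".xml"):
                parts.append(archive.read(name).decode("utf-8", errors="replace"))
    return "".join(parts)


def compare(office_path: str, markdown_text: str) -> dict:
    """Return rough token estimates for the raw XML and for the Markdown."""
    return {
        "raw_xml": rough_tokens(raw_xml_text(office_path)),
        "markdown": rough_tokens(markdown_text),
    }
```

Pass the `text_content` string returned by a MarkItDown conversion as `markdown_text` to see both estimates side by side.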

§03

How to use

  1. Install the package:
pip install 'markitdown[all]'
  2. Convert a file from the command line:
markitdown path/to/file.pptx > output.md
markitdown https://example.com/document.pdf > output.md
  3. Use the Python API for programmatic conversion:
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert('report.docx')
print(result.text_content)
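
Building on the Python API in step 3, here is a minimal sketch of converting a whole folder. `convert_tree` and `markitdown_converter` are hypothetical helper names, the suffix list is an illustrative assumption, and the only MarkItDown calls used are `MarkItDown().convert(...)` and `.text_content` from the snippet above. The converter is passed in as a plain function so that a single failing file does not stop the run.

```python
from pathlib import Path
from typing import Callable, Iterable


def convert_tree(
    src_dir: str,
    out_dir: str,
    convert: Callable[[str], str],
    suffixes: Iterable[str] = (".docx", ".pptx", ".xlsx", ".pdf", ".html"),
) -> dict:
    """Convert matching files under src_dir; return {source path: error or None}."""
    wanted = {s.lower() for s in suffixes}
    out_root = Path(out_dir)
    report = {}
    for path in sorted(Path(src_dir).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in wanted:
            continue
        # Append ".md" to the full name so report.docx and report.pdf don't collide.
        target = out_root / (str(path.relative_to(src_dir)) + ".md")
        try:
            text = convert(str(path))
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(text, encoding="utf-8")
            report[str(path)] = None
        except Exception as exc:  # record the failure and keep going
            report[str(path)] = repr(exc)
    return report


def markitdown_converter() -> Callable[[str], str]:
    """Wrap the API from step 3; imported lazily so the helper above stays generic."""
    from markitdown import MarkItDown

    md = MarkItDown()
    return lambda path: md.convert(path).text_content
```

A call such as `convert_tree("docs", "markdown", markitdown_converter())` would mirror the `docs` folder into `markdown`, with both folder names being placeholders.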
§04

Example

Using MarkItDown as an MCP server for Claude Code:

{
  "mcpServers": {
    "markitdown": {
      "command": "markitdown-mcp"
    }
  }
}

Once configured, Claude Code can call MarkItDown to convert documents inline during a conversation, extracting text from uploaded files without manual preprocessing.
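
If you already have other servers configured, the entry can be merged instead of pasted by hand. The following is a sketch only: it assumes the config lives in a `.mcp.json` file with the `mcpServers` layout shown above, and `add_markitdown_server` is a hypothetical helper name.

```python
import json
from pathlib import Path


def add_markitdown_server(config_path: str = ".mcp.json") -> dict:
    """Add the markitdown entry to an MCP config, keeping existing servers."""
    path = Path(config_path)
    raw = path.read_text(encoding="utf-8") if path.exists() else ""
    config = json.loads(raw) if raw.strip() else {}
    servers = config.setdefault("mcpServers", {})
    servers["markitdown"] = {"command": "markitdown-mcp"}
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Running it twice leaves the file unchanged after the first call, since the entry is overwritten with the same value.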

§05

Common pitfalls

  • Image-heavy or scanned PDFs need OCR support to yield text; pip install 'markitdown[all]' adds the optional Python dependencies, but a system-level OCR engine such as Tesseract, if your setup relies on one, has to be installed separately
  • Audio transcription depends on external speech-to-text APIs; configure API keys before attempting audio conversion
  • Complex table layouts in Excel with merged cells may not convert perfectly; verify output for critical data extraction tasks
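
For the table pitfall, a quick sanity check on the converted Markdown can flag rows worth a manual look. This is a minimal sketch under assumptions: it expects pipe-delimited rows that start with `|`, treats empty or `NaN` cells as a possible trace of merged cells, and does not handle escaped pipes inside cells. `suspect_table_rows` is a hypothetical helper name.

```python
PLACEHOLDERS = {"", "nan"}  # assumption: how blank or merged cells may surface


def suspect_table_rows(markdown: str) -> list[int]:
    """Return 1-based line numbers of table rows that look incomplete."""
    flagged = []
    expected = None  # cell count of the current table's first row
    for number, line in enumerate(markdown.splitlines(), start=1):
        row = line.strip()
        if not row.startswith("|"):
            expected = None  # left the table; the next table sets a new count
            continue
        inner = row[1:-1] if len(row) > 1 and row.endswith("|") else row[1:]
        cells = [cell.strip().lower() for cell in inner.split("|")]
        if expected is None:
            expected = len(cells)
        if len(cells) != expected or any(c in PLACEHOLDERS for c in cells):
            flagged.append(number)
    return flagged
```

An empty result does not prove the table is correct; it only means no row had a mismatched cell count or a placeholder value.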

Frequently Asked Questions

What file formats does MarkItDown support?

MarkItDown handles Word (.docx), PowerPoint (.pptx), Excel (.xlsx), PDF, HTML, images (JPEG, PNG with OCR), and audio files (MP3, WAV with transcription). The 'markitdown[all]' install includes all optional dependencies for the full format range.

How does MarkItDown compare to raw document parsing?

Raw parsing libraries like python-docx or PyPDF give you low-level access to document internals. MarkItDown abstracts that away and outputs clean Markdown directly. You trade fine-grained control for simplicity and consistency across formats.

Can I use MarkItDown as an MCP server?

Yes. Install the MCP variant with pip install markitdown-mcp, then add it to your .mcp.json configuration. AI assistants like Claude Code can then call MarkItDown tools to convert documents inline during sessions.

Does MarkItDown preserve tables and formatting?

Yes, tables are converted to Markdown table syntax, headings maintain their hierarchy, and emphasis (bold, italic) is preserved. Complex nested structures may simplify slightly, but the semantic content remains intact.

Is MarkItDown suitable for production LLM pipelines?

MarkItDown is designed for exactly this use case. Its compact Markdown output uses fewer tokens than raw document formats. Microsoft maintains the project actively, and you can batch-process files by calling the Python API in a loop.


Source & Thanks

  • GitHub: microsoft/markitdown — 93,000+ stars, MIT License
  • PyPI: markitdown (CLI/library), markitdown-mcp (MCP server)
  • By Microsoft
