MarkItDown — Convert Any Document to Markdown
Microsoft's Python tool to convert Office docs, PDFs, images, audio, and HTML to clean Markdown for LLM pipelines. Also available as an MCP server.
What it is
MarkItDown is an open-source Python tool by Microsoft that converts a wide range of file formats into clean Markdown text. Supported inputs include Word documents, PowerPoint presentations, Excel spreadsheets, PDFs, images (with OCR), audio files (with transcription), and HTML pages. The tool also ships as an MCP server, making it directly usable by AI assistants like Claude Code.
The primary audience is developers building LLM pipelines who need to ingest structured documents without losing formatting context. Instead of writing format-specific parsers, you feed any file to MarkItDown and get consistent Markdown output.
How it saves time or tokens
Raw document formats like .docx or .pptx contain XML markup, styling metadata, and binary blobs that waste tokens when passed to an LLM. MarkItDown strips all non-content data and outputs compact Markdown, keeping headings, lists, tables, and emphasis intact. The token estimate for this workflow is 979 tokens, a fraction of what raw document XML would consume. The MCP server mode eliminates the need for manual conversion steps entirely.
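A rough way to see the difference is to compare a raw WordprocessingML fragment with the same content written as Markdown. The sketch below is illustrative only: the XML fragment is hand-written rather than taken from a real file, and the four-characters-per-token rule is an assumed heuristic, not a real tokenizer.

```python
# Hand-written WordprocessingML fragment for a single heading (illustrative only).
RAW_XML = (
    '<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr>'
    '<w:r><w:t>Quarterly Report</w:t></w:r></w:p>'
)

# The same heading expressed as Markdown.
MARKDOWN = "# Quarterly Report"


def rough_token_count(text: str) -> int:
    """Crude estimate: assume roughly 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


if __name__ == "__main__":
    print("raw XML :", rough_token_count(RAW_XML), "tokens (approx.)")
    print("Markdown:", rough_token_count(MARKDOWN), "tokens (approx.)")
```

For real measurements, swap the heuristic for the tokenizer of the model you are targeting.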
How to use
- Install the package:
pip install 'markitdown[all]'
- Convert a file from the command line:
markitdown path/to/file.pptx > output.md
markitdown https://example.com/document.pdf > output.md
- Use the Python API for programmatic conversion:
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert('report.docx')
print(result.text_content)
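The same API call can be wrapped to convert a whole folder. The following is a minimal sketch, not part of MarkItDown itself: `convert_tree`, `markitdown_converter`, and the default suffix list are hypothetical names chosen for illustration, and the only MarkItDown calls used are `MarkItDown()`, `convert()`, and `text_content` as shown above.

```python
from pathlib import Path
from typing import Callable, Iterable

# Extensions to pick up; adjust to the formats you actually need.
DEFAULT_SUFFIXES = (".docx", ".pptx", ".xlsx", ".pdf", ".html")


def convert_tree(
    src_dir: Path,
    out_dir: Path,
    convert: Callable[[Path], str],
    suffixes: Iterable[str] = DEFAULT_SUFFIXES,
) -> list[Path]:
    """Convert every matching file under src_dir and write Markdown to out_dir.

    `convert` is any callable that maps a file path to Markdown text.
    Returns the list of Markdown files written.
    """
    wanted = {s.lower() for s in suffixes}
    written = []
    for path in sorted(src_dir.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in wanted:
            continue
        rel = path.relative_to(src_dir)
        # Keep the original extension in the name (report.docx.md) so that
        # report.docx and report.pdf do not overwrite each other.
        target = out_dir / rel.parent / (rel.name + ".md")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(convert(path), encoding="utf-8")
        written.append(target)
    return written


def markitdown_converter() -> Callable[[Path], str]:
    """Build a converter backed by MarkItDown (imported lazily)."""
    from markitdown import MarkItDown

    md = MarkItDown()
    return lambda path: md.convert(str(path)).text_content


# Example call (requires markitdown to be installed):
# convert_tree(Path("docs"), Path("markdown"), markitdown_converter())
```

Passing the converter in as a callable keeps the directory walk independent of MarkItDown, so it can be exercised with a stub before running on real documents.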
Example
Using MarkItDown as an MCP server for Claude Code:
{
"mcpServers": {
"markitdown": {
"command": "markitdown-mcp"
}
}
}
Once configured, Claude Code can call MarkItDown to convert documents inline during a conversation, extracting text from uploaded files without manual preprocessing.
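If a configuration file already lists other servers, the entry can be merged in rather than pasted by hand. This is a minimal sketch under stated assumptions: `add_mcp_server` is a hypothetical helper, and the file is assumed to be plain JSON with the `mcpServers` layout shown above.

```python
import json
from pathlib import Path


def add_mcp_server(config_path: Path, name: str, command: str) -> dict:
    """Add or replace one entry under "mcpServers", keeping other entries."""
    config = {}
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
    servers = config.setdefault("mcpServers", {})
    # Note: an existing entry with the same name is overwritten.
    servers[name] = {"command": command}
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


# Example call, using the server name and command from the config above:
# add_mcp_server(Path(".mcp.json"), "markitdown", "markitdown-mcp")
```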
Related on TokRepo
- AI Tools for Documents — more tools for document processing and conversion in AI workflows
- MCP Integrations — browse MCP servers that extend AI assistant capabilities
Common pitfalls
- Image-heavy PDFs require OCR dependencies (Tesseract); install them separately or use pip install 'markitdown[all]' to include all optional features
- Audio transcription depends on external speech-to-text APIs; configure API keys before attempting audio conversion
- Complex table layouts in Excel with merged cells may not convert perfectly; verify output for critical data extraction tasks
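One way to verify table output is to flag rows whose cell count differs from their header row, which is a common symptom of merged cells. The sketch below is a heuristic, not a Markdown parser: `ragged_table_rows` is a hypothetical helper, and it assumes table rows start with a pipe character and contain no escaped pipes.

```python
def ragged_table_rows(markdown: str) -> list[tuple[int, str]]:
    """Return (line_number, line) for table rows whose cell count differs
    from the first row of the same table. Line numbers are 1-based."""
    problems = []
    expected = None  # cell count of the current table's first row, if in a table
    for number, line in enumerate(markdown.splitlines(), start=1):
        stripped = line.strip()
        if not stripped.startswith("|"):
            expected = None  # any non-table line ends the current table
            continue
        # Remove exactly one leading and one trailing pipe, then count cells.
        inner = stripped[1:]
        if inner.endswith("|"):
            inner = inner[:-1]
        cells = inner.split("|")
        if expected is None:
            expected = len(cells)
        elif len(cells) != expected:
            problems.append((number, line))
    return problems
```

An empty result does not prove the table is correct; it only means every row has the same number of cells as its header.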
Frequently Asked Questions
What file formats does MarkItDown support?
MarkItDown handles Word (.docx), PowerPoint (.pptx), Excel (.xlsx), PDF, HTML, images (JPEG, PNG with OCR), and audio files (MP3, WAV with transcription). The 'markitdown[all]' install includes all optional dependencies for the full format range.
How does MarkItDown differ from raw parsing libraries?
Raw parsing libraries like python-docx or PyPDF give you low-level access to document internals. MarkItDown abstracts that away and outputs clean Markdown directly. You trade fine-grained control for simplicity and consistency across formats.
Can MarkItDown be used as an MCP server?
Yes. Install the MCP variant with pip install markitdown-mcp, then add it to your .mcp.json configuration. AI assistants like Claude Code can then call MarkItDown tools to convert documents inline during sessions.
Does MarkItDown preserve document structure?
Yes, tables are converted to Markdown table syntax, headings maintain their hierarchy, and emphasis (bold, italic) is preserved. Complex nested structures may simplify slightly, but the semantic content remains intact.
Is MarkItDown suitable for LLM pipelines?
MarkItDown is designed for exactly this use case. Its compact Markdown output minimizes token usage compared to raw document formats. Microsoft maintains the project actively, and it handles batch processing through its Python API.
Source & Thanks
- GitHub: microsoft/markitdown — 93,000+ stars, MIT License
- PyPI: markitdown (CLI/library), markitdown-mcp (MCP server)
- By Microsoft