# MarkItDown: Convert Any Document to Markdown

> Microsoft's Python tool to convert Office docs, PDFs, images, audio, and HTML to clean Markdown for LLM pipelines. Also available as an MCP server.

## Quick Use

```bash
pip install 'markitdown[all]'

# Convert a file
markitdown path/to/file.pptx > output.md

# Convert from URL
markitdown https://example.com/document.pdf > output.md
```

**As MCP Server (for Claude Code, Cursor, etc.):**

```bash
pip install markitdown-mcp
```

Then merge the JSON below into your `.mcp.json`:

```json
{
  "mcpServers": {
    "markitdown": {
      "command": "markitdown-mcp"
    }
  }
}
```

**Python API:**

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("report.docx")
print(result.text_content)
```

## Intro

MarkItDown by Microsoft converts a wide range of document formats into clean Markdown, the lingua franca of LLMs. Feed it Word docs, PowerPoint decks, Excel spreadsheets, PDFs, images (with OCR/AI description), audio (with speech-to-text), HTML, CSV, JSON, XML, ZIP archives, and more. Out comes structured Markdown ready for an AI pipeline. With 93,000+ GitHub stars, it has become a widely used tool for document-to-LLM preprocessing. The MCP server variant (`markitdown-mcp`) lets AI coding agents convert documents on the fly during conversations.

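
The "merge the JSON into your `.mcp.json`" step from Quick Use can also be scripted. The following is a minimal sketch, assuming the config file uses the top-level `mcpServers` key shown in the JSON above; `merge_mcp_server` is a hypothetical helper written for illustration and is not part of MarkItDown or any MCP client.

```python
import json
from pathlib import Path


def merge_mcp_server(config_path, name, entry):
    """Add or overwrite one server entry in an .mcp.json-style file.

    Hypothetical helper: assumes the file is a JSON object whose
    servers live under the top-level "mcpServers" key.
    """
    path = Path(config_path)
    # Start from the existing config if the file exists, else from an empty object.
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    # setdefault keeps any servers that are already registered.
    config.setdefault("mcpServers", {})[name] = entry
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Calling `merge_mcp_server(".mcp.json", "markitdown", {"command": "markitdown-mcp"})` from the project root would register the server while leaving other entries untouched. Editing the file by hand achieves the same result.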
## Details

### Supported Formats

| Format | Conversion Method |
|--------|-------------------|
| **Word (.docx)** | Structure-preserving with headings, tables, lists |
| **PowerPoint (.pptx)** | Slide-by-slide with speaker notes |
| **Excel (.xlsx)** | Sheet-by-sheet as Markdown tables |
| **PDF** | Text extraction; layout is preserved on a best-effort basis |
| **Images** | EXIF metadata and OCR, plus an optional LLM-generated description |
| **Audio (.mp3/.wav)** | Speech-to-text transcription |
| **HTML** | Clean text extraction, tables preserved |
| **CSV/JSON/XML** | Structured Markdown conversion |
| **ZIP archives** | Recursive conversion of all contained files |

### LLM Integration via MCP

The `markitdown-mcp` server exposes a `convert_to_markdown` tool that AI agents can call to convert a file or URL to Markdown during a conversation. Works with Claude Code, Cursor, Windsurf, and any MCP-compatible client.

### Advanced Usage

```python
from pathlib import Path

from markitdown import MarkItDown
from openai import OpenAI

# With an LLM for image descriptions (needs the openai package and an API key)
openai_client = OpenAI()
md = MarkItDown(llm_client=openai_client, llm_model="gpt-4o")
result = md.convert("photo.jpg")
print(result.text_content)  # e.g. "A bar chart showing quarterly revenue growth..."

# Batch convert a directory, writing each .md next to its source file
for src in Path("docs").glob("*.docx"):
    result = md.convert(str(src))
    src.with_suffix(".md").write_text(result.text_content, encoding="utf-8")
```

## Frequently Asked Questions

**Q: How does it handle complex PDF layouts?**
A: MarkItDown extracts the PDF's text using layout heuristics. Scanned PDFs have no text layer, so they need an OCR-capable backend, such as the optional Azure Document Intelligence integration. Complex multi-column layouts may need manual cleanup.

**Q: Can it handle large files?**
A: Generally yes, but the cost depends on the format. Memory usage scales with the complexity of a file's contents rather than only its raw size, so very large inputs are worth testing before they go into a pipeline.

**Q: Is the MCP server stateless?**
A: Yes. Each `convert_to_markdown` call is independent. The server doesn't store files between calls.

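
**Q: How can a batch run continue past a file that fails to convert?**
A: One option is to wrap each conversion in its own `try` block and collect the failures for later review. The sketch below is illustrative and not part of MarkItDown: `convert_tree` is a hypothetical helper, and its `convert` argument is assumed to be any callable that maps a file path to Markdown text.

```python
from pathlib import Path


def convert_tree(src_dir, convert, pattern="*.docx"):
    """Convert every matching file under src_dir; return (written, failed).

    Hypothetical helper: `convert` is any callable that takes a path
    string and returns Markdown text.
    """
    written, failed = [], []
    for src in sorted(Path(src_dir).rglob(pattern)):
        try:
            text = convert(str(src))
        except Exception as exc:
            # Record the failure and move on to the next file.
            failed.append((src, exc))
            continue
        dst = src.with_suffix(".md")
        dst.write_text(text, encoding="utf-8")
        written.append(dst)
    return written, failed
```

With MarkItDown installed, the callable could be `lambda p: md.convert(p).text_content`, where `md = MarkItDown()` as in the Python API example. Keeping the converter as a parameter also makes the loop easy to test with a stub.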
## Works With

- Any LLM pipeline (Claude, GPT, Gemini, local models)
- MCP clients (Claude Code, Cursor, Windsurf, Codex)
- Python 3.10+
- Office 365, Google Docs (export first), PDF readers

## Source & Thanks

- **GitHub**: [microsoft/markitdown](https://github.com/microsoft/markitdown) (93,000+ stars, MIT License)
- **PyPI**: `markitdown` (CLI/library), `markitdown-mcp` (MCP server)
- By Microsoft

---

# MarkItDown: Turn Any Document into Markdown

## Quick Use

```bash
pip install 'markitdown[all]'
markitdown report.docx > output.md
```

## Intro

Microsoft's open-source document-to-Markdown tool. It supports Word, PPT, Excel, PDF, images (OCR), audio (speech recognition), HTML, and other formats. With 93,000+ stars, it has become the de facto standard for LLM document preprocessing. It also ships an MCP server version that AI agents can call directly.

## Source & Thanks

- **GitHub**: [microsoft/markitdown](https://github.com/microsoft/markitdown) (93,000+ stars, MIT License)

---

Source: https://tokrepo.com/en/workflows/55fe10f5-e743-40da-9da0-49fadf6e73c7
Author: MCP Hub