# MarkItDown: Convert Any File to Markdown for LLMs

> Python library by Microsoft that converts PDF, DOCX, PPTX, XLSX, images, audio, and HTML to clean Markdown. Perfect for feeding documents into LLM context windows. 8,000+ stars.

## Install

```bash
pip install markitdown
# Recent releases package some format converters as optional extras; to install them all:
# pip install 'markitdown[all]'
```

## Quick Use

```python
from markitdown import MarkItDown

md = MarkItDown()

# Convert any file to Markdown
result = md.convert("report.pdf")
print(result.text_content)

result = md.convert("presentation.pptx")
print(result.text_content)

result = md.convert("spreadsheet.xlsx")
print(result.text_content)
```

CLI usage:

```bash
markitdown report.pdf > report.md
markitdown presentation.pptx > slides.md
```

---

## Intro

MarkItDown is a Python library by Microsoft, with 8,000+ GitHub stars, that converts a wide range of document formats to clean Markdown. Feed it PDFs, Word docs, PowerPoints, Excel spreadsheets, images (with OCR), audio (with transcription), or HTML and get LLM-ready Markdown out. Unlike Docling, which focuses on layout-aware PDF parsing, MarkItDown prioritizes breadth: it handles 10+ formats with a single API.

Best for developers building RAG pipelines or tools that need to ingest diverse document types.

Works with: any LLM pipeline.
Setup time: under 1 minute.
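
Ingesting a mixed set of documents through the single `convert` call can be sketched roughly as follows. The helper name `convert_many` and its return shape are illustrative assumptions, not part of MarkItDown. The sketch relies only on `convert()` and `text_content`, as used in the Quick Use example, and takes the converter as a parameter so it does not assume any particular configuration.

```python
from typing import Dict, Iterable, Tuple


def convert_many(converter, paths: Iterable[str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Convert each path with converter.convert().

    Returns (converted, failed): Markdown text keyed by path, and
    error messages keyed by path for files that could not be converted.
    `convert_many` is a hypothetical helper, not a MarkItDown function.
    """
    converted: Dict[str, str] = {}
    failed: Dict[str, str] = {}
    for path in paths:
        key = str(path)
        try:
            result = converter.convert(key)
            converted[key] = result.text_content
        except Exception as exc:
            # One unreadable or unsupported file should not stop the batch.
            failed[key] = f"{type(exc).__name__}: {exc}"
    return converted, failed
```

With the library installed, a call might look like `converted, failed = convert_many(MarkItDown(), ["report.pdf", "presentation.pptx"])`. Collecting failures separately keeps one bad file from aborting a larger ingestion run.
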

---

## Supported Formats

| Format | Extension | Features |
|--------|-----------|----------|
| PDF | .pdf | Text extraction, tables |
| Word | .docx | Headers, lists, tables, images |
| PowerPoint | .pptx | Slide text, speaker notes |
| Excel | .xlsx | Tables with headers |
| HTML | .html | Clean text extraction |
| Images | .jpg, .png | OCR via Azure/OpenAI Vision |
| Audio | .mp3, .wav | Speech transcription |
| CSV | .csv | Table format |
| JSON | .json | Structured text |
| XML | .xml | Text extraction |
| ZIP | .zip | Processes contained files |

### Batch Conversion

```python
from pathlib import Path

from markitdown import MarkItDown

md = MarkItDown()
out_dir = Path("./markdown")
out_dir.mkdir(exist_ok=True)  # write_text fails if the folder is missing

for file in Path("./documents").glob("*.*"):
    if not file.is_file():
        continue  # skip directories whose names contain a dot
    result = md.convert(str(file))
    (out_dir / f"{file.stem}.md").write_text(result.text_content, encoding="utf-8")
```

### Image OCR (with LLM)

```python
from markitdown import MarkItDown
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
md = MarkItDown(llm_client=openai_client, llm_model="gpt-4o")
result = md.convert("screenshot.png")
# Uses the vision model to describe and extract text from images
```

### Audio Transcription

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("meeting_recording.mp3")
# Runs speech-to-text and returns the transcript as Markdown
```

### RAG Pipeline Integration

```python
from markitdown import MarkItDown
# Splitters live in the langchain-text-splitters package in current LangChain releases
from langchain_text_splitters import RecursiveCharacterTextSplitter

md = MarkItDown()
doc = md.convert("quarterly_report.pdf")
chunks = RecursiveCharacterTextSplitter(chunk_size=512).split_text(doc.text_content)
# Feed chunks into your vector database
```

### MarkItDown vs Docling

| Feature | MarkItDown | Docling |
|---------|-----------|--------|
| Focus | Format breadth | PDF accuracy |
| Formats | 10+ (PDF, DOCX, PPTX, audio...) | 6 (PDF, DOCX, PPTX, HTML...) |
| Table accuracy | Good | Excellent |
| Figure extraction | Basic | Advanced |
| OCR | Via LLM vision | Built-in models |
| By | Microsoft | IBM Research |

### Key Stats

- 8,000+ GitHub stars
- By Microsoft
- 10+ input formats
- Image OCR and audio transcription
- Single API for all formats

### FAQ

**Q: What is MarkItDown?**
A: A Python library by Microsoft that converts PDF, DOCX, PPTX, XLSX, images, audio, and HTML to clean Markdown for LLM consumption.

**Q: Is MarkItDown free?**
A: Yes, it is open source under the MIT license.

**Q: MarkItDown or Docling?**
A: MarkItDown for diverse formats (10+ types). Docling for high-accuracy PDF parsing with complex layouts.

---

## Source & Thanks

> Created by [Microsoft](https://github.com/microsoft). Licensed under MIT.
>
> [markitdown](https://github.com/microsoft/markitdown), 8,000+ stars

Thanks to Microsoft for making document-to-Markdown universal.

---

Source: https://tokrepo.com/en/workflows/6fdc90c2-bede-4d3a-98d7-faf751dfb41f
Author: AI Open Source