Supported Formats
| Format | Extension | Features |
|---|---|---|
| PDF | .pdf | Text extraction, tables |
| Word | .docx | Headers, lists, tables, images |
| PowerPoint | .pptx | Slide text, speaker notes |
| Excel | .xlsx | Tables with headers |
| HTML | .html | Clean text extraction |
| Images | .jpg, .png | OCR via Azure/OpenAI Vision |
| Audio | .mp3, .wav | Transcription via Whisper |
| CSV | .csv | Table format |
| JSON | .json | Structured text |
| XML | .xml | Text extraction |
| ZIP | .zip | Processes contained files |
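To make the table actionable, one option is to filter a folder down to the listed extensions before converting anything. The helper below is a sketch, not part of MarkItDown: `SUPPORTED_EXTENSIONS`, `is_supported`, and `filter_supported` are hypothetical names, and the set simply mirrors the formats this page lists rather than querying the library.

```python
from pathlib import Path

# Hypothetical constant: mirrors the extensions this page lists,
# it is not read from the MarkItDown library itself.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".pptx", ".xlsx", ".html", ".jpg", ".png",
    ".mp3", ".wav", ".csv", ".json", ".xml", ".zip",
}


def is_supported(path):
    """Return True if the file's extension is in the set (case-insensitive)."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


def filter_supported(paths):
    """Keep only the paths with a listed extension, preserving input order."""
    return [p for p in paths if is_supported(p)]
```

Filtering first avoids handing the converter files it was never meant to read, such as stray `.txt` notes or executables in the same folder.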
Batch Conversion
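from pathlib import Path
from markitdown import MarkItDown

md = MarkItDown()
out_dir = Path("./markdown")
out_dir.mkdir(exist_ok=True)  # write_text fails if the output folder is missing
for file in Path("./documents").glob("*.*"):
    result = md.convert(str(file))
    (out_dir / f"{file.stem}.md").write_text(result.text_content, encoding="utf-8")
Image OCR (with LLM)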
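from markitdown import MarkItDown
from openai import OpenAI

openai_client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
md = MarkItDown(llm_client=openai_client, llm_model="gpt-4o")
result = md.convert("screenshot.png")
# Uses the vision model to describe the image and extract its text
Audio Transcription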
result = md.convert("meeting_recording.mp3")
# Uses Whisper for speech-to-text and outputs the transcript as Markdown
RAG Pipeline Integration
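As a rough illustration of what the chunking step in this section does, here is a dependency-free sketch. It is a simplification and not LangChain's algorithm: `RecursiveCharacterTextSplitter` prefers to split on separators such as paragraph breaks, while the hypothetical `chunk_text` below only slides a fixed-size character window with a configurable overlap.

```python
def chunk_text(text, chunk_size=512, overlap=64):
    """Split text into character windows of at most chunk_size characters.

    Consecutive windows share `overlap` characters so that a sentence cut
    at a boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

With MarkItDown, `text` would be the `text_content` of a conversion result. A separator-aware splitter is usually the better choice for Markdown, since it tends to keep headings and tables intact.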
from markitdown import MarkItDown
from langchain.text_splitter import RecursiveCharacterTextSplitter
md = MarkItDown()
doc = md.convert("quarterly_report.pdf")
chunks = RecursiveCharacterTextSplitter(chunk_size=512).split_text(doc.text_content)
# Feed chunks into your vector database
MarkItDown vs Docling
| Feature | MarkItDown | Docling |
|---|---|---|
| Focus | Format breadth | PDF accuracy |
| Formats | 10+ (PDF, DOCX, PPTX, audio...) | 6 (PDF, DOCX, PPTX, HTML...) |
| Table accuracy | Good | Excellent |
| Figure extraction | Basic | Advanced |
| OCR | Via LLM vision | Built-in models |
| By | Microsoft | IBM Research |
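When both tools are installed, the comparison above suggests a simple routing rule. This is a minimal sketch, assuming PDFs should go to Docling and everything else to MarkItDown. `pick_converter` is a hypothetical helper that only returns a label and calls neither library.

```python
from pathlib import Path


def pick_converter(path):
    """Return the tool this page's comparison would favor for a given file."""
    # Assumption: PDFs benefit most from Docling's layout-aware parsing.
    if Path(path).suffix.lower() == ".pdf":
        return "docling"
    # Everything else goes to MarkItDown for its broader format coverage.
    return "markitdown"
```

The returned label can then key into whatever conversion functions your project defines for each tool.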
Key Stats
- 8,000+ GitHub stars
- By Microsoft
- 10+ input formats
- Image OCR and audio transcription
- Single API for all formats
FAQ
Q: What is MarkItDown? A: A Python library by Microsoft that converts PDF, DOCX, PPTX, XLSX, images, audio, and HTML to clean Markdown for LLM consumption.
Q: Is MarkItDown free? A: Yes, it is open source under the MIT license.
Q: MarkItDown or Docling? A: Use MarkItDown when you need broad format coverage (10+ types). Use Docling for high-accuracy PDF parsing with complex layouts.