Configs · Apr 7, 2026 · 2 min read

MarkItDown — Convert Any File to Markdown for LLMs

Python library by Microsoft that converts PDF, DOCX, PPTX, XLSX, images, audio, and HTML to clean Markdown. Perfect for feeding documents into LLM context windows. 8,000+ stars.

AI Open Source · Community
Quick Use

Use it first, then decide how deep to go

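Install (the [all] extra pulls in the optional converters for PDF, Office, and audio files):

pip install 'markitdown[all]'

Python usage:

from markitdown import MarkItDown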

md = MarkItDown()

# Convert any file to Markdown
result = md.convert("report.pdf")
print(result.text_content)

result = md.convert("presentation.pptx")
print(result.text_content)

result = md.convert("spreadsheet.xlsx")
print(result.text_content)

CLI usage:

markitdown report.pdf > report.md
markitdown presentation.pptx > slides.md
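
The same CLI calls can be scripted from Python. This is a minimal sketch, assuming the `markitdown` command is on your PATH and using the `-o` output flag from the project's README instead of shell redirection. The helper names `build_command` and `convert_with_cli` are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(source: str) -> tuple[list[str], Path]:
    """Return the markitdown command and the output path (same name, .md suffix)."""
    dest = Path(source).with_suffix(".md")
    return ["markitdown", source, "-o", str(dest)], dest


def convert_with_cli(source: str) -> Path:
    """Run the CLI for one file and return the path of the Markdown it wrote."""
    if shutil.which("markitdown") is None:
        raise RuntimeError("markitdown CLI not found on PATH")
    command, dest = build_command(source)
    subprocess.run(command, check=True)  # raises CalledProcessError on failure
    return dest
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.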

Intro

MarkItDown is a Python library by Microsoft, with 8,000+ GitHub stars, that converts a wide range of document formats to clean Markdown. Feed it PDFs, Word docs, PowerPoints, Excel spreadsheets, images (with OCR), audio (with transcription), and HTML, and get LLM-ready Markdown out. Unlike Docling, which focuses on layout-aware PDF parsing, MarkItDown prioritizes breadth: it handles 10+ formats with a single API. Best for developers building RAG pipelines or tools that need to ingest diverse document types. Works with: any LLM pipeline. Setup time: under 1 minute.


Supported Formats

| Format | Extension | Features |
| --- | --- | --- |
| PDF | .pdf | Text extraction, tables |
| Word | .docx | Headers, lists, tables, images |
| PowerPoint | .pptx | Slide text, speaker notes |
| Excel | .xlsx | Tables with headers |
| HTML | .html | Clean text extraction |
| Images | .jpg, .png | OCR via Azure/OpenAI Vision |
| Audio | .mp3, .wav | Metadata and speech transcription |
| CSV | .csv | Table format |
| JSON | .json | Structured text |
| XML | .xml | Text extraction |
| ZIP | .zip | Processes contained files |
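
The table above can double as a pre-flight filter, so a batch job skips files it has no converter for. This is a minimal sketch: the extension set is copied by hand from the table and is not read from MarkItDown itself. The names `SUPPORTED_EXTENSIONS`, `is_supported`, and `partition` are illustrative.

```python
from pathlib import Path

# Assumption: extensions copied from the table above; this set is illustrative
# and is not queried from MarkItDown.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".pptx", ".xlsx", ".html",
    ".jpg", ".png", ".mp3", ".wav", ".csv", ".json", ".xml", ".zip",
}


def is_supported(path: str) -> bool:
    """Return True if the file extension appears in the table (case-insensitive)."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


def partition(paths: list[str]) -> tuple[list[str], list[str]]:
    """Split paths into (convertible, skipped) while preserving input order."""
    convertible = [p for p in paths if is_supported(p)]
    skipped = [p for p in paths if not is_supported(p)]
    return convertible, skipped
```

Logging the skipped list makes it obvious which documents never reached the LLM pipeline.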

Batch Conversion

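from pathlib import Path
from markitdown import MarkItDown

md = MarkItDown()
out_dir = Path("./markdown")
out_dir.mkdir(parents=True, exist_ok=True)  # create the output folder if missing

for file in Path("./documents").glob("*.*"):
    if not file.is_file():
        continue  # skip directories whose names contain a dot
    result = md.convert(str(file))
    (out_dir / f"{file.stem}.md").write_text(result.text_content, encoding="utf-8")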
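
One caveat with the loop above: it names each output after `file.stem`, so `report.pdf` and `report.docx` would both write `report.md`. A minimal sketch of one way around that keeps the original extension in the output name and lets a failed conversion skip the file instead of stopping the batch. The helpers `output_path` and `convert_folder` and the `report.pdf.md` naming scheme are assumptions for illustration, not part of MarkItDown.

```python
from pathlib import Path


def output_path(source: Path, out_dir: Path) -> Path:
    """Map documents/report.pdf to <out_dir>/report.pdf.md (hypothetical naming scheme)."""
    return out_dir / f"{source.name}.md"


def convert_folder(src_dir: str, out_dir: str) -> list[Path]:
    """Convert every file in src_dir, skipping files that fail to convert."""
    # Imported lazily so output_path stays usable without markitdown installed.
    from markitdown import MarkItDown

    md = MarkItDown()
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for source in sorted(Path(src_dir).iterdir()):
        if not source.is_file():
            continue
        try:
            result = md.convert(str(source))
        except Exception as exc:  # broad on purpose: one bad file should not stop the batch
            print(f"skipped {source.name}: {exc}")
            continue
        dest = output_path(source, target)
        dest.write_text(result.text_content, encoding="utf-8")
        written.append(dest)
    return written
```

Calling `convert_folder("./documents", "./markdown")` returns the list of Markdown files it wrote.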

Image OCR (with LLM)

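from markitdown import MarkItDown
from openai import OpenAI
openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
md = MarkItDown(llm_client=openai_client, llm_model="gpt-4o")
result = md.convert("screenshot.png")
# Uses the vision model to describe the image and extract its text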

Audio Transcription

result = md.convert("meeting_recording.mp3")
# Transcribes speech to text (needs the optional audio dependencies) and outputs Markdown

RAG Pipeline Integration

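from markitdown import MarkItDown
# Current LangChain releases ship the splitters in the langchain-text-splitters package
from langchain_text_splitters import RecursiveCharacterTextSplitter

md = MarkItDown()
doc = md.convert("quarterly_report.pdf")
# chunk_size is measured in characters, not tokens
splitter = RecursiveCharacterTextSplitter(chunk_size=512)
chunks = splitter.split_text(doc.text_content)
# Feed chunks into your vector database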
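
To see what the chunking step does without installing LangChain, here is a dependency-free sketch. It is a simpler technique than `RecursiveCharacterTextSplitter`: a fixed-size character window with overlap, with no attempt to break on paragraph or sentence boundaries. The function name and the default `overlap` of 64 are assumptions chosen for illustration.

```python
def split_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into windows of at most chunk_size characters that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

The overlap repeats the tail of each chunk at the head of the next, which helps retrieval when an answer straddles a chunk boundary. Pass `doc.text_content` from the example above as `text`.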

MarkItDown vs Docling

| Feature | MarkItDown | Docling |
| --- | --- | --- |
| Focus | Format breadth | PDF accuracy |
| Formats | 10+ (PDF, DOCX, PPTX, audio...) | 6 (PDF, DOCX, PPTX, HTML...) |
| Table accuracy | Good | Excellent |
| Figure extraction | Basic | Advanced |
| OCR | Via LLM vision | Built-in models |
| By | Microsoft | IBM Research |

Key Stats

  • 8,000+ GitHub stars
  • By Microsoft
  • 10+ input formats
  • Image OCR and audio transcription
  • Single API for all formats

FAQ

Q: What is MarkItDown? A: A Python library by Microsoft that converts PDF, DOCX, PPTX, XLSX, images, audio, and HTML to clean Markdown for LLM consumption.

Q: Is MarkItDown free? A: Yes, open-source under MIT license.

Q: MarkItDown or Docling? A: MarkItDown for diverse formats (10+ types). Docling for high-accuracy PDF parsing with complex layouts.


Source & Thanks

Created by Microsoft. Licensed under MIT.

markitdown — stars 8,000+

Thanks to Microsoft for making document-to-Markdown universal.
