Configs · Apr 7, 2026 · 2 min read

MarkItDown — Convert Any File to Markdown for LLMs

Python library by Microsoft that converts PDF, DOCX, PPTX, XLSX, images, audio, and HTML to clean Markdown. Perfect for feeding documents into LLM context windows. 8,000+ stars.

TL;DR
MarkItDown converts PDF, DOCX, PPTX, images, and audio to clean Markdown for feeding into LLM context windows.
§01

What it is

MarkItDown is a Python library by Microsoft that converts a wide range of file formats into clean Markdown text. It handles PDF, DOCX, PPTX, XLSX, images (via OCR), audio (via transcription), and HTML. The output is structured Markdown suitable for feeding into LLM context windows.

It targets developers building RAG pipelines, document processing systems, or any application that needs to ingest diverse file formats into an LLM-friendly text representation.

§02

How it saves time or tokens

MarkItDown produces clean, structured Markdown without the noise of raw extraction tools. Tables stay as Markdown tables, headings preserve hierarchy, and images get OCR text. This means fewer tokens wasted on formatting artifacts. Estimated token usage is around 2,400 tokens.

§03

How to use

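  1. Install MarkItDown. Recent releases ship the PDF, DOCX, PPTX, and XLSX converters as optional extras, so install them together with the core package:
pip install 'markitdown[all]'
  2. Convert any file:
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert('report.pdf')
print(result.text_content)
  3. Use the output as LLM context or store it in a vector database.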
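For the last step, one possible way to pass converted Markdown to a model is to wrap it in a prompt with a simple size guard. This is a minimal sketch, not part of MarkItDown: the function name build_prompt, the max_context_chars parameter, and the prompt wording are all hypothetical choices made for illustration.

```python
def build_prompt(markdown: str, question: str, max_context_chars: int = 8000) -> str:
    """Wrap converted Markdown and a question into a single prompt string.

    Hypothetical helper: the character limit is a crude stand-in for a real
    token budget, which depends on the model and its tokenizer.
    """
    context = markdown.strip()
    if len(context) > max_context_chars:
        # Cut at the limit and tell the model that the document is incomplete.
        context = context[:max_context_chars].rstrip() + "\n\n[document truncated]"
    return (
        "Answer the question using only the document below.\n\n"
        "<document>\n"
        f"{context}\n"
        "</document>\n\n"
        f"Question: {question}"
    )


# Illustrative input; in practice the first argument would be result.text_content.
sample = "# Q3 Report\n\nRevenue grew 12% quarter over quarter."
print(build_prompt(sample, "How much did revenue grow?"))
```

Truncating by characters is the simplest guard; splitting the document into chunks is usually a better fit for long files.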
§04

Example

from markitdown import MarkItDown

md = MarkItDown()

# Convert PDF
pdf_result = md.convert('quarterly-report.pdf')
print(pdf_result.text_content[:500])

# Convert DOCX
doc_result = md.convert('proposal.docx')
print(doc_result.text_content[:500])

# Convert PPTX
slides = md.convert('deck.pptx')
print(slides.text_content[:500])
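When converting many files at once, it can help to keep one failing file from stopping the whole batch. The sketch below assumes only what the example above shows: a convert callable that takes a path and returns an object with a text_content attribute. The helper name convert_many and its return shape are hypothetical.

```python
from typing import Callable, Dict, Iterable, Tuple


def convert_many(
    paths: Iterable[str],
    convert: Callable[[str], object],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Convert each path and return (markdown_by_path, error_by_path)."""
    converted: Dict[str, str] = {}
    failed: Dict[str, str] = {}
    for path in paths:
        try:
            result = convert(path)
            converted[path] = result.text_content
        except Exception as exc:
            # Record the failure and continue with the remaining files.
            failed[path] = f"{type(exc).__name__}: {exc}"
    return converted, failed


# Possible usage with MarkItDown (requires the package to be installed):
#   from markitdown import MarkItDown
#   docs, errors = convert_many(['quarterly-report.pdf', 'deck.pptx'], MarkItDown().convert)
```

Passing the converter in as an argument keeps the helper independent of MarkItDown, so it can be exercised with a stub in tests.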
§05

Key considerations

When evaluating MarkItDown for your workflow, consider the following factors. First, assess whether your team has the technical prerequisites to adopt this tool effectively. Second, evaluate the maintenance burden against the productivity gains. Third, check community activity and documentation quality to ensure long-term viability. Integration with your existing toolchain matters more than feature count alone. Start with a small pilot project before rolling out across the organization. Monitor resource usage during the initial adoption phase to identify bottlenecks early. Document your configuration decisions so team members can onboard independently.

§06

Common pitfalls

  • Scanned PDFs without embedded text require OCR; install optional OCR dependencies for these files.
  • Audio transcription requires additional dependencies (speech recognition libraries); the base install handles text-based formats only.
  • Very large files produce extensive Markdown that may exceed LLM context limits; chunk the output before feeding it to a model.
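The chunking advice in the last bullet can be sketched as follows. This is one simple strategy under stated assumptions, not a MarkItDown feature: the function name chunk_markdown and the max_chars limit are hypothetical, the limit is measured in characters rather than tokens, and any line starting with "# " is treated as a heading, which would misfire on comment lines inside code listings.

```python
import re
from typing import List


def chunk_markdown(markdown: str, max_chars: int = 2000) -> List[str]:
    """Split Markdown at headings, then pack sections into chunks of at most max_chars."""
    # Split before every line that starts with 1 to 6 '#' characters and a space.
    sections = [s.strip() for s in re.split(r"(?m)^(?=#{1,6} )", markdown) if s.strip()]

    chunks: List[str] = []
    current = ""
    for section in sections:
        # Hard-split any single section that exceeds the limit on its own.
        pieces = [section[i:i + max_chars] for i in range(0, len(section), max_chars)]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


doc = "# A\nalpha\n\n## B\nbeta\n\n# C\ngamma"
for chunk in chunk_markdown(doc, max_chars=20):
    print(repr(chunk))
```

Splitting at headings keeps each chunk on one topic where possible; the hard split is only a fallback for sections that are too long on their own.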

Frequently Asked Questions

Which file formats does MarkItDown support?

MarkItDown supports PDF, DOCX, PPTX, XLSX, images (PNG, JPG), audio (WAV, MP3), HTML, CSV, and more. The library uses format-specific parsers to produce structured Markdown for each type.

Does MarkItDown handle tables?

Yes. Tables from XLSX, DOCX, and PDF files are converted to Markdown table syntax. This preserves the tabular structure in a format that LLMs can process effectively.
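For readers unfamiliar with the target syntax, the sketch below renders rows as a pipe-delimited Markdown table. It only illustrates what that syntax looks like; it is not MarkItDown's implementation, the helper name to_markdown_table is hypothetical, and the sample values are made up.

```python
from typing import List, Sequence


def to_markdown_table(header: Sequence[str], rows: Sequence[Sequence[object]]) -> str:
    """Render a header and rows as a pipe-delimited Markdown table."""
    def cell(value: object) -> str:
        # Escape pipes and flatten newlines so cell text cannot break the layout.
        return str(value).replace("|", "\\|").replace("\n", " ")

    lines: List[str] = [
        "| " + " | ".join(cell(h) for h in header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(cell(v) for v in row) + " |")
    return "\n".join(lines)


# Made-up sample values, for illustration only.
print(to_markdown_table(["Quarter", "Revenue"], [["Q1", 120], ["Q2", 135]]))
# | Quarter | Revenue |
# | --- | --- |
# | Q1 | 120 |
# | Q2 | 135 |
```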

Can I use MarkItDown in a RAG pipeline?

Yes. MarkItDown is commonly used as the document parsing step in RAG pipelines. Convert files to Markdown, chunk the text, embed chunks into a vector database, and retrieve them at query time.
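The embed and retrieve steps can be sketched with the standard library alone. Everything here is a toy stand-in chosen for illustration: embed uses normalized bag-of-words counts in place of a real embedding model, and InMemoryIndex is a hypothetical replacement for a vector database. The chunks are assumed to be strings already produced from converted Markdown.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Dict[str, float]:
    """Toy embedding: L2-normalized bag-of-words counts (stand-in for a real model)."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Dot product of two normalized sparse vectors, which equals their cosine similarity."""
    if len(a) > len(b):
        a, b = b, a
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


class InMemoryIndex:
    """Minimal stand-in for a vector database."""

    def __init__(self) -> None:
        self._items: List[Tuple[str, Dict[str, float]]] = []

    def add(self, chunks: List[str]) -> None:
        for chunk in chunks:
            self._items.append((chunk, embed(chunk)))

    def search(self, query: str, k: int = 3) -> List[str]:
        query_vec = embed(query)
        ranked = sorted(self._items, key=lambda item: cosine(query_vec, item[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]


# Made-up chunks standing in for pieces of converted Markdown.
index = InMemoryIndex()
index.add([
    "# Revenue\nRevenue grew 12% in Q3.",
    "# Hiring\nWe hired 40 engineers.",
    "# Risks\nSupply chain delays remain a risk.",
])
print(index.search("how many engineers were hired", k=1))
```

A production pipeline would swap embed for a model-based embedding and InMemoryIndex for a real vector store; the add and search flow stays the same.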

Is OCR built in?

MarkItDown supports OCR for images and scanned PDFs, but it requires optional dependencies. Install the OCR extras to enable this feature. Text-based documents work without additional dependencies.

How does MarkItDown compare to other document parsers?

MarkItDown focuses on producing clean Markdown specifically for LLM consumption. Other tools like Apache Tika or textract extract raw text. MarkItDown preserves document structure (headings, tables, lists) in Markdown format.

Source & Thanks

Created by Microsoft. Licensed under MIT.

markitdown — stars 8,000+

Thanks to Microsoft for making document-to-Markdown universal.
