Configs · Apr 7, 2026 · 2 min read

MarkItDown — Convert Any File to Markdown for LLMs

Python library by Microsoft that converts PDF, DOCX, PPTX, XLSX, images, audio, and HTML to clean Markdown. Perfect for feeding documents into LLM context windows. 8,000+ stars.

Introduction

MarkItDown is a Python library by Microsoft that converts a wide range of document formats to clean Markdown. It has 8,000+ GitHub stars. Feed it PDFs, Word documents, PowerPoint decks, Excel spreadsheets, images (with OCR), audio (with transcription), or HTML, and it returns LLM-ready Markdown.

Unlike Docling, which focuses on layout-aware PDF parsing, MarkItDown prioritizes breadth: it handles 10+ formats through a single API. It is best suited to developers building RAG pipelines or tools that ingest diverse document types.

Works with: any LLM pipeline. Setup time: under 1 minute.


Supported Formats

| Format | Extension | Features |
| --- | --- | --- |
| PDF | .pdf | Text extraction, tables |
| Word | .docx | Headers, lists, tables, images |
| PowerPoint | .pptx | Slide text, speaker notes |
| Excel | .xlsx | Tables with headers |
| HTML | .html | Clean text extraction |
| Images | .jpg, .png | OCR via Azure/OpenAI Vision |
| Audio | .mp3, .wav | Transcription via Whisper |
| CSV | .csv | Table format |
| JSON | .json | Structured text |
| XML | .xml | Text extraction |
| ZIP | .zip | Processes contained files |
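
When ingesting a mixed folder, it can help to filter files by extension before handing them to the converter. The following is a minimal sketch, assuming the extension list in the table above matches your installed version. `SUPPORTED_EXTENSIONS`, `is_supported`, and `split_supported` are hypothetical names used for illustration; they are not part of MarkItDown.

```python
from pathlib import Path

# Extensions copied from the table above. This is an assumption about
# what the installed MarkItDown version handles, not a list it exposes.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".pptx", ".xlsx", ".html",
    ".jpg", ".png", ".mp3", ".wav", ".csv",
    ".json", ".xml", ".zip",
}


def is_supported(path) -> bool:
    """Return True if the file extension is in the table (case-insensitive)."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


def split_supported(paths):
    """Partition paths into (supported, skipped) lists, preserving order."""
    supported, skipped = [], []
    for p in paths:
        (supported if is_supported(p) else skipped).append(Path(p))
    return supported, skipped
```

Logging the skipped list rather than dropping it silently makes it easier to notice documents that never reached the pipeline.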

Batch Conversion

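from pathlib import Path
from markitdown import MarkItDown

md = MarkItDown()
out_dir = Path("./markdown")
out_dir.mkdir(parents=True, exist_ok=True)  # create the output folder if missing
for file in Path("./documents").glob("*.*"):
    if not file.is_file():
        continue  # skip directories whose names contain a dot
    result = md.convert(str(file))
    # Write the converted Markdown next to the others, named after the source file
    (out_dir / f"{file.stem}.md").write_text(result.text_content, encoding="utf-8")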
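
In the loop above, an exception raised for one file aborts the whole run, and two inputs that share a stem (for example `report.pdf` and `report.docx`) write to the same output file. A hedged sketch of a more defensive variant follows. `convert_folder` is a hypothetical helper; it only assumes the converter object has a `convert(path)` method whose result exposes `text_content`, as in the snippet above.

```python
from pathlib import Path


def convert_folder(converter, src_dir, out_dir):
    """Convert every file directly under src_dir; return (written, failed).

    `converter` is any object with convert(path) -> result.text_content,
    for example a MarkItDown() instance.
    """
    src_dir, out_dir = Path(src_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written, failed = [], []
    for file in sorted(src_dir.iterdir()):
        if not file.is_file():
            continue  # skip subdirectories
        try:
            result = converter.convert(str(file))
        except Exception as exc:  # keep going; record what failed and why
            failed.append((file, exc))
            continue
        # Keep the original extension in the output name so report.pdf
        # and report.docx do not overwrite each other.
        target = out_dir / f"{file.name}.md"
        target.write_text(result.text_content, encoding="utf-8")
        written.append(target)
    return written, failed
```

A call might look like `written, failed = convert_folder(MarkItDown(), "./documents", "./markdown")`, after which `failed` holds each problem file paired with the exception it raised.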

Image OCR (with LLM)

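from markitdown import MarkItDown
from openai import OpenAI
openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
md = MarkItDown(llm_client=openai_client, llm_model="gpt-4o")
result = md.convert("screenshot.png")
# Uses the vision model to describe the image and extract its text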

Audio Transcription

from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("meeting_recording.mp3")
# Uses Whisper for speech-to-text, outputs as Markdown

RAG Pipeline Integration

from markitdown import MarkItDown
from langchain_text_splitters import RecursiveCharacterTextSplitter

md = MarkItDown()
doc = md.convert("quarterly_report.pdf")
# chunk_size is measured in characters, not tokens
splitter = RecursiveCharacterTextSplitter(chunk_size=512)
chunks = splitter.split_text(doc.text_content)
# Feed chunks into your vector database
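
Because the converter's output is Markdown, headings are natural chunk boundaries. The following is a dependency-free sketch, assuming the output uses ATX-style `#` headings. `chunk_markdown` is a hypothetical helper and a simplified stand-in; it is not LangChain's splitting algorithm.

```python
import re


def chunk_markdown(text, chunk_size=512):
    """Split Markdown into chunks of at most chunk_size characters.

    The text is cut at heading lines first; any section longer than
    chunk_size is then cut into fixed-width slices.
    """
    # Split before each line that starts with 1-6 '#' followed by a space.
    sections = re.split(r"^(?=#{1,6} )", text, flags=re.MULTILINE)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        for start in range(0, len(section), chunk_size):
            chunks.append(section[start:start + chunk_size])
    return chunks
```

One limitation of this sketch: a line beginning with `# ` inside a code block is treated as a heading, so documents with embedded shell or Python comments may be split in unexpected places.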

MarkItDown vs Docling

| Feature | MarkItDown | Docling |
| --- | --- | --- |
| Focus | Format breadth | PDF accuracy |
| Formats | 10+ (PDF, DOCX, PPTX, audio...) | 6 (PDF, DOCX, PPTX, HTML...) |
| Table accuracy | Good | Excellent |
| Figure extraction | Basic | Advanced |
| OCR | Via LLM vision | Built-in models |
| By | Microsoft | IBM Research |

Key Stats

  • 8,000+ GitHub stars
  • By Microsoft
  • 10+ input formats
  • Image OCR and audio transcription
  • Single API for all formats

FAQ

Q: What is MarkItDown? A: A Python library by Microsoft that converts PDF, DOCX, PPTX, XLSX, images, audio, and HTML to clean Markdown for LLM consumption.

Q: Is MarkItDown free? A: Yes, open-source under MIT license.

Q: MarkItDown or Docling? A: MarkItDown for diverse formats (10+ types). Docling for high-accuracy PDF parsing with complex layouts.



Source and Acknowledgments

Created by Microsoft. Licensed under MIT.

markitdown — stars 8,000+

Thanks to Microsoft for making document-to-Markdown universal.
