Scripts · Jun 1, 2026 · 3 min read

LiteParse — Fast Open-Source Document Parser in Rust

An open-source document parser by LlamaIndex that extracts structured text from PDFs and other documents for RAG and AI pipelines, with an emphasis on speed and accuracy.

Agent ready

Ready-to-run agent install

This asset can be installed after the agent chooses its runtime, checks the plan, and runs the matching command.

Native · 98/100 · Policy: allow
Agent surface: Any MCP/CLI agent
Kind: Skill
Install: Single
Trust: Established
Entrypoint: LiteParse Overview
Direct install command
npx -y tokrepo@latest install 2bc2689f-5df7-11f1-9bc6-00163e2b0d79 --target codex

Run after dry-run confirms the install plan.

Introduction

LiteParse is a fast, open-source document parser built in Rust by the LlamaIndex team. It extracts structured text from PDFs and other document formats with a focus on speed and accuracy, making it ideal for RAG pipelines and LLM-powered applications that need to ingest large volumes of documents.

What LiteParse Does

  • Parses PDFs into clean, structured Markdown or JSON output
  • Extracts text with layout awareness: headings, paragraphs, tables, and lists
  • Processes documents significantly faster than Python-based parsers
  • Handles scanned PDFs via integrated OCR capabilities
  • Provides both a CLI tool and Python bindings for programmatic use

Architecture Overview

LiteParse is written in Rust for maximum throughput and compiled into a native binary with Python bindings via PyO3. The parsing pipeline first extracts raw content using a custom PDF reader, then runs layout analysis to classify regions as headings, body text, tables, or figures. A reconstruction step produces clean Markdown or structured JSON preserving the document hierarchy. For scanned pages, an OCR module is invoked automatically.
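
The three stages described above (extraction, layout analysis, reconstruction) plus the OCR check can be illustrated with a toy sketch. This is not LiteParse's code or API: the `Region` type, the font-size heuristic, and the `needs_ocr` threshold are all hypothetical stand-ins chosen to keep the example small.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical text region, as an extraction stage might emit it."""
    text: str
    font_size: float


def needs_ocr(regions, min_chars=20):
    """Assumed heuristic: a page with almost no extractable text is scanned."""
    return sum(len(r.text.strip()) for r in regions) < min_chars


def classify(region, body_size=11.0):
    """Toy layout analysis: text much larger than body size is a heading."""
    if region.font_size >= body_size * 1.5:
        return "heading"
    return "body"


def to_markdown(regions):
    """Reconstruction: emit Markdown that keeps the heading/body hierarchy."""
    blocks = []
    for region in regions:
        text = region.text.strip()
        if not text:
            continue  # skip empty regions
        prefix = "## " if classify(region) == "heading" else ""
        blocks.append(prefix + text)
    return "\n\n".join(blocks)


page = [
    Region("Quarterly Report", 18.0),
    Region("Revenue grew in the third quarter.", 11.0),
]
print(needs_ocr(page))
print(to_markdown(page))
```

A real layout analyzer would also use positions, fonts, and table detection; the sketch only shows how classification feeds the Markdown reconstruction step.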

Self-Hosting & Configuration

  • Install via pip: pip install liteparse
  • No external services or API keys required
  • Configure output format (Markdown, JSON, plain text) via CLI flags
  • Adjust OCR sensitivity and language settings for scanned documents
  • Use the Python API for integration into existing data pipelines

Key Features

  • Rust-powered speed for processing large document collections
  • Layout-aware parsing preserving document structure
  • Automatic OCR fallback for scanned or image-based PDFs
  • Clean Markdown output ready for LLM consumption
  • Python bindings for seamless integration with LlamaIndex and other frameworks
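
To show why heading-preserving Markdown suits LLM pipelines, here is a minimal sketch that splits Markdown into per-section chunks for retrieval. It uses only the Python standard library and assumes ATX-style `#` headings; `chunk_by_heading` is a hypothetical helper, not part of LiteParse or LlamaIndex.

```python
import re

# Matches ATX headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown):
    """Split Markdown into (heading, body) chunks, one per heading."""
    chunks = []
    title, lines = None, []

    def flush():
        # Keep a section if it has a heading or any non-blank body text.
        if title is not None or any(l.strip() for l in lines):
            chunks.append((title, "\n".join(lines).strip()))

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the previous section before starting a new one
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks


sample = "# Intro\nFirst paragraph.\n\n## Details\nSecond paragraph."
for heading, body in chunk_by_heading(sample):
    print(heading, "->", body)
```

Text that appears before the first heading is returned with a heading of `None`, so nothing is dropped when a document opens with body text.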

Comparison with Similar Tools

  • PyPDF/PyMuPDF — Python PDF libraries with limited layout analysis; LiteParse adds structure-aware extraction
  • Docling — IBM's document parser; LiteParse is Rust-native and focused on speed
  • Marker — PDF to Markdown converter; LiteParse is built by the LlamaIndex team for RAG pipeline integration
  • Unstructured.io — comprehensive document ETL; LiteParse is lighter and faster for the parsing step

FAQ

Q: How much faster is it compared to Python parsers?
A: The Rust core provides significant speed improvements on PDF processing. Benchmarks vary by document complexity.

Q: Does it work with non-PDF documents?
A: The primary focus is PDF. Support for additional formats is being added.

Q: Can I use it without the Python wrapper?
A: The Rust binary can be used directly from the command line.

Q: Is it production-ready?
A: It is actively developed by the LlamaIndex team and used in their production pipelines.
