Introduction
PageIndex is a document indexing tool for retrieval-augmented generation (RAG) that builds structured, page-level indexes instead of relying on embedding-based vector search. An LLM reasons over the document's structure to retrieve the relevant pages, which is intended to reduce hallucination and improve answer quality on complex documents.
What PageIndex Does
- Parses PDFs, Markdown, and other document formats into structured page-level representations
- Builds hierarchical indexes based on document structure (sections, headings, page boundaries)
- Retrieves relevant pages using LLM-driven reasoning rather than vector similarity
- Supports multi-document collections with cross-document search
- Provides a query interface that returns page references with context
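To make the idea of a structure-based hierarchical index concrete, here is a minimal sketch that turns Markdown headings into a nested outline using only the Python standard library. It is an illustration, not PageIndex's implementation: the function name `build_outline` and the node layout are hypothetical, and it handles only `#`-style headings (it does not skip headings that appear inside code fences).

```python
import re

# Matches ATX-style Markdown headings such as "## Setup".
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")

def build_outline(markdown_text):
    """Return nested nodes of the form {"title", "level", "line", "children"}.

    Hypothetical helper for illustration; not part of any PageIndex API.
    """
    root = {"title": "<root>", "level": 0, "line": 0, "children": []}
    stack = [root]  # path from the root to the most recent heading
    for lineno, line in enumerate(markdown_text.splitlines(), start=1):
        match = HEADING.match(line)
        if not match:
            continue
        level = len(match.group(1))
        node = {"title": match.group(2), "level": level,
                "line": lineno, "children": []}
        # Pop back up until the top of the stack is a shallower heading.
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root["children"]
```

Calling `build_outline("# Guide\n## Install\n## Usage\n")` yields one top-level node, `Guide`, with `Install` and `Usage` as its children. A real indexer would also record page boundaries on each node.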
Architecture Overview
PageIndex processes documents into a page graph where each node holds content and structural metadata. At query time, an LLM reasons over the index structure (table of contents, headings, summaries) to identify relevant pages without computing embeddings. This avoids the chunking and embedding pipeline of traditional RAG.
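The query-time flow described above can be sketched as a top-down walk over a tree of page nodes. The code below is a simplified model under stated assumptions, not PageIndex's actual data model or API: `PageNode`, `render_outline`, and `select_pages` are hypothetical names, and `keyword_picker` is a deliberately naive stand-in for the LLM relevance judgment so that the example runs without a model.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PageNode:
    """One node of a hypothetical page tree: a section and its page span."""
    title: str
    start_page: int
    end_page: int
    summary: str = ""
    children: List["PageNode"] = field(default_factory=list)

def render_outline(nodes, depth=0):
    """Flatten the tree into indented lines that could be placed in a prompt."""
    lines = []
    for n in nodes:
        lines.append(f"{'  ' * depth}- {n.title} "
                     f"(pp. {n.start_page}-{n.end_page}): {n.summary}")
        lines.extend(render_outline(n.children, depth + 1))
    return lines

def keyword_picker(question, node):
    """Stand-in for the LLM: relevant if the question shares a word with the node."""
    q_words = set(question.lower().split())
    node_words = set(f"{node.title} {node.summary}".lower().split())
    return bool(q_words & node_words)

def select_pages(question, nodes, is_relevant=keyword_picker):
    """Descend only into nodes judged relevant; return the matching page numbers."""
    pages = []
    for n in nodes:
        if not is_relevant(question, n):
            continue
        hits = select_pages(question, n.children, is_relevant)
        # If no child narrows the match, fall back to the node's whole span.
        pages.extend(hits if hits else range(n.start_page, n.end_page + 1))
    return pages
```

In a real setup, `is_relevant` would be replaced by a call that shows the model the rendered outline and asks which sections to open. No embeddings are computed at any point; the only inputs are titles, summaries, and page spans.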
Self-Hosting & Configuration
- Install via pip; requires Python 3.9+
- Configure the LLM backend for index building and query reasoning
- Supports local models via Ollama or cloud APIs for the reasoning step
- Index files are stored as JSON for easy inspection and version control
- Integrates with LangChain and LlamaIndex as a retriever component
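Because indexes are plain JSON, they can be written in a form that diffs cleanly under version control. The sketch below shows one way to do that with the standard library. The index layout in the usage note is a made-up example, not PageIndex's actual schema, and `dump_index`, `save_index`, and `load_index` are hypothetical helpers.

```python
import json
from pathlib import Path

def dump_index(index):
    """Serialize an index dict to stable, human-readable JSON text."""
    # sort_keys and a fixed indent keep output identical for equal content,
    # so version-control diffs show only real changes.
    return json.dumps(index, indent=2, sort_keys=True, ensure_ascii=False) + "\n"

def save_index(index, path):
    """Write the index to disk as UTF-8 JSON."""
    Path(path).write_text(dump_index(index), encoding="utf-8")

def load_index(path):
    """Read an index previously written by save_index."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

For example, `save_index({"doc": "manual.pdf", "nodes": [{"title": "Intro", "pages": [1, 2]}]}, "manual.index.json")` produces a file that can be opened in any editor and committed alongside the source document.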
Key Features
- Vectorless retrieval eliminates embedding drift and chunk boundary issues
- Page-level granularity preserves document context better than small chunks
- Structured reasoning lets the LLM navigate documents like a human reader
- Works well with long-form technical documents, manuals, and reports
- Lightweight indexes that take less space than the vector store an embedding pipeline would need
Comparison with Similar Tools
- LlamaIndex — general-purpose RAG framework using vector search; PageIndex uses reasoning-based retrieval
- LangChain retrievers — embedding-based by default; PageIndex provides a complementary non-vector approach
- Unstructured — focuses on document parsing; PageIndex adds structured indexing and retrieval on top
- Docling — document conversion tool; PageIndex goes further with index building and query handling
- RAGFlow — full RAG pipeline with chunking; PageIndex avoids chunking entirely
FAQ
Q: Does PageIndex replace vector databases? A: It offers an alternative. For structured documents, reasoning-based retrieval can outperform vector search on accuracy.
Q: What LLMs work with PageIndex? A: Any LLM accessible via API or local serving. Stronger models produce better reasoning over document structure.
Q: Can it handle scanned PDFs? A: It works best with text-based PDFs. For scanned documents, combine with an OCR tool first.
Q: How large can the document collection be? A: Scales to thousands of documents. Index size grows linearly with page count and does not depend on an embedding dimension, since no vectors are stored.