TOKREPO · ARSENAL

PDF + Research Paper RAG Pack

Ten picks for the researcher, analyst, or lawyer drowning in a corpus of PDFs and papers, built around a real RAG pipeline: ingest → parse (Zerox, OpenDataLoader, Surya) → embed & index (Pinecone Assistant, PageIndex, Cherry Studio KB) → retrieve & chat (RAGFlow, Kotaemon) → rerank (Cohere Rerank), plus translation of non-English papers before ingest (PDFMathTranslate). Install in this order so you can drop a stack of 200 PDFs in one folder and have a conversation with it by tonight.

What's in this pack

If you are a researcher, analyst, or lawyer, the bottleneck is not search — it is the PDF. Papers, contracts, filings, white papers, regulator memos. Most arrive as 1990s-era PDFs with two-column layouts, scanned pages, embedded tables, footnotes that matter more than the body. Throwing them at a general-purpose chatbot fails for the same three reasons every time: parsing is wrong, retrieval is dumb, and the model never sees the right chunk.

This pack is structured as a pipeline, not a shopping list. Each tool owns one stage, and the install order is the order data flows. It differs from the PhD Researcher's Literature + Code Pack, which covers literature search and code reproduction: this pack assumes you already have the PDFs and need to actually talk to the corpus.

Install in this order

Stage 1 — Parse (turn PDFs into clean markdown)

  1. Zerox — vision-model OCR for any PDF. Converts pages to images and asks GPT-4o or Claude to return clean markdown. Wins on dirty scans, two-column papers, and contracts where layout matters. The bet: a frontier vision model beats a 2018 OCR stack on hard PDFs, and you pay only when you actually run it.
  2. OpenDataLoader PDF — text-first parser tuned for AI ingestion. Preserves structure (sections, tables, lists) into clean JSON or markdown. Faster and cheaper than Zerox for born-digital PDFs (papers from arXiv, recent contracts). Despite its place in this list, run it before Zerox, and fall back to Zerox only for the roughly 10% of files that fail.
  3. Surya — open-source OCR for 90+ languages. Mandatory if your corpus has Chinese, Japanese, Arabic, or Cyrillic papers. Runs locally — confidential drafts never leave your machine.
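
The "cheap parser first, vision OCR on failure" routing described above can be sketched as follows. This is a minimal illustration, not either tool's API: the `fast` and `vision` callables are hypothetical wrappers you would write around OpenDataLoader PDF and Zerox, and the 200-character failure threshold is an arbitrary assumption.

```python
from pathlib import Path
from typing import Callable

# Hypothetical parser signature: takes a PDF path, returns markdown text.
Parser = Callable[[Path], str]


def looks_unparsed(markdown: str, min_chars: int = 200) -> bool:
    """Crude failure check: almost no text usually means a scanned file."""
    return len(markdown.strip()) < min_chars


def parse_with_fallback(pdf: Path, fast: Parser, vision: Parser) -> tuple[str, str]:
    """Try the cheap text-first parser, fall back to vision OCR on failure.

    Returns (markdown, route), where route is "fast" or "vision".
    """
    try:
        markdown = fast(pdf)
    except Exception:
        markdown = ""  # treat a crash the same as an empty parse
    if looks_unparsed(markdown):
        return vision(pdf), "vision"
    return markdown, "fast"
```

Logging the returned route per file gives you the actual share of PDFs that needed the expensive path, which is worth knowing before you budget for it.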

Stage 2 — Index (embed and store the parsed text)

  1. Cherry Studio Knowledge Base — local RAG with native support for 50+ formats. The fastest way to drop a folder of PDFs and get a chat UI on top, all on your laptop. Start here unless you need multi-user or cloud.
  2. Pinecone Assistant — managed RAG service with auto-indexing. When the corpus crosses ~10k documents or your team needs shared access, Pinecone Assistant handles ingestion, embedding, retrieval, and citations without you wiring it. Trade privacy for scale.
  3. PageIndex — document index for reasoning-based RAG. Instead of flat chunk embeddings, PageIndex builds a hierarchical table-of-contents-aware index. The retrieval quality on long papers (40+ pages) is visibly better because the model can reason about where in the document an answer lives.
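
To make the idea of a table-of-contents-aware index concrete, here is a toy sketch that nests parsed markdown under its headings. It is only an illustration of the general technique; it is not PageIndex's implementation or API, and the `Node`, `build_tree`, and `outline` names are invented for this example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    title: str
    level: int
    text: str = ""
    children: list["Node"] = field(default_factory=list)


HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def build_tree(markdown: str) -> Node:
    """Nest sections under their parent heading (toy illustration only)."""
    root = Node(title="document", level=0)
    stack = [root]  # path from the root to the section being filled
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match is None:
            stack[-1].text += line + "\n"
            continue
        level = len(match.group(1))
        # Close sections at the same or a deeper level before attaching.
        while stack[-1].level >= level:
            stack.pop()
        node = Node(title=match.group(2).strip(), level=level)
        stack[-1].children.append(node)
        stack.append(node)
    return root


def outline(node: Node, depth: int = 0) -> list[str]:
    """Flatten the tree into an indented table of contents."""
    lines = [] if node.level == 0 else ["  " * (depth - 1) + node.title]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines
```

A reasoning-based retriever could show the model the short outline first, let it pick a section, and only then load that section's text, instead of ranking every chunk by embedding similarity.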

Stage 3 — Chat (the user-facing layer)

  1. RAGFlow — deep document understanding RAG engine. The best open-source option for tables, complex layouts, and citation-grounded answers. Self-hosted, runs on Docker, includes a complete chat UI with source highlighting.
  2. Kotaemon — open-source RAG document chat (the ChatPDF clone people actually keep using). Lighter than RAGFlow, easier to deploy, hot-swappable LLMs, multi-PDF chat works out of the box.

Stage 4 — Rerank and Translate

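  1. Cohere Rerank — boost RAG accuracy with Rerank-3. Place it between any retriever and the LLM, so the retriever's candidates are re-scored before the model sees them. It is among the highest-leverage 10 lines of code you can add to a RAG stack; the relevance lift quoted for noisy corpora is 20-40%, though the gain varies with the corpus and the baseline retriever.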
  2. PDFMathTranslate — translate PDF papers while preserving the original layout, equations, and figures. Essential if half your reading list is in another language and you want a side-by-side compare before feeding it to the index.
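
The rerank step can be sketched in about ten lines. This assumes the Cohere Python SDK's `client.rerank(...)` call and the `rerank-english-v3.0` model name; check both against the current Cohere documentation before relying on them. The client is passed in as a parameter so the function can be exercised with a stub, and `retriever.search` in the usage comment is a hypothetical placeholder for whatever retriever you use.

```python
def rerank_chunks(client, query: str, chunks: list[str], top_n: int = 5) -> list[tuple[str, float]]:
    """Re-score retrieved chunks and keep the best top_n.

    `client` is expected to behave like a cohere.Client.
    """
    if not chunks:
        return []
    response = client.rerank(
        model="rerank-english-v3.0",  # assumed model name; verify for your account
        query=query,
        documents=chunks,
        top_n=min(top_n, len(chunks)),
    )
    # Each result carries the index of the original chunk and a relevance score.
    return [(chunks[r.index], r.relevance_score) for r in response.results]


# Usage (requires `pip install cohere` and an API key):
#   import cohere
#   client = cohere.Client("YOUR_API_KEY")
#   candidates = retriever.search(query, k=50)   # hypothetical retriever
#   best = rerank_chunks(client, query, candidates, top_n=5)
```

The usual pattern is to over-retrieve (say 50 candidates) and let the reranker cut that down to the handful of chunks that go into the prompt.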

How the stages fit together

PDFs in folder
   │
   ├─ OpenDataLoader (born-digital, fast)
   │
   ├─ Zerox (dirty scans, complex layouts)
   │
   └─ Surya (non-English OCR)
         │
         ▼
   clean markdown + structure
         │
         ├─ Cherry Studio KB (local, laptop scale)
         │
         ├─ Pinecone Assistant (cloud, team scale)
         │
         └─ PageIndex (long-doc, reasoning-aware)
               │
               ▼
         ┌─────────────────┐
         │ RAGFlow         │
         │ or Kotaemon     │
         │ (chat UI)       │
         └─────────────────┘
               │
               + Cohere Rerank between retrieval and the LLM
               + PDFMathTranslate before ingest for non-EN papers

The critical insight: most failing RAG demos fail at parse, not at retrieval. If your tables come out as "Table 1" with no data, no retriever fixes that. Spend Day 1 on Stage 1; the rest gets easier.

Tradeoffs you'll hit

  • Local vs cloud — Cherry Studio KB and Kotaemon run on your laptop; Pinecone Assistant ships your text to a vendor. For confidential corpora (legal, medical, M&A), stay local.
  • RAGFlow vs Kotaemon — RAGFlow has the better table parser and citation UI; Kotaemon is easier to deploy and customize. Pick RAGFlow if your corpus is table-heavy (financials, scientific papers); Kotaemon for prose-heavy (legal memos, white papers).
  • Zerox cost — vision-model OCR is roughly $0.01-0.03 per page on GPT-4o. A 200-paper corpus at 30 pages average runs $60-180 once. For ongoing pipelines, route only failed parses to Zerox.
  • Cohere Rerank API key — adds a third-party dependency. If that's a dealbreaker, you can self-host a reranker (BGE-reranker, Jina), but the integration work is real.
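
The Zerox cost estimate above is simple arithmetic, and a small helper makes it easy to redo for your own corpus. The per-page prices are the rough figures quoted in this section, and the 10% fallback share is the assumption from Stage 1, not a measured rate.

```python
def zerox_cost_range(
    documents: int,
    pages_per_document: float,
    low_per_page: float = 0.01,
    high_per_page: float = 0.03,
    fallback_share: float = 1.0,
) -> tuple[float, float]:
    """Return the (low, high) one-off cost in dollars.

    fallback_share is the fraction of pages routed to vision OCR:
    1.0 sends everything, 0.1 sends only the failed parses.
    """
    pages = documents * pages_per_document * fallback_share
    return round(pages * low_per_page, 2), round(pages * high_per_page, 2)


full = zerox_cost_range(200, 30)
routed = zerox_cost_range(200, 30, fallback_share=0.10)
print(f"all pages:     ${full[0]:.0f} to ${full[1]:.0f}")
print(f"failures only: ${routed[0]:.0f} to ${routed[1]:.0f}")
```

With 200 papers at 30 pages each, the first line reproduces the $60 to $180 figure; routing only a tenth of the pages to Zerox brings it down to $6 to $18.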

Common pitfalls

  • Chunk size set blindly to 512 tokens — fine for general text, catastrophic for papers where a method section runs 4000 tokens. Match chunk size to the document type.
  • No source highlighting in the chat UI — researchers won't trust an answer without seeing the page. RAGFlow and Kotaemon both do this well; if you build your own UI, ship citations from day one.
  • Ingesting before parsing is verified — open 5 random parsed outputs by hand before pushing 200 PDFs through the embedder. Bad parsing pollutes the index irreversibly.
  • Forgetting to rerank — every team adds Cohere Rerank in week 3 after complaining about retrieval quality. Add it in week 1.
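
One way to avoid the blind 512-token chunk is to split on section headings and only subdivide sections that are too large. The sketch below assumes the parser emitted markdown with `#` headings, uses word count as a rough stand-in for tokens, and the `max_words` default is a placeholder to tune per document type.

```python
import re


def chunk_by_section(markdown: str, max_words: int = 3000) -> list[str]:
    """Split on markdown headings; only subdivide oversized sections.

    Word count stands in for tokens here, which is an approximation.
    """
    # Split before every heading line, keeping each heading with its body.
    sections = re.split(r"(?m)^(?=#{1,6}\s)", markdown)
    chunks = []
    for section in sections:
        words = section.split()
        if not words:
            continue
        if len(words) <= max_words:
            chunks.append(section.strip())
            continue
        # Oversized section: fall back to paragraph-aligned pieces.
        piece, count = [], 0
        for paragraph in section.split("\n\n"):
            size = len(paragraph.split())
            if piece and count + size > max_words:
                chunks.append("\n\n".join(piece).strip())
                piece, count = [], 0
            piece.append(paragraph)
            count += size
        if piece:
            chunks.append("\n\n".join(piece).strip())
    return chunks
```

A single paragraph longer than `max_words` is kept whole rather than cut mid-sentence, so treat the limit as a target, not a hard cap.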

INSTALL · ONE COMMAND

$ tokrepo install pack/pdf-research-paper-rag
hand it to your agent, or paste it in your terminal

What's inside

10 assets in this pack

Skill#01
Zerox — Zero-Shot PDF OCR for AI Pipelines

Extract text from any PDF using vision models as OCR. Zerox converts PDF pages to images then uses GPT-4o or Claude to extract clean markdown without training.

by Script Depot·205 views
$ tokrepo install zerox-zero-shot-pdf-ocr-ai-pipelines-3ac555d9
Skill#02
OpenDataLoader PDF — AI-Ready Document Parser

An open-source PDF parser that automates document accessibility and extracts structured, AI-ready data including tables, text, bounding boxes, and tagged content.

by AI Open Source·63 views
$ tokrepo install opendataloader-pdf-ai-ready-document-parser-841f15d1
Skill#03
Surya — Document OCR for 90+ Languages

Surya is a document OCR toolkit with 19.5K+ GitHub stars. Text recognition in 90+ languages, layout analysis, table detection, reading order, and LaTeX OCR. Benchmarks favorably against cloud OCR services.

by Script Depot·385 views
$ tokrepo install surya-document-ocr-90-languages-66bc0630
Skill#04
Cherry Studio Knowledge Base — Local RAG with 50+ Formats

Cherry Studio Knowledge Base ingests PDFs, Office docs, Markdown into a local vector index. Query offline, BYOK any LLM. Data stays on your machine.

by Cherry Studio·130 views
$ tokrepo install cherry-studio-knowledge-base-local-rag-with-50-formats
Skill#05
Pinecone Assistant — Managed RAG Service with Auto-Indexing

Pinecone Assistant is the fully managed RAG product on Pinecone. Upload PDFs, query with natural language, get cited answers — no chunking pipeline.

by Pinecone·95 views
$ tokrepo install pinecone-assistant-managed-rag-service-with-auto-indexing
Skill#06
PageIndex — Document Index for Reasoning-Based RAG

A document indexing system that enables vectorless retrieval-augmented generation by building structured page-level indexes for LLM reasoning.

by AI Open Source·91 views
$ tokrepo install pageindex-document-index-reasoning-based-rag-7421307d
Skill#07
RAGFlow — Deep Document Understanding RAG Engine

Open-source RAG engine with deep document understanding. Parses complex PDFs, tables, images. Agent-powered Q&A with citations. Multi-model. 77K+ stars.

by Script Depot·251 views
$ tokrepo install ragflow-deep-document-understanding-rag-engine-7785d7a8
Skill#08
Kotaemon — Open-Source RAG Document Chat

Clean, open-source RAG tool for chatting with your documents. Supports PDF, DOCX, web pages. Multi-model, citation, and multi-user. Self-hostable. 25K+ stars.

by Script Depot·232 views
$ tokrepo install kotaemon-open-source-rag-document-chat-b0f93b10
Skill#09
Cohere Rerank — Boost RAG Accuracy with Rerank-3

Cohere Rerank scores candidates against a query using a cross-encoder. Drop into any RAG to boost top-1 hit rate by 30-50% over vector search alone.

by Cohere·98 views
$ tokrepo install cohere-rerank-boost-rag-accuracy-with-rerank-3
Skill#10
PDFMathTranslate — Translate PDF Papers Preserving Format

Translate PDF scientific papers while preserving math formulas, charts, and layout. Supports Google, DeepL, OpenAI, Ollama. CLI, GUI, MCP, Docker, Zotero plugin.

by Script Depot·241 views
$ tokrepo install pdfmathtranslate-translate-pdf-papers-preserving-format-4c628f43
FAQ

Frequently asked questions

Do I need all ten tools, or can I start with two or three?

Start with three: one parser (OpenDataLoader PDF for born-digital, or Zerox for dirty scans), one index (Cherry Studio Knowledge Base for laptop scale), and one chat UI (Kotaemon). That trio gets you a working multi-PDF chat in an afternoon. Add Cohere Rerank as soon as retrieval quality becomes the bottleneck, then layer in PageIndex for long documents and PDFMathTranslate for non-English papers. The full stack only makes sense at corpus sizes above a few hundred documents.

How is this different from the PhD Researcher's Literature + Code Pack?

Different stages of the research workflow. The PhD pack covers literature search, reference management, and reproducing paper code (Zotero, arXiv MCP, GPT Researcher, JupyterLab, AI Scientist). This pack assumes you already have the PDFs in a folder and need to extract structured information from them at scale — that means a real RAG pipeline with parse, index, retrieve, rerank stages. Many researchers use both: PhD pack to gather papers, this pack to interrogate them.

Is any of this safe for confidential documents like legal contracts or medical records?

Yes, if you stick to the local-first stack. Surya runs OCR on your machine, Cherry Studio Knowledge Base and Kotaemon both run fully local with local LLM backends (Ollama, llama.cpp), and RAGFlow can be self-hosted in Docker on private infrastructure. The cloud picks (Pinecone Assistant, Cohere Rerank, Zerox via GPT-4o/Claude) all send text off-machine, so route those only to non-sensitive corpora. The Lawyer's AI Contract Review Kit on TokRepo covers privacy-aware tooling in more depth.

What about tables and figures inside the PDFs — do these actually extract them well?

Tables are the hardest part of PDF parsing. RAGFlow has the strongest built-in table parser among open-source options, and OpenDataLoader PDF preserves table structure as JSON when the source PDF is well-tagged. Zerox handles complex layouts because the vision model sees the page like a human would. Figures and equations are harder: Surya includes LaTeX OCR for equations, PDFMathTranslate keeps formulas intact through translation, and for figures most teams settle for keeping the image reference and letting the chat UI surface the original page.

How long does it take to go from a folder of PDFs to a working chat UI?

On a laptop with Cherry Studio KB or Kotaemon, you can be chatting with a small corpus (under 50 PDFs of born-digital text) in about 30 minutes — most of that is the initial parse and embed. A larger corpus (500 PDFs with scans and tables) takes a couple of hours of pipeline work: parse pass with OpenDataLoader, fallback pass with Zerox on the failures, ingest into RAGFlow, then a tuning pass on chunk size and reranker. After that the marginal cost of adding a new PDF is seconds.
