[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"pack-detail-pdf-research-paper-rag-en":3,"seo:pack:pdf-research-paper-rag:en":96},{"code":4,"message":5,"data":6},200,"Operation successful",{"pack":7},{"slug":8,"icon":9,"tone":10,"status":11,"status_label":12,"title":13,"description":14,"items":15,"install_cmd":95},"pdf-research-paper-rag","📚","#7C2D12","new","New · this week","PDF + Research Paper RAG Pack","Ten picks for the researcher, analyst, or lawyer drowning in a corpus of PDFs and papers — built around a real RAG pipeline: ingest → parse (Zerox, OpenDataLoader, Surya) → embed & index (Pinecone Assistant, PageIndex, Cherry Studio KB) → retrieve & chat (RAGFlow, Kotaemon) → rerank (Cohere Rerank) → translate non-English papers (PDFMathTranslate). Install in this order so you can drop a stack of 200 PDFs in one folder and actually have a conversation with it by tonight.",[16,28,36,43,51,59,66,73,80,88],{"id":17,"uuid":18,"slug":19,"title":20,"description":21,"author_name":22,"view_count":23,"vote_count":24,"lang_type":25,"type":26,"type_label":27},758,"3ac555d9-d75c-4208-ba46-974e4a717234","zerox-zero-shot-pdf-ocr-ai-pipelines-3ac555d9","Zerox — Zero-Shot PDF OCR for AI Pipelines","Extract text from any PDF using vision models as OCR. 
Zerox converts PDF pages to images then uses GPT-4o or Claude to extract clean markdown without training.","Script Depot",205,0,"en","skill","Skill",{"id":29,"uuid":30,"slug":31,"title":32,"description":33,"author_name":34,"view_count":35,"vote_count":24,"lang_type":25,"type":26,"type_label":27},4036,"841f15d1-5079-11f1-9bc6-00163e2b0d79","opendataloader-pdf-ai-ready-document-parser-841f15d1","OpenDataLoader PDF — AI-Ready Document Parser","An open-source PDF parser that automates document accessibility and extracts structured, AI-ready data including tables, text, bounding boxes, and tagged content.","AI Open Source",63,{"id":37,"uuid":38,"slug":39,"title":40,"description":41,"author_name":22,"view_count":42,"vote_count":24,"lang_type":25,"type":26,"type_label":27},263,"66bc0630-1be7-4da3-b227-f1fdb1faa065","surya-document-ocr-90-languages-66bc0630","Surya — Document OCR for 90+ Languages","Surya is a document OCR toolkit with 19.5K+ GitHub stars. Text recognition in 90+ languages, layout analysis, table detection, reading order, and LaTeX OCR. Benchmarks favorably against cloud OCR serv",385,{"id":44,"uuid":45,"slug":46,"title":47,"description":48,"author_name":49,"view_count":50,"vote_count":24,"lang_type":25,"type":26,"type_label":27},2820,"e8255b25-1bb1-47a8-bff9-ca5a445ce3f1","cherry-studio-knowledge-base-local-rag-with-50-formats","Cherry Studio Knowledge Base — Local RAG with 50+ Formats","Cherry Studio Knowledge Base ingests PDFs, Office docs, Markdown into a local vector index. Query offline, BYOK any LLM. Data stays on your machine.","Cherry Studio",130,{"id":52,"uuid":53,"slug":54,"title":55,"description":56,"author_name":57,"view_count":58,"vote_count":24,"lang_type":25,"type":26,"type_label":27},2812,"63b22f3a-181d-4032-bfa8-3be176e193df","pinecone-assistant-managed-rag-service-with-auto-indexing","Pinecone Assistant — Managed RAG Service with Auto-Indexing","Pinecone Assistant is the fully managed RAG product on Pinecone. 
Upload PDFs, query with natural language, get cited answers — no chunking pipeline.","Pinecone",95,{"id":60,"uuid":61,"slug":62,"title":63,"description":64,"author_name":34,"view_count":65,"vote_count":24,"lang_type":25,"type":26,"type_label":27},2171,"7421307d-416b-11f1-9bc6-00163e2b0d79","pageindex-document-index-reasoning-based-rag-7421307d","PageIndex — Document Index for Reasoning-Based RAG","A document indexing system that enables vectorless retrieval-augmented generation by building structured page-level indexes for LLM reasoning.",91,{"id":67,"uuid":68,"slug":69,"title":70,"description":71,"author_name":22,"view_count":72,"vote_count":24,"lang_type":25,"type":26,"type_label":27},245,"7785d7a8-fc57-42ab-ba6b-4a970404fadc","ragflow-deep-document-understanding-rag-engine-7785d7a8","RAGFlow — Deep Document Understanding RAG Engine","Open-source RAG engine with deep document understanding. Parses complex PDFs, tables, images. Agent-powered Q&A with citations. Multi-model. 77K+ stars.",251,{"id":74,"uuid":75,"slug":76,"title":77,"description":78,"author_name":22,"view_count":79,"vote_count":24,"lang_type":25,"type":26,"type_label":27},242,"b0f93b10-3339-4ca0-ad20-d6335a3d7785","kotaemon-open-source-rag-document-chat-b0f93b10","Kotaemon — Open-Source RAG Document Chat","Clean, open-source RAG tool for chatting with your documents. Supports PDF, DOCX, web pages. Multi-model, citation, and multi-user. Self-hostable. 25K+ stars.",232,{"id":81,"uuid":82,"slug":83,"title":84,"description":85,"author_name":86,"view_count":87,"vote_count":24,"lang_type":25,"type":26,"type_label":27},2824,"bf323939-d2b6-4426-aa9f-9325666e7eaa","cohere-rerank-boost-rag-accuracy-with-rerank-3","Cohere Rerank — Boost RAG Accuracy with Rerank-3","Cohere Rerank scores candidates against a query using a cross-encoder. 
Drop into any RAG to boost top-1 hit rate by 30-50% over vector search alone.","Cohere",98,{"id":89,"uuid":90,"slug":91,"title":92,"description":93,"author_name":22,"view_count":94,"vote_count":24,"lang_type":25,"type":26,"type_label":27},389,"4c628f43-c803-45c8-ae39-a4caded80419","pdfmathtranslate-translate-pdf-papers-preserving-format-4c628f43","PDFMathTranslate — Translate PDF Papers Preserving Format","Translate PDF scientific papers while preserving math formulas, charts, and layout. Supports Google, DeepL, OpenAI, Ollama. CLI, GUI, MCP, Docker, Zotero plugin.",241,"tokrepo install pack\u002Fpdf-research-paper-rag",{"pageType":97,"pageKey":8,"locale":25,"title":98,"metaDescription":99,"h1":100,"tldr":101,"bodyMarkdown":102,"faq":103,"schema":119,"internalLinks":124,"citations":137,"wordCount":150,"generatedAt":151},"pack","PDF + Research Paper RAG Pack — 10 Tools for Chatting With a Stack of Papers","Zerox, OpenDataLoader PDF, Surya, Cherry Studio KB, Pinecone Assistant, PageIndex, RAGFlow, Kotaemon, Cohere Rerank, PDFMathTranslate. Ingest, parse, embed, retrieve, rerank, translate — a complete RAG pipeline for researchers, analysts, and lawyers buried in PDFs.","PDF + Research Paper RAG Pack — A Working Pipeline for People Buried in Papers","Ten picks arranged as a real RAG pipeline for PDF-heavy work. Parse first (Zerox \u002F OpenDataLoader \u002F Surya), then index (Cherry Studio KB \u002F Pinecone Assistant \u002F PageIndex), then chat (RAGFlow \u002F Kotaemon), then rerank (Cohere Rerank), with PDFMathTranslate for non-English papers. Drop 200 PDFs in tonight and have a conversation with them by morning.","## What's in this pack\n\nIf you are a researcher, analyst, or lawyer, the bottleneck is not search — it is the **PDF**. Papers, contracts, filings, white papers, regulator memos. Most arrive as 1990s-era PDFs with two-column layouts, scanned pages, embedded tables, footnotes that matter more than the body. 
Throwing them at a general-purpose chatbot fails for the same three reasons every time: parsing is wrong, retrieval is dumb, and the model never sees the right chunk.\n\nThis pack is structured as a **pipeline**, not a shopping list. Each tool owns one stage, and the install order is the order in which data flows. Unlike the [PhD Researcher's Literature + Code Pack](\u002Fen\u002Ftopics\u002Fphd-researcher-lit-code), which covers literature search and code reproduction, this pack assumes you already have the PDFs and need to actually **talk to the corpus**.\n\n## Install in this order\n\n### Stage 1 — Parse (turn PDFs into clean markdown)\n\n1. **Zerox** — vision-model OCR for any PDF. Converts pages to images and asks GPT-4o or Claude to return clean markdown. Wins on dirty scans, two-column papers, and contracts where layout matters. The bet: a frontier vision model beats a 2018 OCR stack on hard PDFs, and you pay only when you actually run it.\n2. **OpenDataLoader PDF** — text-first parser tuned for AI ingestion. Preserves structure (sections, tables, lists) into clean JSON or markdown. Faster and cheaper than Zerox for born-digital PDFs (papers from arXiv, recent contracts). Run this first; fall back to Zerox for the 10% that fail.\n3. **Surya** — open-source OCR for 90+ languages. Mandatory if your corpus has Chinese, Japanese, Arabic, or Cyrillic-script papers. Runs locally — confidential drafts never leave your machine.\n\n### Stage 2 — Index (embed and store the parsed text)\n\n4. **Cherry Studio Knowledge Base** — local RAG with native support for 50+ formats. The fastest way to drop a folder of PDFs and get a chat UI on top, all on your laptop. Start here unless you need multi-user or cloud.\n5. **Pinecone Assistant** — managed RAG service with auto-indexing. When the corpus crosses ~10k documents or your team needs shared access, Pinecone Assistant handles ingestion, embedding, retrieval, and citations without you wiring anything yourself. You trade privacy for scale.\n6. 
**PageIndex** — document index for reasoning-based RAG. Instead of flat chunk embeddings, PageIndex builds a hierarchical table-of-contents-aware index. The retrieval quality on long papers (40+ pages) is visibly better because the model can reason about *where in the document* an answer lives.\n\n### Stage 3 — Chat (the user-facing layer)\n\n7. **RAGFlow** — deep document understanding RAG engine. The best open-source option for tables, complex layouts, and citation-grounded answers. Self-hosted, runs on Docker, includes a complete chat UI with source highlighting.\n8. **Kotaemon** — open-source RAG document chat (the ChatPDF clone people actually keep using). Lighter than RAGFlow, easier to deploy, hot-swappable LLMs, multi-PDF chat works out of the box.\n\n### Stage 4 — Rerank and Translate\n\n9. **Cohere Rerank** — boost RAG accuracy with Rerank-3. Drop it in behind any retriever to re-score the candidates it returns before they reach the model. The single highest-leverage 10 lines of code you can add to a RAG stack — typical relevance lift is 20-40% on noisy corpora.\n10. **PDFMathTranslate** — translate PDF papers while preserving the original layout, equations, and figures. 
Essential if half your reading list is in another language and you want a side-by-side comparison before feeding it to the index.\n\n## How the stages fit together\n\n```\nPDFs in folder\n   │\n   ├─ OpenDataLoader (born-digital, fast)\n   │\n   ├─ Zerox (dirty scans, complex layouts)\n   │\n   └─ Surya (non-English OCR)\n         │\n         ▼\n   clean markdown + structure\n         │\n         ├─ Cherry Studio KB (local, laptop scale)\n         │\n         ├─ Pinecone Assistant (cloud, team scale)\n         │\n         └─ PageIndex (long-doc, reasoning-aware)\n               │\n               ▼\n         ┌─────────────────┐\n         │ RAGFlow         │\n         │ or Kotaemon     │\n         │ (chat UI)       │\n         └─────────────────┘\n               │\n               + Cohere Rerank between retrieval and the LLM\n               + PDFMathTranslate before ingest for non-EN papers\n```\n\nThe critical insight: **most failing RAG demos fail at parse, not at retrieval**. If your tables come out as `Table 1` with no data, no retriever fixes that. Spend Day 1 on Stage 1; the rest gets easier.\n\n## Tradeoffs you'll hit\n\n- **Local vs cloud** — Cherry Studio KB and Kotaemon run on your laptop; Pinecone Assistant ships your text to a vendor. For confidential corpora (legal, medical, M&A), stay local.\n- **RAGFlow vs Kotaemon** — RAGFlow has the better table parser and citation UI; Kotaemon is easier to deploy and customize. Pick RAGFlow if your corpus is table-heavy (financials, scientific papers); Kotaemon for prose-heavy (legal memos, white papers).\n- **Zerox cost** — vision-model OCR is roughly $0.01-0.03 per page on GPT-4o. A 200-paper corpus at 30 pages average runs $60-180 once. For ongoing pipelines, route only failed parses to Zerox.\n- **Cohere Rerank API key** — adds a third-party dependency. 
If that's a dealbreaker, you can self-host a reranker (BGE-reranker, Jina), but the integration work is real.\n\n## Common pitfalls\n\n- **Chunk size set blindly to 512 tokens** — fine for general text, catastrophic for papers where a method section runs 4000 tokens. Match chunk size to the document type.\n- **No source highlighting in the chat UI** — researchers won't trust an answer without seeing the page. RAGFlow and Kotaemon both do this well; if you build your own UI, ship citations from day one.\n- **Ingesting before parsing is verified** — open 5 random parsed outputs by hand before pushing 200 PDFs through the embedder. Bad parsing pollutes the index irreversibly.\n- **Forgetting to rerank** — every team adds Cohere Rerank in week 3 after complaining about retrieval quality. Add it in week 1.",[104,107,110,113,116],{"q":105,"a":106},"Do I need all ten tools, or can I start with two or three?","Start with three: one parser (OpenDataLoader PDF for born-digital, or Zerox for dirty scans), one index (Cherry Studio Knowledge Base for laptop scale), and one chat UI (Kotaemon). That triple gets you a working multi-PDF chat in an afternoon. Add Cohere Rerank in week 2 once you feel retrieval quality is the bottleneck, then layer in PageIndex for long documents and PDFMathTranslate for non-English papers. The full stack only makes sense at corpus sizes above a few hundred documents.",{"q":108,"a":109},"How is this different from the PhD Researcher's Literature + Code Pack?","Different stages of the research workflow. The PhD pack covers literature search, reference management, and reproducing paper code (Zotero, arXiv MCP, GPT Researcher, JupyterLab, AI Scientist). This pack assumes you already have the PDFs in a folder and need to extract structured information from them at scale — that means a real RAG pipeline with parse, index, retrieve, rerank stages. 
Many researchers use both: PhD pack to gather papers, this pack to interrogate them.",{"q":111,"a":112},"Is any of this safe for confidential documents like legal contracts or medical records?","Yes, if you stick to the local-first stack. Surya runs OCR on your machine, Cherry Studio Knowledge Base and Kotaemon both run fully local with local LLM backends (Ollama, llama.cpp), and RAGFlow can be self-hosted in Docker on private infrastructure. The cloud picks (Pinecone Assistant, Cohere Rerank, Zerox via GPT-4o\u002FClaude) all send text off-machine, so route those only to non-sensitive corpora. The Lawyer's AI Contract Review Kit on TokRepo covers privacy-aware tooling in more depth.",{"q":114,"a":115},"What about tables and figures inside the PDFs — do these actually extract them well?","Tables are the hardest part of PDF parsing. RAGFlow has the strongest built-in table parser among open-source options, and OpenDataLoader PDF preserves table structure as JSON when the source PDF is well-tagged. Zerox handles complex layouts because the vision model sees the page like a human would. Figures and equations are harder — PDFMathTranslate is currently the best open option for equations specifically, and for figures most teams settle for keeping the image reference and letting the chat UI surface the original page.",{"q":117,"a":118},"How long does it take to go from a folder of PDFs to a working chat UI?","On a laptop with Cherry Studio KB or Kotaemon, you can be chatting with a small corpus (under 50 PDFs of born-digital text) in about 30 minutes — most of that is the initial parse and embed. A larger corpus (500 PDFs with scans and tables) takes a couple of hours of pipeline work: parse pass with OpenDataLoader, fallback pass with Zerox on the failures, ingest into RAGFlow, then a tuning pass on chunk size and reranker. 
After that the marginal cost of adding a new PDF is seconds.",{"@context":120,"@type":121,"name":13,"description":122,"numberOfItems":123,"inLanguage":25},"https:\u002F\u002Fschema.org","ItemList","Ten tools arranged as a complete RAG pipeline for PDF-heavy work — parse, index, chat, rerank, translate.",10,[125,129,133],{"url":126,"anchor":127,"reason":128},"\u002Fen\u002Ftopics\u002Fphd-researcher-lit-code","PhD Researcher's Literature + Code Pack","Sibling pack covering lit search and code reproduction — pairs naturally with this PDF-focused pipeline",{"url":130,"anchor":131,"reason":132},"\u002Fen\u002Ftopics\u002Flawyer-ai-contract-kit","Lawyer's AI Contract Review Kit","Confidential-document tooling, complements local-first picks in this pack",{"url":134,"anchor":135,"reason":136},"\u002Fen\u002Fai-tools-for\u002Frag","All RAG tools on TokRepo","Browse the broader RAG catalog beyond this curated pipeline",[138,142,146],{"claim":139,"source_name":140,"source_url":141},"Zerox uses vision models to OCR PDFs into markdown","Zerox GitHub","https:\u002F\u002Fgithub.com\u002Fgetomni-ai\u002Fzerox",{"claim":143,"source_name":144,"source_url":145},"RAGFlow is a deep document understanding RAG engine","RAGFlow GitHub","https:\u002F\u002Fgithub.com\u002Finfiniflow\u002Fragflow",{"claim":147,"source_name":148,"source_url":149},"Cohere Rerank improves retrieval relevance for RAG","Cohere Rerank docs","https:\u002F\u002Fdocs.cohere.com\u002Fdocs\u002Frerank-overview",920,"2026-05-22T12:00:00Z"]