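TOKREPO · Theme Pack

Stable

PDF + Research Paper RAG Toolkit

For researchers, analysts, and lawyers drowning in PDFs and papers: ten tools picked around a real RAG pipeline. Ingest → parse (Zerox / OpenDataLoader / Surya) → embed + index (Pinecone Assistant / PageIndex / Cherry Studio Knowledge Base) → retrieve + chat (RAGFlow / Kotaemon) → rerank (Cohere Rerank) → translate non-English papers (PDFMathTranslate). Install them in order and tonight you can drop 200 PDFs into a folder and chat with them.

10 assets

About this theme pack

What this pack contains

If you are a researcher, analyst, or lawyer, the bottleneck is not search, it is PDFs. Papers, contracts, filings, white papers, regulatory memos. Most are 1990s-style PDFs: two-column layouts, scanned pages, embedded tables, footnotes that matter more than the body text. Feeding them straight to a general-purpose chatbot fails on the same three things every time: the parse is wrong, retrieval is naive, and the model never sees the right chunk.

This pack is organized as a pipeline, not a shopping list. Each tool owns one stage, and the install order is the order in which the data flows. It differs from the PhD Researcher Literature + Code Reproduction Pack, which covers literature search and code reproduction; this pack assumes you already have a pile of PDFs and need to actually converse with the corpus.

How the stages fit together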

A folder of PDFs
   │
   ├─ OpenDataLoader (native digital PDFs, fast)
   │
   ├─ Zerox (dirty scans, complex layouts)
   │
   └─ Surya (non-English OCR)
         │
         ▼
   Clean markdown + structure
         │
         ├─ Cherry Studio Knowledge Base (local, laptop scale)
         │
         ├─ Pinecone Assistant (cloud, team scale)
         │
         └─ PageIndex (long documents, reasoning-aware)
               │
               ▼
         ┌─────────────────┐
         │ RAGFlow         │
         │ or Kotaemon     │
         │ (chat UI)       │
         └─────────────────┘
               │
               + Cohere Rerank on the retrieved candidates
               + PDFMathTranslate before ingest for non-English papers

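Key insight: the vast majority of failed RAG demos die at the parsing stage, not the retrieval stage. If a table comes out as just "Table 1" with no data, no retriever, however clever, can recover it. Spend day 1 on stage 1 and everything downstream gets easier.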
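The parser choice in the diagram above can be sketched as a small routing function. This is only an illustration: `PdfInfo` and its fields are hypothetical and assume an earlier inspection step has already recorded whether a file has a text layer, a complex layout, and which language it is in. The returned strings are plain labels, not calls into any of the three tools.

```python
from dataclasses import dataclass


@dataclass
class PdfInfo:
    """Hypothetical per-file facts collected before parsing."""
    path: str
    has_text_layer: bool   # False for scanned pages
    complex_layout: bool   # multi-column, dense tables, heavy footnotes
    language: str          # ISO 639-1 code such as "en" or "zh"


def pick_parser(info: PdfInfo) -> str:
    """Return a label for the stage-1 tool the diagram suggests."""
    if not info.has_text_layer and info.language != "en":
        return "surya"           # non-English OCR
    if not info.has_text_layer or info.complex_layout:
        return "zerox"           # dirty scans, complex layouts
    return "opendataloader"      # native digital PDFs, fast


if __name__ == "__main__":
    sample = PdfInfo("report.pdf", has_text_layer=True,
                     complex_layout=False, language="en")
    print(pick_parser(sample))
```

Routing on cheap file properties keeps the expensive vision-model path for the files that actually need it.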

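Trade-offs you will hit

Local vs cloud: Cherry Studio Knowledge Base and Kotaemon run on a laptop; Pinecone Assistant sends text to the vendor. For confidential corpora (legal, medical, M&A), stay local.
RAGFlow vs Kotaemon: RAGFlow has stronger table parsing and a stronger citation UI; Kotaemon is simpler to deploy and customize. Pick RAGFlow for table-heavy corpora (financial reports, scientific papers) and Kotaemon for prose-heavy ones (legal memos, white papers).
Zerox cost: vision-model OCR on GPT-4o runs roughly $0.01-0.03 per page. A corpus of 200 PDFs averaging 30 pages (6,000 pages) costs about $60-180 for a one-time pass. For an ongoing pipeline, send only the documents that failed parsing to Zerox.
Cohere Rerank API key: one more third-party dependency. If that is unacceptable, you can self-host a reranker (BGE-reranker, Jina), but the integration cost is real.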
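The Zerox cost estimate above is simple arithmetic, and it is worth redoing with your own numbers. The sketch below reproduces the $60-180 figure from the per-page range quoted in the text. The `fallback_rate` parameter and the 15% value in the second call are illustrative assumptions, not measurements.

```python
def zerox_cost_range(num_docs, avg_pages, fallback_rate=1.0,
                     low_per_page=0.01, high_per_page=0.03):
    """Estimate (pages, low_usd, high_usd) for a vision-model OCR pass.

    fallback_rate is the share of pages actually sent to Zerox;
    1.0 means every page goes through it.
    """
    pages = round(num_docs * avg_pages * fallback_rate)
    return pages, round(pages * low_per_page, 2), round(pages * high_per_page, 2)


if __name__ == "__main__":
    # Full one-time pass over the corpus described in the text.
    print(zerox_cost_range(200, 30))
    # Assumed scenario: only 15% of pages fail the cheaper parser.
    print(zerox_cost_range(200, 30, fallback_rate=0.15))
```

Under that assumed 15% fallback rate the same corpus costs $9-27, which is the reason to reserve Zerox for parse failures.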

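Common pitfalls

Blindly fixing chunk size at 512 tokens: fine for general text, but it breaks down on a paper whose method section alone runs 4,000 tokens. Tune chunk size per document type.
Chat UI without source highlighting: researchers do not trust an answer unless they can see the original page. RAGFlow and Kotaemon both do this; if you build your own UI, ship citations on day one.
Ingesting before verifying the parse: before pushing 200 PDFs into the embedder, open the parse output of 5 random files by hand. A bad parse polluting the index is irreversible.
Forgetting the reranker: almost every team adds Cohere Rerank in week 3, after complaining about retrieval quality. Add it in week 1.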
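The chunk-size pitfall can be made concrete with a per-document-type budget. The sketch below is a minimal illustration, not how RAGFlow or Kotaemon chunk internally. It assumes the parser already split the document into sections, it counts whitespace-separated words as a rough stand-in for tokens, and the `CHUNK_TOKENS` table is a hypothetical config that reuses the two figures from the list above.

```python
# Hypothetical budgets per document type; tune against your own corpus.
CHUNK_TOKENS = {"general": 512, "paper": 4000}


def chunk_sections(sections, doc_type):
    """Keep a section whole if it fits the budget, otherwise split it."""
    budget = CHUNK_TOKENS.get(doc_type, CHUNK_TOKENS["general"])
    chunks = []
    for section in sections:
        words = section.split()  # rough token proxy, not a real tokenizer
        for start in range(0, len(words), budget):
            chunks.append(" ".join(words[start:start + budget]))
    return chunks


if __name__ == "__main__":
    method_section = " ".join(["token"] * 3000)
    print(len(chunk_sections([method_section], "general")))
    print(len(chunk_sections([method_section], "paper")))
```

With the general budget, a 3,000-word method section is cut into six pieces; with the paper budget it stays in one chunk, so a question about the method can retrieve the whole argument.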

Install · one command

$ tokrepo install pack/pdf-research-paper-rag

Hand it to an agent, or paste it into a terminal

What's inside

10 assets, packaged and ready

Skill#01

Zerox — Zero-Shot PDF OCR for AI Pipelines

Extract text from any PDF using vision models as OCR. Zerox converts PDF pages to images then uses GPT-4o or Claude to extract clean markdown without training.

by Script Depot·362 views

$ tokrepo install zerox-zero-shot-pdf-ocr-ai-pipelines-3ac555d9

Skill#02

OpenDataLoader PDF — AI-Ready Document Parser

An open-source PDF parser that automates document accessibility and extracts structured, AI-ready data including tables, text, bounding boxes, and tagged content.

by AI Open Source·293 views

$ tokrepo install opendataloader-pdf-ai-ready-document-parser-841f15d1

Skill#03

Surya — Document OCR for 90+ Languages

Surya is a document OCR toolkit with 19.5K+ GitHub stars. Text recognition in 90+ languages, layout analysis, table detection, reading order, and LaTeX OCR. Benchmarks favorably against cloud OCR serv

by Script Depot·601 views

$ tokrepo install surya-document-ocr-90-languages-66bc0630

Skill#04

Cherry Studio Knowledge Base — Local RAG with 50+ Formats

Cherry Studio Knowledge Base ingests PDFs, Office docs, Markdown into a local vector index. Query offline, BYOK any LLM. Data stays on your machine.

by Cherry Studio·411 views

$ tokrepo install cherry-studio-knowledge-base-local-rag-with-50-formats

Skill#05

Pinecone Assistant — Managed RAG Service with Auto-Indexing

Pinecone Assistant is the fully managed RAG product on Pinecone. Upload PDFs, query with natural language, get cited answers — no chunking pipeline.

by Pinecone·209 views

$ tokrepo install pinecone-assistant-managed-rag-service-with-auto-indexing

Skill#06

PageIndex — Document Index for Reasoning-Based RAG

A document indexing system that enables vectorless retrieval-augmented generation by building structured page-level indexes for LLM reasoning.

by AI Open Source·206 views

$ tokrepo install pageindex-document-index-reasoning-based-rag-7421307d

Skill#07

RAGFlow — Deep Document Understanding RAG Engine

Open-source RAG engine with deep document understanding. Parses complex PDFs, tables, images. Agent-powered Q&A with citations. Multi-model. 77K+ stars.

by Script Depot·485 views

$ tokrepo install ragflow-deep-document-understanding-rag-engine-7785d7a8

Skill#08

Kotaemon — Open-Source RAG Document Chat

Clean, open-source RAG tool for chatting with your documents. Supports PDF, DOCX, web pages. Multi-model, citation, and multi-user. Self-hostable. 25K+ stars.

by Script Depot·402 views

$ tokrepo install kotaemon-open-source-rag-document-chat-b0f93b10

Skill#09

Cohere Rerank — Boost RAG Accuracy with Rerank-3

Cohere Rerank scores candidates against a query using a cross-encoder. Drop into any RAG to boost top-1 hit rate by 30-50% over vector search alone.

by Cohere·293 views

$ tokrepo install cohere-rerank-boost-rag-accuracy-with-rerank-3

Skill#10

PDFMathTranslate — Translate PDF Papers Preserving Format

Translate PDF scientific papers while preserving math formulas, charts, and layout. Supports Google, DeepL, OpenAI, Ollama. CLI, GUI, MCP, Docker, Zotero plugin.

by Script Depot·483 views

$ tokrepo install pdfmathtranslate-translate-pdf-papers-preserving-format-4c628f43

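FAQ

Do I have to install all ten tools? Can I start with 2-3?

Start with three: a parser (OpenDataLoader PDF for native digital PDFs, Zerox for dirty scans), an index (Cherry Studio Knowledge Base at laptop scale), and a chat UI (Kotaemon). This trio gets a working multi-PDF chat running in an afternoon. Add Cohere Rerank in week two when retrieval quality becomes the bottleneck, then PageIndex for long documents, and finally PDFMathTranslate for foreign-language papers. The full set of 10 only makes sense once the corpus exceeds a few hundred documents.

How is this different from the "PhD Researcher Literature + Code Reproduction Pack"?

They cover different stages of the research workflow. The PhD pack handles literature search, reference management, and getting paper code to run (Zotero, arXiv MCP, GPT Researcher, JupyterLab, AI Scientist). This pack assumes your PDFs are already collected in a folder and you need to extract structured information from them at scale, which means a real RAG pipeline: parse, index, retrieve, rerank. Many researchers use both: the PhD pack to collect papers, this pack to interrogate them.

Is it safe for confidential documents such as legal contracts and medical records?

Yes, if you stick to a local-first stack. Surya runs OCR locally; Cherry Studio Knowledge Base and Kotaemon can both run fully local (Ollama / llama.cpp backends); RAGFlow can be self-hosted with Docker on an internal network. The cloud tools (Pinecone Assistant, Cohere Rerank, Zerox via GPT-4o / Claude) send text off the machine, so use them only for non-confidential corpora. The "Lawyer AI Contract Review Toolkit" on TokRepo covers privacy-first tools in more depth.

Can these tools really extract tables and figures from PDFs?

Tables are the hardest part of PDF parsing. Among open-source options, RAGFlow's built-in table parser is the strongest; OpenDataLoader PDF preserves table structure as JSON when the source PDF is well tagged; Zerox copes with complex layouts because the vision model reads the page the way a person does. Figures and formulas are harder. For formulas, PDFMathTranslate is currently the best open-source option; for figures, most teams compromise by keeping an image reference and letting the chat UI jump to the original page.

How long from a folder of PDFs to a usable chat UI?

On a laptop with Cherry Studio Knowledge Base or Kotaemon, a small corpus (under 50 native digital PDFs) is ready to chat with in about 30 minutes, most of it spent on the first parse and embedding. A large corpus (500 PDFs with scans and tables) takes a few hours of pipeline work: run OpenDataLoader first, rerun the failures through Zerox, ingest into RAGFlow, then tune chunk size and the reranker. After that, the marginal cost of adding one new PDF is a matter of seconds.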
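The "run the cheap parser first, rerun the failures" step for large corpora can be sketched as below. The parsers are passed in as plain callables, so nothing here claims to be the real OpenDataLoader or Zerox API; wire in whatever wrappers you use. `looks_parsed` and its 200-character threshold are hypothetical placeholders for a real quality check.

```python
from typing import Callable, Dict, List, Tuple

Parser = Callable[[str], str]  # takes a PDF path, returns markdown


def looks_parsed(markdown: str, min_chars: int = 200) -> bool:
    """Crude quality gate (assumption): enough text came out of the parser."""
    return len(markdown.strip()) >= min_chars


def parse_with_fallback(paths: List[str], primary: Parser,
                        fallback: Parser) -> Tuple[Dict[str, str], List[str]]:
    """Try the cheap parser on every file; retry weak outputs with the fallback."""
    results: Dict[str, str] = {}
    fell_back: List[str] = []
    for path in paths:
        try:
            markdown = primary(path)
        except Exception:
            markdown = ""  # treat a crash like an empty parse
        if not looks_parsed(markdown):
            markdown = fallback(path)
            fell_back.append(path)
        results[path] = markdown
    return results, fell_back


if __name__ == "__main__":
    # Stand-in parsers for demonstration only.
    cheap = lambda p: "" if "scan" in p else "x" * 500
    vision = lambda p: "y" * 500
    _, retried = parse_with_fallback(["a.pdf", "scan_b.pdf"], cheap, vision)
    print(retried)
```

Returning the list of retried files also gives you a ready-made sample to open by hand before ingesting.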

