PhD Pack: Literature + Research Code
Ten picks for the PhD student who does a real literature review and tries to reproduce the code from papers: Zotero, arXiv MCP, GPT Researcher, the academic-researcher agent, Marker, Nougat, JupyterLab, Papermill, Overleaf, AI Scientist. Search → reference manager → PDF parsing → reading → reproduction → writing.
What's in this pack
This is the rig for the PhD student or postdoc who is past the "chat with ChatGPT about my topic" phase and into the much harder work of (a) reading 200 papers properly, (b) tracking what cites what, (c) actually running the code the authors released, and (d) eventually writing something defensible. Every pick here is open-source, actively maintained, and earns its slot in the pipeline.
The sharp edge of this pack is that it refuses to pretend AI is a substitute for reading the methodology section. AI is in the loop for lit triage, PDF cleanup, first-pass summarization, code-repro debugging, and drafting — but a PhD who doesn't actually read the methods is a PhD whose thesis defense goes badly. The tools are arranged so the AI never sits between you and the paper itself, only around it.
Install in this order
- Zotero — reference manager. Start here, on day one of the PhD. Browser connector grabs metadata + PDF in one click, organizes into collections, syncs across devices, generates BibTeX. If you don't have a single source of truth for citations from week one, you will pay for it in month 36.
- arXiv MCP Server — programmatic paper search from inside Claude / Cursor / any MCP-aware client. Search arXiv, fetch metadata and full text, hand a paper to the model with one tool call. The replacement for "open browser, search, copy DOI, paste back".
- GPT Researcher — autonomous lit-review agent. Given a query ("transformer scaling laws compute-optimal training"), it searches multiple sources, synthesizes findings, cites references, produces a draft survey. Use as the first-pass map of an unfamiliar subfield — never as the final citation list.
- Claude Code Agent: Academic Researcher — a Claude Code subagent tuned for academic workflows: structured paper reading, methodology extraction, citation graph traversal. Lives in your Claude Code project so prompts and conventions are version-controlled with your thesis repo.
- Marker — PDF → clean Markdown converter. The single biggest unlock for AI-assisted reading. Marker handles math, tables, figures, multi-column layouts. Convert a 40-page paper to Markdown once, then any LLM can ingest it cleanly without OCR noise eating the methodology.
- Nougat — Meta's neural OCR specifically trained on academic documents. Where Marker is fast and general, Nougat is the heavyweight for equation-dense papers (theoretical ML, physics, math). LaTeX-aware output. Use it when Marker garbles a critical proof.
- JupyterLab — the notebook IDE where you actually run the paper's released code, modify it, plot variants, sanity-check claims. Multi-document workspace, terminal, file browser. Where reproducibility either happens or doesn't.
- Papermill: parameterize and execute notebooks from the command line. Critical when you need to sweep a hyperparameter or the random seed across, say, 12 settings to verify the headline figure isn't a single-seed accident. Pairs with JupyterLab for production-grade experiment runs.
- Overleaf (self-hosted) — collaborative LaTeX. The actual writing environment. Self-hosted variant keeps your unpublished thesis off a third-party server, which matters in fields with strict IP / embargo rules. BibTeX flows in directly from Zotero.
- AI Scientist — Sakana AI's automated end-to-end paper generation system. Not for generating your actual thesis (don't), but a fascinating reference for what the frontier of AI-assisted scientific writing looks like, and a useful tool for generating ablation-experiment writeup drafts you then heavily edit.
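To make "programmatic paper search" from the arXiv MCP Server entry concrete, here is a minimal sketch of a search against the public arXiv API. It is independent of the MCP server and makes no claim about how that server is implemented. The function names (build_query_url, parse_feed, search_arxiv) are hypothetical; the endpoint, the search_query / start / max_results parameters, and the Atom response format are those of the arXiv API.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = {"atom": "http://www.w3.org/2005/Atom"}  # the API answers with an Atom feed


def build_query_url(phrase, max_results=10):
    """Build a search URL that matches the exact phrase in any field."""
    params = {
        "search_query": f'all:"{phrase}"',
        "start": 0,
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urllib.parse.urlencode(params)}"


def parse_feed(xml_text):
    """Turn an Atom feed (as text) into a list of plain dicts."""
    root = ET.fromstring(xml_text)
    papers = []
    for entry in root.findall("atom:entry", ATOM):
        raw_title = entry.findtext("atom:title", default="", namespaces=ATOM)
        papers.append({
            "id": entry.findtext("atom:id", default="", namespaces=ATOM).strip(),
            # Titles are often wrapped over several lines; collapse the whitespace.
            "title": " ".join(raw_title.split()),
            "authors": [
                author.findtext("atom:name", default="", namespaces=ATOM)
                for author in entry.findall("atom:author", ATOM)
            ],
            "published": entry.findtext("atom:published", default="", namespaces=ATOM),
        })
    return papers


def search_arxiv(phrase, max_results=10):
    """Network call: fetch and parse results for one phrase."""
    url = build_query_url(phrase, max_results)
    with urllib.request.urlopen(url, timeout=30) as response:
        return parse_feed(response.read().decode("utf-8"))
```

Calling search_arxiv("compute-optimal training", max_results=5) would return up to five dicts whose id field can be pasted into the Zotero connector. Keeping URL building and feed parsing separate from the network call makes both easy to check offline.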
How they fit together (research workflow)
Lit search
  ┌────────────────────────────────────┐
  │  arXiv MCP  ──►  GPT Researcher    │
  │  (precise)       (broad map)       │
  └─────────────────┬──────────────────┘
                    ▼
          ┌───────────────────┐
          │  Zotero (truth)   │ ◄── BibTeX out to Overleaf
          │  collections +    │
          │  attached PDFs    │
          └─────────┬─────────┘
                    ▼
PDF parse  ┌──────────────────┐
           │  Marker (fast)   │
           │  Nougat (math)   │
           └────────┬─────────┘
                    ▼ clean markdown
         ┌─────────────────────────┐
         │  Academic Researcher    │
         │  Claude Code agent      │ ── summary, citation graph, gaps
         └──────────┬──────────────┘
                    ▼
Reproduce code
           ┌───────────────────┐
           │  JupyterLab       │
           │  + Papermill      │ ── seed sweeps, ablations
           └────────┬──────────┘
                    ▼
Writing
           ┌───────────────────┐
           │  Overleaf         │ ◄── citations from Zotero
           │  + AI Scientist   │     (draft only, you write)
           └───────────────────┘
The spine is Zotero as the single source of truth for what you've read. Everything upstream feeds Zotero; everything downstream reads from it. Without that discipline, the whole pipeline rots into a 4,000-tab browser and a thesis you can't reproduce.
Tradeoffs you'll hit
- AI summarizing vs actually reading — The biggest risk in this pack. GPT Researcher and the Academic Researcher agent will happily summarize a paper in 30 seconds. That summary is good enough to decide whether to read the paper and dangerously misleading as a substitute for reading the methodology. Hard rule: if you cite a paper in your thesis, you read the methods section unaided. AI is for triage, not for cite-by-vibes.
- Reproducibility ceiling: Papermill + JupyterLab let you run released code cleanly, but plenty of papers release code that no longer runs (dead dependencies, missing weights, wrong CUDA version). Budget time for environment archaeology. Pin everything in a conda env export. If a paper's claim collapses on rerun, that's a finding worth a footnote.
- Marker vs Nougat: Marker is faster and handles tables well; Nougat is slower but actually parses LaTeX equations correctly. Run Marker first; reach for Nougat only when the math is the point.
- Self-hosted Overleaf vs the SaaS — SaaS Overleaf is convenient but your draft is on someone else's machine. Self-hosted on your university cluster (or just a Docker container) is the right call for unpublished work. The cost is one afternoon of setup.
- AI Scientist as a tool, not a goal — Generating papers end-to-end with AI is academically and ethically fraught. Treat it as a reference architecture for what's possible, and as a draft-generator for ablation tables — never as a way to bypass the actual scientific contribution.
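The "pin everything" advice in the reproducibility item above can be sketched in a few lines. This is not a replacement for conda env export, which also captures non-Python packages; it is a standard-library-only stand-in that records the exact version of every Python distribution visible to the running interpreter. The function names and the lockfile name are hypothetical.

```python
from importlib import metadata


def pinned_requirements():
    """Return 'name==version' for every installed distribution, sorted by name."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs that recorded no name
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


def write_lockfile(path):
    """Write the pins to a requirements-style file and return how many were written."""
    lines = pinned_requirements()
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
    return len(lines)
```

Running write_lockfile("requirements.lock.txt") inside the environment that reproduced a result, and committing the file next to the notebook, gives a later reader something concrete to rebuild from.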
Common pitfalls
- Over-trusting an AI summary of a methodology — Summarizers compress; methodology details (loss formulation, regularization, data splits) are exactly what gets compressed away. Reviewers ask about exactly the details a summary drops. Read the methods.
- Zotero PDFs scattered across devices: turn on WebDAV or your own sync target on day one. Discovering in year 3 that half your annotated PDFs only exist on a dead laptop is the canonical PhD horror story.
- Notebook-only reproduction: a paper's figure_3.ipynb may run end-to-end but skip the actual training. Read what the notebook does before declaring "reproduced".
- arXiv-only literature: arXiv is fast but biased toward ML / physics / math. For most of biology, social science, and humanities, the lit lives in journals reachable only via institutional access. Use the arXiv MCP for what arXiv covers, not as a universal source.
- Conflating BibTeX entries — Zotero will happily import the same paper twice with slightly different metadata if you click the connector on both arXiv and the journal version. Run a duplicate check before every chapter handoff.
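The duplicate check from the last pitfall can also be run on an exported .bib file, outside Zotero. The following is a rough sketch rather than a full BibTeX parser: it assumes each title field sits on a single line, and it treats two entries as duplicates when their titles match after lowercasing and stripping braces and punctuation. The function names are hypothetical.

```python
import re
from collections import defaultdict

# "@article{key," style entry headers; @string and @comment blocks do not match.
ENTRY_START = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")
# A line that begins with "title =" (so "booktitle =" is ignored).
TITLE_FIELD = re.compile(r"^\s*title\s*=\s*(.+?)\s*,?\s*$", re.IGNORECASE | re.MULTILINE)


def normalize_title(raw):
    """Lowercase and collapse everything that is not a letter or digit."""
    return re.sub(r"[^a-z0-9]+", " ", raw.lower()).strip()


def find_duplicate_titles(bibtex_text):
    """Map each normalized title that appears more than once to its citation keys."""
    starts = list(ENTRY_START.finditer(bibtex_text))
    keys_by_title = defaultdict(list)
    for index, match in enumerate(starts):
        # An entry's body runs until the next entry header (or end of file).
        end = starts[index + 1].start() if index + 1 < len(starts) else len(bibtex_text)
        title = TITLE_FIELD.search(bibtex_text[match.end():end])
        if title:
            keys_by_title[normalize_title(title.group(1))].append(match.group(2))
    return {title: keys for title, keys in keys_by_title.items() if len(keys) > 1}
```

Any key pair this reports still needs a human decision about which version to cite, typically the published one with the preprint noted.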
Frequently asked questions
I'm at the start of my PhD — do I really need all ten of these on day one?
No — install Zotero, JupyterLab, and Overleaf in week one, because those three become muscle memory and migration cost compounds. Add arXiv MCP and the academic-researcher agent in month two once you've found your subfield. Marker, Nougat, Papermill, and AI Scientist arrive when you hit the specific problem each solves — don't preinstall solutions to problems you don't have yet.
Can an AI agent actually do my literature review for me?
Not in any way that survives a thesis defense. GPT Researcher and the academic-researcher agent are excellent at producing a first-pass map of an unfamiliar field — that map is roughly the quality of a third-year undergrad's literature review. Use it to find the seminal papers and identify the major camps, then read those papers yourself. Submitting an AI-generated review as your literature chapter is plagiarism in most universities and intellectual self-sabotage in all of them.
Marker or Nougat — which PDF-to-text tool should I install first?
Install Marker first. It's faster, handles tables and figures well, and covers 90% of papers acceptably. Add Nougat when you start working with equation-heavy theoretical papers — Nougat was trained specifically on academic documents and preserves LaTeX math far better. Running both and picking per-paper is also fine; storage and compute are cheap, missed equations are not.
How do I keep my PhD reproducible if I'm running 50 different notebooks across different papers?
Three rules. (1) Every reproduction lives in its own directory with its own environment.yml or requirements.txt pinned to exact versions. (2) Use Papermill to invoke notebooks via parameters rather than editing in-place — the source notebook stays clean, the run record stays auditable. (3) Save the executed notebook + outputs alongside the input parameters, so two years later you can prove what you ran. Conda environments, git, and a RUNS/ directory of executed Papermill outputs solve 95% of reproducibility pain.
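Rules (2) and (3) above can be sketched as follows. This is a minimal sketch, assuming the notebook has a cell tagged parameters (Papermill's convention for injected values) and actually reads a variable called seed. The names plan_runs and execute_plan, the output naming scheme, and the .params.json sidecar file are hypothetical. papermill is a third-party package, so it is imported only inside execute_plan.

```python
import json
from pathlib import Path


def plan_runs(notebook, runs_dir, seeds, base_params=None):
    """Describe one run per seed without executing anything."""
    stem = Path(notebook).stem
    plan = []
    for seed in seeds:
        parameters = dict(base_params or {})
        parameters["seed"] = seed
        plan.append({
            "input": str(notebook),
            "output": str(Path(runs_dir) / f"{stem}_seed{seed}.ipynb"),
            "parameters": parameters,
        })
    return plan


def execute_plan(plan):
    """Execute each planned run; the source notebook is never modified."""
    import papermill as pm  # third-party: pip install papermill

    for run in plan:
        output = Path(run["output"])
        output.parent.mkdir(parents=True, exist_ok=True)
        # Store the parameters next to the executed notebook as the run record.
        sidecar = output.with_name(output.stem + ".params.json")
        sidecar.write_text(json.dumps(run["parameters"], indent=2, sort_keys=True))
        pm.execute_notebook(run["input"], run["output"], parameters=run["parameters"])
```

For example, execute_plan(plan_runs("figure_3.ipynb", "RUNS", seeds=range(12))) would leave twelve executed notebooks and twelve parameter files under RUNS/. Separating planning from execution means the plan can be inspected or committed before any compute is spent.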
Is it ethical to use AI Scientist or Claude to help write my thesis?
Depends entirely on your university's policy and your honest disclosure. Common consensus as of 2026: AI is fine for outlining, grammar, idea-stress-testing, and generating draft prose you then heavily rewrite — the same way a writing tutor would help. AI is not fine for generating original analysis, fabricating citations, or producing prose you submit unedited. When in doubt, disclose in the methods section. The point of a PhD is that you can defend every sentence; if you can't defend a paragraph an AI wrote, don't include it.