Startup VC Research Toolkit — Investor Due Diligence AI Stack
Ten picks for the early-stage VC, angel, or FoF analyst who has to source deals on Monday, dig a company on Tuesday, background-check the founder on Wednesday, read three 10-Ks on Thursday, and ship the IC memo on Friday. Deep research + company intel + founder background + filings + memo synthesis. AI handles the slog; you keep the call.
What's in this pack
This is the stack for an early-stage investor who already has 40 emails, 12 decks in the inbox, and one IC slot left on Friday — not a thought-leader's "AI for VCs" essay. Every pick here does one job in a real diligence week: open a sourcing thesis, dig signals on a private company, sanity-check the founder's last 3 jobs, read the 10-K of the public comp, and draft the memo your partners will actually red-line.
The pack is agent-assisted on purpose, not autonomous. Investors don't lose LPs over speed; they lose LPs over conviction in the wrong companies. So the workflow keeps you in the loop at every judgment point — AI handles the bulk search, scraping, transcription, and first drafts; you read the primary sources and own the price.
No "learn LangGraph for two weeks first" — most of these install as a Claude Code subagent or an MCP server. Pick one deep-research agent, one search API, one scraper, one filings tool, and you can use the whole pack on Monday morning.
What's in this pack — 5 layers
Layer 1 — Sourcing / deep research (broad sweep on a thesis or category)
- GPT Researcher — autonomous lit-review agent for "who's building X right now"
- STORM — multi-perspective expert simulation for mapping a category from cold
- Perplexity Sonar API — search-grounded LLM in a single call when you need a fast cited answer
Layer 2 — Company intel (turning a name into evidence)
- Exa — AI-native search API for surfacing companies, blog posts, hiring signals
- Tavily Search — search API built for agent RAG pipelines, cleaner JSON than scraping SERPs
- Web-Check — OSINT dashboard pulling DNS, tech stack, headers, certs, archive history on any domain
Layer 3 — Founder background + scraping
- Apify MCP Server — 8,000+ pre-built scrapers (LinkedIn, GitHub, news, reviews) callable from an agent
Layer 4 — Financials / public filings
- SEC EDGAR MCP Server — query 10-K, 10-Q, S-1, 8-K filings from agents for late-stage / pre-IPO comps
- OpenBB — open-source investment research platform aggregating equities, crypto, macro, alt-data
Layer 5 — Memo / IC synthesis
- Claude Code Agent: Research Analyst — synthesize all of the above into the structured IC memo paragraphs
Install in this order
1. GPT Researcher - Autonomous Research Report Agent - start here, because nothing else matters if your sourcing thesis is mush. Give it "open-source agent frameworks raising seed in 2026" or "who's building local-first sync engines" and you get a multi-source, cited report you can paste into Notion. Apache-2.0, Python package, 26k+ GitHub stars. Burns OpenAI + Tavily tokens; budget a few cents per report.
2. STORM - AI Research Report Generator (Stanford) - when the category itself is unfamiliar ("what is *-as-code in security and who matters"), STORM's multi-perspective expert conversation surfaces angles a single prompt misses. Stanford-built, Wikipedia-style cited output. Run it before GPT Researcher when you don't even know the right sub-questions yet.
3. Perplexity Sonar API - Search-Grounded LLM in One Call - the fast cited answer for "was this founder's last company actually acquired, or did it shut down". Single API call, citations included, much faster than running a full research agent for a yes/no question. Pay-per-call, so cost scales with usage.
4. Exa - AI-Native Search API for Agent Pipelines - the search backend you wire into your own diligence agent when SERP-scraping breaks. Returns structured results with content extraction, plus neural search for "find startups similar to ". Use it as the search layer your other agents call.
5. Tavily Search - Search API Built for AI Agents - the alternative search backend. Cleaner for general web RAG, generous free tier. Pair with Exa: Tavily for breadth, Exa for neural-similarity queries. GPT Researcher uses Tavily under the hood by default.
6. Web-Check - All-in-One Website OSINT and Analysis Dashboard - the moment a deck arrives, run the company domain through Web-Check. DNS history, hosting, tech stack, security headers, certs, archived versions. Tells you in 60 seconds whether the company is 6 weeks old, who hosts them, and whether their stack matches the deck claims.
7. Apify MCP Server - 8,000+ Web Scrapers for Agents - founder background pulls (LinkedIn job history, GitHub commit activity, Twitter/X reach), customer-review scrapes (G2, Capterra), and news monitoring, all via one MCP server. Pay-per-result rather than per-scraper. Read each scraper's ToS notes before running, LinkedIn especially.
8. SEC EDGAR MCP Server - Query Filings from Agents - when you're benchmarking a Series B against a late-stage or public comp, this lets your agent pull the comp's last 10-K, 10-Q, or S-1 and answer "what's their revenue growth, gross margin, and rule-of-40 trend" without you opening sec.gov/edgar 30 times. Free: EDGAR needs no API key, though the SEC asks automated clients to identify themselves in the User-Agent header and to respect its rate limits.
9. OpenBB - Open-Source Investment Research Platform - the financial-data hub: equities, options, crypto, macro, alt-data, all behind a unified Python / CLI interface. Useful for sector benchmarking ("how is the public dev-tools index trading vs 2023") and for the chart you need in slide 12 of the memo.
10. Claude Code Agent: Research Analyst - the synthesis layer. Feed it the GPT Researcher report from step 1, the Web-Check + Apify pulls from steps 6-7, and the EDGAR comp data from step 8, and it produces the structured IC memo sections (thesis, market, team, traction, risks, ask). First draft only: your job is to push back on the parts that hallucinate revenue numbers or smooth over the team gap.
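The comp questions in the EDGAR step (revenue growth, gross margin, rule of 40) are plain arithmetic once the figures are out of the filing. A minimal sketch, assuming the common definition of the rule of 40 as year-over-year revenue growth plus a profit margin, both in percent. Which margin to use (EBITDA, operating, free cash flow) is your firm's choice; operating margin is only a stand-in here. The `FiscalYear` type, the function names, and the sample figures are hypothetical and do not come from any real filing.

```python
from dataclasses import dataclass


@dataclass
class FiscalYear:
    """One year of figures read from a comp's 10-K (illustrative values only)."""
    year: int
    revenue: float
    gross_profit: float      # revenue minus cost of revenue
    operating_profit: float  # stand-in for whichever margin your firm uses


def yoy_growth_pct(prev: FiscalYear, curr: FiscalYear) -> float:
    """Year-over-year revenue growth, in percent."""
    return (curr.revenue - prev.revenue) / prev.revenue * 100.0


def gross_margin_pct(fy: FiscalYear) -> float:
    """Gross profit as a percentage of revenue."""
    return fy.gross_profit / fy.revenue * 100.0


def rule_of_40(prev: FiscalYear, curr: FiscalYear) -> float:
    """Growth % plus profit margin %; 40 or more passes the heuristic."""
    margin_pct = curr.operating_profit / curr.revenue * 100.0
    return yoy_growth_pct(prev, curr) + margin_pct


# Hypothetical comp, not real filing data.
fy1 = FiscalYear(2023, revenue=200.0, gross_profit=150.0, operating_profit=-20.0)
fy2 = FiscalYear(2024, revenue=260.0, gross_profit=200.2, operating_profit=13.0)

print("growth %:", round(yoy_growth_pct(fy1, fy2), 1))
print("gross margin %:", round(gross_margin_pct(fy2), 1))
print("rule of 40 score:", round(rule_of_40(fy1, fy2), 1))
```

Recomputing these yourself from the filing's income statement is also a cheap check on whatever numbers the agent reports back.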
How they fit together
Sourcing thesis
┌────────────────────────────────────────────┐
│ STORM (cold map) ──► GPT Researcher        │
│                      (deep sweep, cited)   │
│ Perplexity Sonar (fast Q&A)                │
└──────────────────────┬─────────────────────┘
                       │
                       ▼ names / domains / founders
┌────────────────────────────────────────────┐
│ Exa + Tavily ── search layer for agents    │
│ Web-Check ───── OSINT on the domain        │
│ Apify MCP ───── founder + customer scrapes │
└──────────────────────┬─────────────────────┘
                       ▼ evidence corpus per company
┌────────────────────────────────────────────┐
│ SEC EDGAR MCP ── public comp filings       │
│ OpenBB ───────── sector / market context   │
└──────────────────────┬─────────────────────┘
                       ▼
      Claude Code Agent: Research Analyst
      (IC memo draft: thesis · market
       · team · traction · risks · ask)
                       │
                       ▼
              Partner review / IC
The critical join is signal → deal → due diligence → memo → IC. Each handoff is where investors lose the thread: the sourcing thesis never becomes a real deal pipeline, the diligence pulls never make it into the memo, the memo arrives at IC without comp benchmarks. These ten tools are picked so that each stage's output feeds the next, and you can walk the chain end to end from a single agent setup.
Tradeoffs you'll hit
- Data freshness vs cost — Perplexity Sonar and GPT Researcher both pull live web; private-company data is often weeks-to-quarters stale (last fundraise, headcount, customer count). For Series A and earlier, the founder's primary deck is more current than any scraper — use AI to verify and contextualize, not to replace the call. EDGAR is current to filing date; OpenBB market data is delayed unless you wire a paid feed.
- Scraping legality — Apify scrapers for LinkedIn, Glassdoor, and Crunchbase violate those sites' ToS even when technically legal in your jurisdiction. The platform-cost story is one thing; the litigation story (hiQ vs LinkedIn, Bright Data vs Meta) is another. Two safer postures: (a) use the platform's official API where one exists (LinkedIn Sales Navigator API for funded firms), or (b) limit scraping to data the founder will hand you anyway and use Apify for adjacent signals (GitHub, news, reviews). If you scrape, scrape with attribution and rate limits.
- AI hallucination on private companies — LLMs will confidently state a private company's ARR, headcount, and last round if you ask. Often wrong. For private-co claims, require a citation that's either (a) a primary source the company published or (b) a paywalled DB you have a license to (PitchBook, CB Insights, Sourcescrub). Treat any unsourced AI claim about a private company as suspect — the model is interpolating from old TechCrunch articles.
- Memo template lock-in — Research Analyst and Report Generator agents both default to a generic IC memo template. If your firm has a battle-tested template (sections, scoring rubric, comp table format), paste it as context and the agent will conform. If you don't have one, use the agent's default and start a template. Don't let the AI design your memo format — that's your investment-committee taste, not a productivity question.
- OpenBB vs paid data — OpenBB is free and aggregates a lot, but the deeper sector / private-market data is behind paid integrations (FMP, Polygon, Tiingo). For seed/Series A diligence, OpenBB's free public-market data is usually enough as context. For late-stage / growth, budget for a real data subscription.
- STORM vs just asking Claude with web search — Claude with web search is faster for a single question. STORM earns its slot when you need (a) a cited report you'll attach to the memo, (b) a category map across 8-12 sub-questions in one pass, or (c) the multi-perspective simulation that surfaces angles a single prompt misses. For sourcing-thesis work, the citation trail matters — partners will challenge claims.
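The citation rule for private-company claims (primary source or licensed database, otherwise suspect) can be checked mechanically before a draft reaches a partner. A minimal sketch: `Claim`, `SourceKind`, and `needs_review` are hypothetical names, the URLs are placeholders, and deciding whether a source really is primary remains a human call that the code only records.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class SourceKind(Enum):
    PRIMARY = "primary"        # published by the company itself
    LICENSED_DB = "licensed"   # a paid database your firm has a license for
    OTHER = "other"            # press coverage, scraped pages, model output


@dataclass
class Claim:
    text: str
    source_url: Optional[str] = None
    source_kind: SourceKind = SourceKind.OTHER


def needs_review(claims: List[Claim]) -> List[Claim]:
    """Return claims with no citation or with a source kind that is not accepted."""
    accepted = {SourceKind.PRIMARY, SourceKind.LICENSED_DB}
    return [c for c in claims if not c.source_url or c.source_kind not in accepted]


# Placeholder claims for illustration only.
draft = [
    Claim("ARR is $4M"),
    Claim("Raised a seed round", "https://example.com/blog/seed", SourceKind.PRIMARY),
    Claim("Headcount is 35", "https://example.com/news-article", SourceKind.OTHER),
]

for claim in needs_review(draft):
    print("SUSPECT:", claim.text)
```

Anything the function returns goes back to the analyst for a primary-source check or gets cut from the memo.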
Common pitfalls
- Not verifying sources — AI search agents return citations. They are not all real. GPT Researcher and Perplexity both occasionally hallucinate URLs or attribute a quote to the wrong article. Always click through citations before pasting into a memo. The five minutes you save by trusting the agent become the IC meeting where a partner clicks the link and finds it's a dead page.
- Scraping in violation of ToS without a legal review — hiQ Labs vs LinkedIn is a multi-year case and the precedent is still narrow. Apify will happily run any of its 8,000 scrapers; that doesn't mean your firm should run all of them. Bring a one-pager to your GC about which sites you're scraping and at what rate, especially for any data-room or data-product you'll later sell to LPs.
- AI fabricating founder background — the most dangerous failure mode. "This founder previously sold a company to Salesforce" is the kind of claim a model will manufacture from one ambiguous LinkedIn line. Always verify founder employment history against the source (LinkedIn profile direct, the acquirer's press release, the founder's own bio). Never paste an AI-summarized founder bio into a memo without spot-checking three claims.
- Memo template applied too rigidly — IC memos exist to provoke the right argument at the IC meeting. If the agent gives you a beautifully structured memo and the actual conversation should be "this is a category bet not a company bet", rewrite the structure. The template is a starting point, not the final shape of the argument.
- Building the agent stack before sourcing a deal — the worst investor failure mode is research-procrastination dressed up as tooling. Pick three tools (GPT Researcher + Web-Check + Research Analyst), open a thesis on Monday, surface 5 names by Wednesday, dig 2 by Friday. The other seven tools earn their slot only when the first three are saturated.
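Clicking through citations is the part of the first pitfall that is easy to script. The sketch below pulls URLs out of a memo draft and reports the ones that fail to load, using only the Python standard library. The function names and the User-Agent string are hypothetical. A live page does not prove the quote is actually on it, so this only removes the dead-link embarrassment; reading the source is still on you.

```python
import re
import urllib.error
import urllib.request
from typing import Callable, Dict, List, Optional

URL_PATTERN = re.compile(r"https?://[^\s)\]>]+")


def extract_urls(text: str) -> List[str]:
    """Return unique URLs in order of first appearance, minus trailing punctuation."""
    urls: List[str] = []
    for match in URL_PATTERN.findall(text):
        url = match.rstrip(".,;\"'")
        if url not in urls:
            urls.append(url)
    return urls


def http_status(url: str, timeout: float = 10.0) -> Optional[int]:
    """Fetch a URL and return its HTTP status code, or None if it is unreachable."""
    request = urllib.request.Request(url, headers={"User-Agent": "memo-link-check/0.1"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered, but with an error status
    except (OSError, ValueError):
        return None      # DNS failure, timeout, malformed URL


def dead_citations(
    text: str, fetch: Callable[[str], Optional[int]] = http_status
) -> Dict[str, Optional[int]]:
    """Map each cited URL that is unreachable or returns 4xx/5xx to its status."""
    statuses = {url: fetch(url) for url in extract_urls(text)}
    return {url: s for url, s in statuses.items() if s is None or s >= 400}


# Offline demo with a stubbed fetcher; drop the `fetch` argument to hit the network.
memo = "Acquired in 2021 (https://example.com/press). See https://example.org/dead, too."
fake_statuses = {"https://example.com/press": 200, "https://example.org/dead": 404}
print(dead_citations(memo, fetch=fake_statuses.get))
```

The `fetch` parameter exists so the check can be tested without network access and swapped for a rate-limited client later.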
Frequently asked questions
Can AI actually read Crunchbase, PitchBook, or Sourcescrub data?
Only if your firm has a license and access to the platform's API or a data export. Crunchbase has a paid API (Enterprise tier ~$400/mo+); PitchBook and Sourcescrub require enterprise licenses and don't have open APIs. Once you have programmatic access, you can pipe the data into your own agent context (a Claude Code subagent reading a Crunchbase JSON export works well). Scraping these sites without a license is both a ToS violation and, in some interpretations, a CFAA risk; do not do it. For early-stage diligence without a paid DB, Apify scrapers on public sites (LinkedIn employee count, GitHub activity, company website) plus founder-provided data are the realistic substitute. The AI doesn't make up for missing data sources; it makes the sources you legitimately have go further.
What's the minimum number of independent sources to background-check a founder?
Three, and they need to be genuinely independent. (1) LinkedIn for employment history — verify titles and dates against the company's own announcements where possible. (2) A search on the founder's name + 'lawsuit', + 'fired', + 'scandal' across Exa and Perplexity — most reputational issues surface here if they exist. (3) 2-3 backchannel references from people who worked with the founder but are not on the reference list the founder gave you. The 'backchannel' is the one AI cannot replace; LinkedIn warm-intro plus a real 15-minute call beats any agent output. AI tools are leverage for steps 1 and 2; step 3 is the actual signal. Anyone telling you a fully-automated founder background-check is sufficient is selling a product, not doing diligence.
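Step 2 is the most scriptable of the three. A small sketch that builds the name-plus-risk-term queries described above, ready to hand to whichever search client you use. `reputational_queries` is a hypothetical helper, and the quoted-phrase syntax is an assumption that suits keyword-style search; a neural search backend may do better with a plain-language question.

```python
from typing import List

# The three risk terms named in the answer above.
RISK_TERMS = ("lawsuit", "fired", "scandal")


def reputational_queries(founder: str, company: str = "") -> List[str]:
    """Build one exact-phrase query per risk term, optionally scoped to a company."""
    name = f'"{founder.strip()}"'
    scope = f' "{company.strip()}"' if company.strip() else ""
    return [f"{name}{scope} {term}" for term in RISK_TERMS]


# Placeholder founder and company names.
for query in reputational_queries("Jane Doe", "Acme Robotics"):
    print(query)
```

Scoping by company name helps when the founder has a common name, at the cost of missing issues from earlier jobs, so run it both with and without the company.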
Is the Apify LinkedIn scraper compliant?
It is a gray area in flux. In hiQ Labs v. LinkedIn (9th Circuit, 2022) the court held that scraping publicly accessible LinkedIn data likely does not violate the CFAA, but LinkedIn's terms of service still prohibit it and LinkedIn has pursued civil claims against scrapers, including breach-of-contract claims. The practical investor posture: (a) get a written sign-off from your GC on what data sources you scrape and at what rate, (b) prefer official APIs where available (LinkedIn Sales Navigator API, GitHub API, Twitter/X API), (c) document data lineage for anything that ends up in an LP-facing artifact, (d) treat scraped data as a lead, not as a fact for the memo; re-verify in a primary source before citing. If you can't pass a 'how did you get this' question from an LP, don't put it in the memo.
How long should an IC memo be when the first draft is AI-generated?
Shorter than the AI wants to give you. The Research Analyst agent will happily produce 4,000 words. Real IC memos at most early-stage firms are 2-4 pages of dense prose, plus appendix exhibits. The AI's job is to compress the diligence corpus into the right 1,500 words; your job is to ruthlessly cut anything the partners don't need to make the call. Sections that earn their place: thesis (what you believe and why), market (TAM with the assumption stated, not the TechCrunch number), team (the specific reason these humans win this market), traction (the one chart that matters), risks (the three things that kill it), ask (round, price, ownership). Everything else goes in an appendix that nobody reads but everybody asks about. If your IC memos are longer than 4 pages, the AI is adding noise, not value.
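A word budget is easy to enforce before the draft goes to partners. A minimal sketch, assuming the memo is held as a dict of section name to text; the six section names and the 1,500-word target come from the answer above, while `memo_report` and the dict layout are hypothetical.

```python
import re
from typing import Dict, List, Union

SECTIONS = ("thesis", "market", "team", "traction", "risks", "ask")
WORD_BUDGET = 1500  # target length from the answer above

Report = Dict[str, Union[int, Dict[str, int], List[str]]]


def word_count(text: str) -> int:
    """Count whitespace-separated tokens."""
    return len(re.findall(r"\S+", text))


def memo_report(memo: Dict[str, str], budget: int = WORD_BUDGET) -> Report:
    """Summarize section lengths, overage, missing sections, and appendix candidates."""
    counts = {name: word_count(body) for name, body in memo.items()}
    total = sum(counts.values())
    return {
        "counts": counts,
        "total": total,
        "over_budget_by": max(0, total - budget),
        "missing": [s for s in SECTIONS if not memo.get(s, "").strip()],
        "extra": [s for s in memo if s not in SECTIONS],  # move these to the appendix
    }


# Placeholder draft: two bloated sections and one that belongs in the appendix.
draft = {"thesis": "word " * 900, "market": "word " * 700, "background": "history"}
report = memo_report(draft)
print(report["total"], report["over_budget_by"], report["missing"], report["extra"])
```

The report tells you where to cut; it does not tell you what to cut, which is the part that stays with you.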
What single step in due diligence is most worth automating with these tools?
Initial pass on inbound decks: company-domain OSINT (Web-Check), founder LinkedIn pull (Apify), public-comp pricing-page scan (Exa + Tavily) — all kicked off the moment a deck arrives, with a 1-page brief in your inbox before you read the deck yourself. This compresses the 'is this worth a first call' decision from 90 minutes of manual googling to a 15-minute structured read, and it scales with deck volume in a way human attention does not. The deepest diligence (reference calls, customer calls, financial modeling) is the last thing to automate — those are signal-generating activities, not data-gathering ones.
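The "kick everything off when the deck arrives" pattern is a fan-out over independent pulls followed by one merge. A minimal sketch using the standard library's thread pool. The pull functions are stubs standing in for your own Web-Check, Apify, and search wrappers; none of the names below are real client APIs.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def triage_brief(domain: str, founder: str, pulls: Dict[str, Callable[[], str]]) -> str:
    """Run all pulls concurrently and merge their summaries into a plain-text brief."""
    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=len(pulls) or 1) as pool:
        futures = {name: pool.submit(fn) for name, fn in pulls.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result(timeout=60)
            except Exception as err:  # one failed pull should not sink the brief
                results[name] = f"FAILED: {err}"
    lines = [f"Triage brief: {domain} / {founder}"]
    lines += [f"[{name}] {summary}" for name, summary in results.items()]
    return "\n".join(lines)


def failing_pull() -> str:
    """Stub that simulates a scraper hitting a rate limit."""
    raise RuntimeError("rate limited")


# Stub pulls; replace each callable with a wrapper around your own tool.
brief = triage_brief(
    "example.com",
    "Jane Doe",
    {
        "osint": lambda: "domain registered recently",
        "founder": failing_pull,
        "comps": lambda: "3 public pricing pages found",
    },
)
print(brief)
```

Failures are written into the brief rather than raised, so a blocked scraper shows up as a visible gap instead of a missing email.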