# Tavily Extract — Pull Clean Content from Any URL

> Tavily Extract converts up to 20 URLs into LLM-ready markdown in one API call. It skips ads, navigation, and footers, and returns clean prose with citation metadata.

## Install

Copy the content below into your project:

## Quick Use

1. Have a Tavily API key ready (from the search asset)
2. `client.extract(urls=[...], extract_depth="advanced")` — pass up to 20 URLs
3. Iterate over `response["results"]` for clean markdown per URL

---

## Intro

Tavily Extract takes a list of URLs and returns clean, LLM-ready markdown — no HTML, no ads, no nav menus, no cookie banners. It accepts up to 20 URLs per call, with `extract_depth: advanced` for tricky sites.

Best for: agents that have a list of URLs (from Search, your own sources, or user input) and need the actual content.

Works with: Tavily REST API, Python / TypeScript SDK.

Setup time: 2 minutes.

---

### Extract clean content

```python
import os

from tavily import TavilyClient

client = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])

response = client.extract(
    urls=[
        "https://docs.anthropic.com/en/docs/claude-code",
        "https://docs.cursor.com/composer",
        "https://docs.continue.dev/intro",
    ],
    extract_depth="advanced",  # vs "basic" — slower but cleaner on JS-heavy sites
    include_images=False,
)

for result in response["results"]:
    print(result["url"])
    print(result["raw_content"][:500])
    print(f"({len(result['raw_content'])} chars total)")

# Failed URLs (404, blocked, etc.) are listed separately
for failed in response["failed_results"]:
    print(f"FAILED: {failed['url']} — {failed['error']}")
```

### Pair with Search for full RAG

```python
search = client.search(query="claude code subagents best practices", max_results=10)
urls = [r["url"] for r in search["results"]]

# Get full content for the top 5
extracts = client.extract(urls=urls[:5], extract_depth="advanced")

# Now feed both summaries (from search) and full text (from extract) to an LLM
context = "\n\n".join(e["raw_content"] for e in extracts["results"])
```

### Cost vs Search
| Endpoint | Cost | Output |
|---|---|---|
| `/search` | 1-2 credits | Snippets + answer + URLs |
| `/extract` (basic) | 1 credit / URL | Full markdown of 1 URL |
| `/extract` (advanced) | 2 credits / URL | Full markdown of 1 URL, JS rendering |

For RAG: use Search to find URLs, then Extract for the ones worth deep-reading. Don't extract every search result — most are already summary-quality in the Search output.

---

### FAQ

**Q: How is Tavily Extract different from Firecrawl?**
A: Both produce LLM-ready markdown. Firecrawl is dedicated to scraping, with more knobs (Crawl, Map, structured Extract via schema). Tavily Extract is the URL-to-content companion of Tavily Search, optimized for batch extraction during agent runs. Different ergonomics, similar output.

**Q: Does it handle paywalls?**
A: No — Tavily Extract respects paywalls. It returns the public preview content, not the paywalled article. For internal authenticated sources, use Tavily's enterprise tier with custom auth.

**Q: Can I extract images?**
A: Yes — set `include_images=True`. The response includes image URLs and alt text. Images are linked, not downloaded; you'd fetch them separately if needed.

---

## Source & Thanks

> Built by [Tavily](https://github.com/tavily-ai). Commercial product with free tier.
>
> [tavily.com/docs/extract](https://docs.tavily.com/docs/api-reference/endpoint/extract) — Extract docs

---
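The cost table translates into a quick budget estimate before an agent run. This is a minimal sketch, assuming the credit prices listed in the table (1 credit per URL for basic, 2 for advanced, up to 2 credits for a search); the function name and structure are illustrative, not part of the Tavily SDK.

```python
def extract_credits(url_count: int, extract_depth: str = "basic") -> int:
    """Estimate Extract credits: 1 credit/URL for basic, 2 for advanced."""
    per_url = {"basic": 1, "advanced": 2}[extract_depth]
    return url_count * per_url

# Example: one search (up to 2 credits) plus advanced-extracting the top 5 results
total = 2 + extract_credits(5, "advanced")  # 2 + 10 = 12 credits worst case
```

A helper like this makes it easy to cap spend per agent step, e.g. by shrinking the extract batch until the estimate fits a budget.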
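Extract caps a single call at 20 URLs, so an agent that accumulates a longer list needs to batch. A sketch of that batching, assuming only the 20-URL limit stated in this document; `extract_fn` stands in for `client.extract` so the helper itself stays independent of the SDK.

```python
from typing import Callable, List

MAX_URLS_PER_CALL = 20  # Tavily Extract's per-call limit

def extract_in_batches(urls: List[str], extract_fn: Callable[..., dict]) -> dict:
    """Call extract_fn on chunks of at most 20 URLs and merge the responses."""
    merged: dict = {"results": [], "failed_results": []}
    for start in range(0, len(urls), MAX_URLS_PER_CALL):
        batch = urls[start:start + MAX_URLS_PER_CALL]
        response = extract_fn(urls=batch, extract_depth="advanced")
        merged["results"].extend(response.get("results", []))
        merged["failed_results"].extend(response.get("failed_results", []))
    return merged

# In a real run you would pass the SDK method directly:
#   merged = extract_in_batches(all_urls, client.extract)
```

Passing the extract function as a parameter also makes the helper trivial to test with a stub.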
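The extraction loop can be folded into a small helper that splits successes from failures and assembles the RAG context string in one pass. A sketch over the response shape shown in the examples (`results` entries with `url`/`raw_content`, `failed_results` entries with `url`/`error`); the sample response below is fabricated for illustration.

```python
from typing import List, Tuple

def build_context(response: dict) -> Tuple[str, List[str]]:
    """Join successful raw_content into one context string; return failed URLs separately."""
    context = "\n\n".join(r["raw_content"] for r in response["results"])
    failed = [f["url"] for f in response.get("failed_results", [])]
    return context, failed

# Illustrative response in the shape the Extract examples use:
sample = {
    "results": [
        {"url": "https://docs.continue.dev/intro", "raw_content": "# Intro\nContinue is..."},
    ],
    "failed_results": [
        {"url": "https://example.com/gone", "error": "404"},
    ],
}
context, failed = build_context(sample)
```

Returning the failed URLs separately lets the caller decide whether to retry them (for example with a different `extract_depth`) or simply log them.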
Source: https://tokrepo.com/en/workflows/tavily-extract-pull-clean-content-from-any-url
Author: Tavily