# Tavily Extract — Pull Clean Content from Any URL

> Tavily Extract converts up to 20 URLs into LLM-ready markdown in one API call. It skips ads, navigation, and footers, and returns clean prose with citation metadata.

## Install

Copy the content below into your project:

## Quick Use

1. Have a Tavily API key ready (from the search asset)
2. `client.extract(urls=[...], extract_depth="advanced")` — pass up to 20 URLs
3. Iterate over `response["results"]` for clean markdown per URL

---

## Intro

Tavily Extract takes a list of URLs and returns clean, LLM-ready markdown — no HTML, no ads, no nav menus, no cookie banners. It accepts up to 20 URLs per call, with `extract_depth: advanced` for tricky sites.

Best for: agents that have a list of URLs (from Search, your own sources, or user input) and need the actual content.

Works with: Tavily REST API, Python / TypeScript SDK.

Setup time: 2 minutes.

---

### Extract clean content

```python
import os

from tavily import TavilyClient

client = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])

response = client.extract(
    urls=[
        "https://docs.anthropic.com/en/docs/claude-code",
        "https://docs.cursor.com/composer",
        "https://docs.continue.dev/intro",
    ],
    extract_depth="advanced",  # vs "basic" — slower but cleaner on JS-heavy sites
    include_images=False,
)

for result in response["results"]:
    print(result["url"])
    print(result["raw_content"][:500])
    print(f"({len(result['raw_content'])} chars total)")

# Failed URLs (404, blocked, etc.) are listed separately
for failed in response["failed_results"]:
    print(f"FAILED: {failed['url']} — {failed['error']}")
```

### Pair with Search for full RAG

```python
search = client.search(query="claude code subagents best practices", max_results=10)
urls = [r["url"] for r in search["results"]]

# Get full content for the top 5
extracts = client.extract(urls=urls[:5], extract_depth="advanced")

# Now feed both summaries (from search) and full text (from extract) to an LLM
context = "\n\n".join(e["raw_content"] for e in extracts["results"])
```

### Cost vs Search
| Endpoint | Cost | Output |
|---|---|---|
| `/search` | 1-2 credits | Snippets + answer + URLs |
| `/extract` (basic) | 1 credit / URL | Full markdown of 1 URL |
| `/extract` (advanced) | 2 credits / URL | Full markdown of 1 URL, JS rendering |

For RAG: use Search to find URLs, then Extract for the ones worth deep-reading. Don't extract every search result — most are already summary-quality in the Search output.

---

### FAQ

**Q: How is Tavily Extract different from Firecrawl?**
A: Both produce LLM-ready markdown. Firecrawl is dedicated to scraping, with more knobs (Crawl, Map, structured Extract via schema). Tavily Extract is the URL-to-content companion of Tavily Search, optimized for batch extraction during agent runs. Different ergonomics, similar output.

**Q: Does it handle paywalls?**
A: No — Tavily Extract respects paywalls. It returns the public preview content, not the paywalled article. For internal authenticated sources, use Tavily's enterprise tier with custom auth.

**Q: Can I extract images?**
A: Yes — set `include_images=True`. The response includes image URLs and alt text. Images are linked, not downloaded; you'd fetch them separately if needed.

---

## Source & Thanks

> Built by [Tavily](https://github.com/tavily-ai). Commercial product with free tier.
>
> [tavily.com/docs/extract](https://docs.tavily.com/docs/api-reference/endpoint/extract) — Extract docs

---
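The cost table translates into a quick budget estimate before an agent run. This is a minimal sketch, assuming the credit prices listed in the table (1 credit per URL for basic, 2 for advanced, up to 2 credits for a search); the function name and structure are illustrative, not part of the Tavily SDK.

```python
def extract_credits(url_count: int, extract_depth: str = "basic") -> int:
    """Estimate Extract credits: 1 credit/URL for basic, 2 for advanced."""
    per_url = {"basic": 1, "advanced": 2}[extract_depth]
    return url_count * per_url

# Example: one search (up to 2 credits) plus advanced-extracting the top 5 results
total = 2 + extract_credits(5, "advanced")  # 2 + 10 = 12 credits worst case
```

A helper like this makes it easy to cap spend per agent step, e.g. by shrinking the extract batch until the estimate fits a budget.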
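Extract caps a single call at 20 URLs, so an agent that accumulates a longer list needs to batch. A sketch of that batching, assuming only the 20-URL limit stated in this document; `extract_fn` stands in for `client.extract` so the helper itself stays independent of the SDK.

```python
from typing import Callable, List

MAX_URLS_PER_CALL = 20  # Tavily Extract's per-call limit

def extract_in_batches(urls: List[str], extract_fn: Callable[..., dict]) -> dict:
    """Call extract_fn on chunks of at most 20 URLs and merge the responses."""
    merged: dict = {"results": [], "failed_results": []}
    for start in range(0, len(urls), MAX_URLS_PER_CALL):
        batch = urls[start:start + MAX_URLS_PER_CALL]
        response = extract_fn(urls=batch, extract_depth="advanced")
        merged["results"].extend(response.get("results", []))
        merged["failed_results"].extend(response.get("failed_results", []))
    return merged

# In a real run you would pass the SDK method directly:
#   merged = extract_in_batches(all_urls, client.extract)
```

Passing the extract function as a parameter also makes the helper trivial to test with a stub.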
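The extraction loop can be folded into a small helper that splits successes from failures and assembles the RAG context string in one pass. A sketch over the response shape shown in the examples (`results` entries with `url`/`raw_content`, `failed_results` entries with `url`/`error`); the sample response below is fabricated for illustration.

```python
from typing import List, Tuple

def build_context(response: dict) -> Tuple[str, List[str]]:
    """Join successful raw_content into one context string; return failed URLs separately."""
    context = "\n\n".join(r["raw_content"] for r in response["results"])
    failed = [f["url"] for f in response.get("failed_results", [])]
    return context, failed

# Illustrative response in the shape the Extract examples use:
sample = {
    "results": [
        {"url": "https://docs.continue.dev/intro", "raw_content": "# Intro\nContinue is..."},
    ],
    "failed_results": [
        {"url": "https://example.com/gone", "error": "404"},
    ],
}
context, failed = build_context(sample)
```

Returning the failed URLs separately lets the caller decide whether to retry them (for example with a different `extract_depth`) or simply log them.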
Source: https://tokrepo.com/en/workflows/tavily-extract-pull-clean-content-from-any-url
Author: Tavily