What is Tavily Extract?

Tavily Extract converts up to 20 URLs into LLM-ready markdown in one API call. It skips ads, navigation, and footers, and returns clean prose with citation metadata.

Is Tavily Extract free to use?

This asset is freely available on TokRepo. Tavily itself is a commercial product with a free tier; check the Source & Thanks section on the asset page for license details.

How do I install Tavily Extract?

Visit the asset page on TokRepo and click "Copy for agent" to get the installation instructions. Most assets can be installed with a single command.

Tavily Extract — Pull Clean Content from Any URL

Quick Use

1. Have a Tavily API key ready (the same key used by the Tavily Search asset).
2. Call client.extract(urls=[...], extract_depth="advanced") with up to 20 URLs.
3. Iterate over response["results"] to read the clean markdown for each URL.

Intro

Tavily Extract takes a list of URLs and returns clean, LLM-ready markdown: no HTML, ads, nav menus, or cookie banners. Each call accepts up to 20 URLs, and extract_depth="advanced" is available for JavaScript-heavy or otherwise tricky sites.

Best for: agents that have a list of URLs (from Search, your own sources, or user input) and need the actual content.
Works with: Tavily REST API, Python and TypeScript SDKs.
Setup time: 2 minutes.
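Because each call accepts at most 20 URLs, a longer list has to be split before it is sent. The helper below is a minimal sketch of that batching step. The name batch_urls and the de-duplication step are illustrative assumptions, not part of the Tavily SDK.

```python
from typing import Iterator, List

MAX_URLS_PER_CALL = 20  # per-call limit stated in this document


def batch_urls(urls: List[str], batch_size: int = MAX_URLS_PER_CALL) -> Iterator[List[str]]:
    """Yield de-duplicated URLs, in first-seen order, in lists of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    unique = list(dict.fromkeys(urls))  # drops repeats, keeps first-seen order
    for start in range(0, len(unique), batch_size):
        yield unique[start:start + batch_size]


# Hypothetical URL list, used only to show the batch sizes.
batches = list(batch_urls([f"https://example.com/page/{i}" for i in range(45)]))
print([len(b) for b in batches])  # prints [20, 20, 5]
```

Each yielded batch can then be passed as the urls argument of a single client.extract call.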

Extract clean content

import os

from tavily import TavilyClient

client = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])

response = client.extract(
    urls=[
        "https://docs.anthropic.com/en/docs/claude-code",
        "https://docs.cursor.com/composer",
        "https://docs.continue.dev/intro",
    ],
    extract_depth="advanced",  # vs "basic" — slower but cleaner on JS-heavy sites
    include_images=False,
)

for result in response["results"]:
    print(result["url"])
    print(result["raw_content"][:500])
    print(f"({len(result['raw_content'])} chars total)")

# Failed URLs (404, blocked, etc) listed separately
for failed in response["failed_results"]:
    print(f"FAILED: {failed['url']} — {failed['error']}")
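In an agent loop it is often handier to separate successes from failures once, rather than printing them. The sketch below assumes only the response keys used in the example above (results, failed_results, url, raw_content, error). The helper name and the sample dictionary are made up for illustration, and the sample is not real API output.

```python
from typing import Dict, Tuple


def split_extract_response(response: dict) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (url -> markdown, url -> error message) from one extract response."""
    pages = {
        item["url"]: item.get("raw_content") or ""
        for item in response.get("results", [])
    }
    failures = {
        item["url"]: item.get("error", "unknown error")
        for item in response.get("failed_results", [])
    }
    return pages, failures


# Hand-written sample in the shape used above; not real API output.
sample_response = {
    "results": [{"url": "https://example.com/a", "raw_content": "# Page A\nBody text"}],
    "failed_results": [{"url": "https://example.com/missing", "error": "not found"}],
}

pages, failures = split_extract_response(sample_response)
retry_urls = list(failures)  # candidates for a second attempt
print(len(pages), retry_urls)
```

One possible design is to retry retry_urls once with a different extract_depth and drop whatever still fails, so a single bad URL does not block the rest of the batch.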

Pair with Search for full RAG

search = client.search(query="claude code subagents best practices", max_results=10)
urls = [r["url"] for r in search["results"]]

# Get full content for top 5
extracts = client.extract(urls=urls[:5], extract_depth="advanced")

# Now feed both summaries (from search) and full text (from extract) to an LLM
context = "\n\n".join(e["raw_content"] for e in extracts["results"])
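Joining raw_content directly drops the source URLs and can overflow a model's context window. A hedged variant is sketched below: it labels each block with its URL and truncates long pages. The function name, the separator, and the 4000-character default are arbitrary assumptions chosen for illustration.

```python
from typing import List


def build_context(results: List[dict], max_chars_per_page: int = 4000) -> str:
    """Join extracted pages into one prompt context, labelled by source URL."""
    blocks = []
    for item in results:
        text = (item.get("raw_content") or "").strip()
        if not text:
            continue  # skip pages that came back empty
        if len(text) > max_chars_per_page:
            text = text[:max_chars_per_page].rstrip() + "\n[truncated]"
        blocks.append(f"Source: {item['url']}\n{text}")
    return "\n\n---\n\n".join(blocks)


# Hand-written sample results; not real API output.
sample_results = [
    {"url": "https://example.com/a", "raw_content": "A" * 10},
    {"url": "https://example.com/b", "raw_content": ""},
    {"url": "https://example.com/c", "raw_content": "C" * 50},
]
context = build_context(sample_results, max_chars_per_page=20)
print(context)
```

With the pipeline above, build_context(extracts["results"]) would take the place of the plain join, and the Source lines give the LLM something to cite.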

Cost vs Search

| Endpoint | Cost | Output |
| --- | --- | --- |
| `/search` | 1-2 credits | Snippets + answer + URLs |
| `/extract` (basic) | 1 credit / URL | Full markdown of 1 URL |
| `/extract` (advanced) | 2 credits / URL | Full markdown of 1 URL, JS rendering |

For RAG, use Search to find URLs and Extract only for the ones worth deep-reading. Don't extract every search result: for most of them, the snippet that Search already returns is enough.
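One way to decide which results are worth deep-reading is to filter on the relevance score attached to each search result. The sketch below assumes each result dictionary carries a numeric score field alongside url. The 0.6 threshold, the limit of 5, and the function name are illustrative assumptions to tune for your own data.

```python
from typing import List


def pick_urls_to_extract(
    search_results: List[dict],
    min_score: float = 0.6,  # assumed cut-off, not a Tavily recommendation
    limit: int = 5,
) -> List[str]:
    """Return up to `limit` URLs whose score is at least min_score, best first."""
    ranked = sorted(search_results, key=lambda r: r.get("score", 0.0), reverse=True)
    return [r["url"] for r in ranked if r.get("score", 0.0) >= min_score][:limit]


# Hand-written sample search results; not real API output.
sample_search_results = [
    {"url": "https://example.com/a", "score": 0.91},
    {"url": "https://example.com/b", "score": 0.42},
    {"url": "https://example.com/c", "score": 0.77},
]
selected = pick_urls_to_extract(sample_search_results)
print(selected)  # prints ['https://example.com/a', 'https://example.com/c']
```

Only the selected URLs would then go to Extract, and the remaining results can still contribute their Search snippets to the prompt.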

FAQ

Q: How is Tavily Extract different from Firecrawl?
A: Both produce LLM-ready markdown. Firecrawl is dedicated to scraping and has more knobs (Crawl, Map, structured Extract via schema). Tavily Extract is the URL-to-content companion of Tavily Search, optimized for batch extraction during agent runs. The ergonomics differ, but the output is similar.

Q: Does it handle paywalls?
A: No. Tavily Extract respects paywalls: it returns the public preview content, not the paywalled article. For internal authenticated sources, use Tavily's enterprise tier with custom auth.

Q: Can I extract images?
A: Yes. Set include_images=True and the response includes image URLs and alt text. Images are linked, not downloaded, so you would fetch them separately if needed.

Source & Thanks

Built by Tavily. Commercial product with free tier.

tavily.com/docs/extract — Extract docs
