Workflows · May 8, 2026 · 4 min read

Tavily Extract — Pull Clean Content from Any URL

Tavily Extract converts up to 20 URLs into LLM-ready markdown in one API call. Skips ads, navigation, footers. Returns clean prose with citation metadata.

Tavily · Community
Agent-ready

This asset can be read and installed directly by agents.

TokRepo exposes a universal CLI command, an install contract, JSON metadata, an adapter-specific plan, and the raw content to help agents judge fit, risk, and next actions.

Stage only · 17/100
Agent surface: Any MCP/CLI agent
Type: Skill
Installation: Stage only
Trust: New
Entry point: Asset
Universal CLI command:
npx tokrepo install 430a3d0e-2b58-496c-91e8-bbdb5ad65572
Introduction

Tavily Extract takes a list of URLs and returns clean LLM-ready markdown — no HTML, no ads, no nav menus, no cookie banners. Up to 20 URLs per call, with extract_depth: advanced for tricky sites. Best for: agents that have a list of URLs (from Search, your own sources, or user input) and need the actual content. Works with: Tavily REST API, Python / TypeScript SDK. Setup time: 2 minutes.


Extract clean content

import os

from tavily import TavilyClient

client = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])

response = client.extract(
    urls=[
        "https://docs.anthropic.com/en/docs/claude-code",
        "https://docs.cursor.com/composer",
        "https://docs.continue.dev/intro",
    ],
    extract_depth="advanced",  # vs "basic" — slower but cleaner on JS-heavy sites
    include_images=False,
)

for result in response["results"]:
    print(result["url"])
    print(result["raw_content"][:500])
    print(f"({len(result['raw_content'])} chars total)")

# Failed URLs (404, blocked, etc.) are listed separately
for failed in response["failed_results"]:
    print(f"FAILED: {failed['url']}: {failed['error']}")
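Since the response separates `results` from `failed_results`, it is easy to turn the successful extracts into citation-ready chunks before prompting. A minimal sketch, assuming the response shape shown above (`url` and `raw_content` per result); the helper itself is ours, not part of the Tavily SDK:

```python
def to_cited_chunks(response: dict, max_chars: int = 2000) -> list[str]:
    """Prefix each extracted document with its source URL for later citation."""
    chunks = []
    for result in response.get("results", []):
        body = result["raw_content"][:max_chars]
        chunks.append(f"Source: {result['url']}\n\n{body}")
    return chunks

# Sample data mimicking an extract() response, for illustration only
sample = {
    "results": [
        {"url": "https://example.com/a", "raw_content": "Alpha " * 100},
        {"url": "https://example.com/b", "raw_content": "Beta " * 100},
    ],
    "failed_results": [],
}

for chunk in to_cited_chunks(sample, max_chars=60):
    print(chunk[:40])
```

Keeping the source URL attached to each chunk means the downstream LLM can cite where a claim came from.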

Pair with Search for full RAG

search = client.search(query="claude code subagents best practices", max_results=10)
urls = [r["url"] for r in search["results"]]

# Get full content for top 5
extracts = client.extract(urls=urls[:5], extract_depth="advanced")

# Now feed both summaries (from search) and full text (from extract) to an LLM
context = "\n\n".join(e["raw_content"] for e in extracts["results"])
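Full-text extracts can blow past a model's context window, so it helps to cap the combined size before prompting. A minimal sketch using a character budget as a rough token proxy; `build_context` is an illustrative helper, not a Tavily API:

```python
def build_context(extracts: list[dict], budget_chars: int = 12000) -> str:
    """Concatenate raw_content fields until a rough character budget is hit."""
    parts, used = [], 0
    for e in extracts:
        remaining = budget_chars - used
        if remaining <= 0:
            break
        piece = e["raw_content"][:remaining]  # truncate the doc that overflows
        parts.append(piece)
        used += len(piece)
    return "\n\n".join(parts)

# Two 8,000-char documents under a 10,000-char budget
docs = [{"raw_content": "x" * 8000}, {"raw_content": "y" * 8000}]
context = build_context(docs, budget_chars=10000)
print(len(context))  # 10002: 8000 + 2000 content chars + one 2-char separator
```

Ordering matters here: put the extracts you trust most first, since later ones get truncated or dropped.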

Cost vs Search

Endpoint Cost Output
/search 1-2 credits Snippets + answer + URLs
/extract (basic) 1 credit / URL Full markdown of 1 URL
/extract (advanced) 2 credits / URL Full markdown of 1 URL, JS rendering

For RAG: use Search to find URLs, then Extract only the ones worth deep-reading. Don't extract every search result — most are already summarized well enough in the Search output.
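The credit math from the table above can be sketched as a tiny estimator (the per-URL rates come from the table; the function name is ours):

```python
def extract_cost(num_urls: int, depth: str = "basic") -> int:
    """Credits for one /extract call: 1 credit per URL (basic), 2 (advanced)."""
    per_url = {"basic": 1, "advanced": 2}[depth]
    return num_urls * per_url

# One search (up to 2 credits) plus deep-extracting the top 5 results:
total = 2 + extract_cost(5, depth="advanced")
print(total)  # 12
```

At 2 credits per advanced URL, extracting all 10 search results would cost 20 credits versus 10 for the top 5 — another reason to be selective.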


FAQ

Q: How is Tavily Extract different from Firecrawl? A: Both produce LLM-ready markdown. Firecrawl is a dedicated scraping product with more knobs (Crawl, Map, structured Extract via schema). Tavily Extract is the URL-to-content companion of Tavily Search, optimized for batch extraction during agent runs. Different ergonomics, similar output.

Q: Does it handle paywalls? A: No — Tavily Extract respects paywalls. It returns the public preview content, not the paywalled article. For internal authenticated sources, use Tavily's enterprise tier with custom auth.

Q: Can I extract images? A: Yes — set include_images=True. The response includes image URLs and alt text. Images are linked, not downloaded; you'd fetch them separately if needed.


Quick Use

  1. Already have a Tavily API key (from search asset)
  2. client.extract(urls=[...], extract_depth="advanced") — pass up to 20 URLs
  3. Iterate response["results"] for clean markdown per URL
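The steps above can be sketched for larger jobs too: since each call takes at most 20 URLs (the cap stated in the intro), a longer list needs batching. A minimal sketch; `batch_urls` is an illustrative helper, not part of the SDK:

```python
def batch_urls(urls: list[str], batch_size: int = 20) -> list[list[str]]:
    """Split a URL list into batches no larger than the per-call cap."""
    return [urls[i : i + batch_size] for i in range(0, len(urls), batch_size)]

urls = [f"https://example.com/{i}" for i in range(45)]
batches = batch_urls(urls)
print([len(b) for b in batches])  # [20, 20, 5]
```

Each batch then becomes one `client.extract(urls=batch, ...)` call.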


Source & Thanks

Built by Tavily. Commercial product with free tier.

tavily.com/docs/extract — Extract docs

