Skills · May 8, 2026 · 4 min read

Tavily Extract — Pull Clean Content from Any URL

Tavily Extract converts up to 20 URLs into LLM-ready markdown in one API call. It skips ads, navigation, and footers, and returns clean prose with citation metadata.

Agent-ready

Safe staging for this asset

This asset is staged first. The copied prompt asks you to inspect the staged files before enabling scripts, MCP config, or global config.

Stage only · 29/100 · Policy: staging
Agent surface: Any MCP/CLI agent
Type: Skill
Installation: Stage only
Trust: Community
Entry: Asset

Safe staging command

npx -y tokrepo@latest install 430a3d0e-2b58-496c-91e8-bbdb5ad65572 --target codex

Files are staged first; activation requires reviewing the README and the staged plan.

Introduction

Tavily Extract takes a list of URLs and returns clean, LLM-ready markdown: no HTML, no ads, no nav menus, no cookie banners. It accepts up to 20 URLs per call, and extract_depth="advanced" handles tricky sites. Best for: agents that have a list of URLs (from Search, your own sources, or user input) and need the actual content. Works with: Tavily REST API, Python / TypeScript SDK. Setup time: 2 minutes.


Extract clean content

import os

from tavily import TavilyClient

client = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])

response = client.extract(
    urls=[
        "https://docs.anthropic.com/en/docs/claude-code",
        "https://docs.cursor.com/composer",
        "https://docs.continue.dev/intro",
    ],
    extract_depth="advanced",  # vs "basic" — slower but cleaner on JS-heavy sites
    include_images=False,
)

for result in response["results"]:
    print(result["url"])
    print(result["raw_content"][:500])
    print(f"({len(result['raw_content'])} chars total)")

# Failed URLs (404, blocked, etc.) are listed separately
for failed in response["failed_results"]:
    print(f"FAILED: {failed['url']}: {failed['error']}")
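
Because a single call accepts at most 20 URLs, longer lists have to be split before they are sent. The following is a minimal sketch of one way to do that, not part of the Tavily SDK: `batch_urls` and `extract_all` are hypothetical helper names, and `extract_all` assumes only that `client.extract` returns a dict with `results` and `failed_results` lists, as in the example above.

```python
from typing import Dict, Iterator, List

MAX_URLS_PER_CALL = 20  # per-call limit stated in this article


def batch_urls(urls: List[str], size: int = MAX_URLS_PER_CALL) -> Iterator[List[str]]:
    """Yield de-duplicated URLs, in first-seen order, in chunks of at most `size`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    unique = list(dict.fromkeys(urls))  # drop repeats, keep original order
    for start in range(0, len(unique), size):
        yield unique[start:start + size]


def extract_all(client, urls: List[str], **kwargs) -> Dict[str, list]:
    """Call client.extract once per batch and merge both result lists.

    `client` is any object with an extract(urls=..., **kwargs) method
    that returns a dict shaped like the response shown above (assumption).
    """
    merged: Dict[str, list] = {"results": [], "failed_results": []}
    for batch in batch_urls(urls):
        response = client.extract(urls=batch, **kwargs)
        merged["results"].extend(response.get("results", []))
        merged["failed_results"].extend(response.get("failed_results", []))
    return merged
```

With the client from the example above, `extract_all(client, my_urls, extract_depth="basic")` would issue one request per batch of 20 and return a single merged dict.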

Pair with Search for full RAG

search = client.search(query="claude code subagents best practices", max_results=10)
urls = [r["url"] for r in search["results"]]

# Get full content for top 5
extracts = client.extract(urls=urls[:5], extract_depth="advanced")

# Join the full text from Extract into one context string for the LLM;
# the short snippets from Search remain available in search["results"]
context = "\n\n".join(e["raw_content"] for e in extracts["results"])
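
Joining every page in full can exceed a model's context window, and the plain join drops the source of each passage. The sketch below shows one possible way to label each page with its URL and cap the total amount of page text. `build_context` and the `max_chars` default are illustrative assumptions, not Tavily features; the function relies only on the `url` and `raw_content` keys used above.

```python
from typing import Dict, List


def build_context(results: List[Dict], max_chars: int = 12000) -> str:
    """Join extracted pages into one context string, labelled by source URL.

    `max_chars` caps the total page text; the "Source:" labels are not
    counted against the budget. Pages with empty content are skipped.
    """
    parts = []
    remaining = max_chars
    for item in results:
        text = (item.get("raw_content") or "").strip()
        if not text or remaining <= 0:
            continue
        chunk = text[:remaining]  # truncate the page that crosses the budget
        parts.append(f"Source: {item['url']}\n{chunk}")
        remaining -= len(chunk)
    return "\n\n".join(parts)
```

Called as `build_context(extracts["results"])`, this would return a string that can be placed directly into a prompt, with each passage traceable to its URL.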

Cost vs Search

| Endpoint | Cost | Output |
| --- | --- | --- |
| /search | 1-2 credits | Snippets + answer + URLs |
| /extract (basic) | 1 credit / URL | Full markdown of 1 URL |
| /extract (advanced) | 2 credits / URL | Full markdown of 1 URL, JS rendering |

For RAG: use Search to find URLs, then Extract only the ones worth deep-reading. Don't extract every search result; for most of them, the snippet in the Search output is already enough.
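
To decide how many search results to extract, it can help to estimate the credit cost up front. The sketch below uses the per-URL rates from the table above as assumptions; they are copied from this article, not checked against Tavily's current pricing, so confirm them before relying on the numbers. Both function names are hypothetical.

```python
from typing import List

# Credits per URL, copied from the table above (assumed, not verified pricing).
EXTRACT_CREDITS_PER_URL = {"basic": 1, "advanced": 2}


def estimate_extract_credits(url_count: int, depth: str = "basic") -> int:
    """Return the estimated credits for extracting `url_count` URLs."""
    if url_count < 0:
        raise ValueError("url_count must not be negative")
    if depth not in EXTRACT_CREDITS_PER_URL:
        raise ValueError(f"unknown extract depth: {depth!r}")
    return url_count * EXTRACT_CREDITS_PER_URL[depth]


def pick_urls_within_budget(urls: List[str], credit_budget: int, depth: str = "advanced") -> List[str]:
    """Keep the leading URLs whose extraction fits within `credit_budget`."""
    per_url = EXTRACT_CREDITS_PER_URL[depth]
    affordable = max(0, credit_budget // per_url)
    return urls[:affordable]
```

Since Search returns results in ranked order, passing its URL list to `pick_urls_within_budget` keeps the top-ranked pages and drops the rest once the budget is spent.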


FAQ

Q: How is Tavily Extract different from Firecrawl? A: Both produce LLM-ready markdown. Firecrawl is dedicated to scraping with more knobs (Crawl, Map, structured Extract via schema). Tavily Extract is the URL-to-content companion of Tavily Search, optimized for batch extraction during agent runs. Different ergonomics, similar output.

Q: Does it handle paywalls? A: No — Tavily Extract respects paywalls. It returns the public preview content, not the paywalled article. For internal authenticated sources, use Tavily's enterprise tier with custom auth.

Q: Can I extract images? A: Yes — set include_images=True. The response includes image URLs and alt text. Images are linked, not downloaded; you'd fetch them separately if needed.


Quick Use

  1. Have a Tavily API key ready (the same one used by the Tavily Search asset)
  2. Call client.extract(urls=[...], extract_depth="advanced") with up to 20 URLs
  3. Iterate over response["results"] to get clean markdown per URL


Source & Thanks

Built by Tavily. Commercial product with free tier.

tavily.com/docs/extract — Extract docs
