MCP Configs · Apr 7, 2026 · 2 min read

Jina Reader — AI-Friendly Web Content Extraction

Convert any URL to clean markdown for AI consumption. Free API at r.jina.ai strips ads, navigation, and clutter. Used by AI agents for web research and RAG.

Agent-ready

Safe staging for this asset

This asset is staged first. The copied prompt asks the agent to inspect the staged files before activating any scripts, MCP config, or global config.

Stage only · 17/100 · Policy: staging
Agent surface: any MCP/CLI agent
Type: MCP config
Installation: stage only
Trust: Established

Safe staging command:
npx -y tokrepo@latest install 9c6cbf5f-e46b-40d6-aaf6-d2a4d5a0e657 --target codex

Files are staged first; activation requires reviewing the README and the staged plan.

TL;DR
A URL prefix API that turns web pages into LLM-ready Markdown for agents and RAG.
§01

What it is

Jina Reader is a web-to-Markdown conversion interface you call over plain HTTP. The core idea is simple: take a URL that a human would open in a browser, and return a cleaned, structured Markdown representation that is easier for LLMs to read than raw HTML.

The workflow pattern is intentionally low-friction. You typically prefix a target URL with https://r.jina.ai/ and fetch the response. Because the result is Markdown, you can directly place it into an agent’s context window, store it as an artifact, or chunk/embed it for retrieval (RAG) without writing custom HTML parsing logic.

TokRepo curates Jina Reader as an “agent ingestion primitive”: treat it like a normalization layer in front of your LLM. Your agent decides what to read; Reader handles the repetitive mechanics of extracting readable content and turning it into a format that plays nicely with downstream prompts.

If you already have a browsing tool, Reader can still fit into the stack. Many teams use it when they want a stable snapshot that can be cached, diffed, or reused across multiple agent runs, instead of re-rendering the same page and paying parsing and token costs every time.

§02

How it saves time or tokens

Agentic web research often burns time and tokens on the same three chores:

  1. Page rendering: modern sites can be JavaScript-heavy, and content may appear only after a client waits for hydration or dynamic rendering.
  2. Content extraction: even if the HTML arrives, the useful content is mixed with navigation, sidebars, cookie banners, and repeated chrome.
  3. Normalization: downstream steps (chunking, embedding, prompting) work best when the input is consistent: stable headings, predictable list styles, clean link formatting.

Reader helps by collapsing those chores into a single call: “URL → Markdown.” When you hand a model a compact Markdown page instead of a noisy DOM dump, you reduce prompt size and increase signal density. That typically means fewer “please ignore the header/footer” instructions, fewer retries, and less need for custom extraction prompts.

For RAG pipelines, Reader’s value is that it produces ingestion-friendly text. You can apply the same chunking rules to many sources, and you can store the Markdown as a durable artifact that survives site redesigns better than brittle CSS selectors. For long-running systems, caching is also a big win: if your pipeline can hash the Markdown and avoid re-ingesting unchanged pages, you save both network time and embedding/model costs.
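The hash-based skip described above can be sketched in a few lines. This is a minimal illustration, not part of Reader itself: the function names and the in-memory `seen` mapping are assumptions, and a real pipeline would keep the hashes in a database.

```python
import hashlib


def content_hash(markdown: str) -> str:
    """Return a SHA-256 hex digest of the Markdown text."""
    # Normalize line endings and trailing whitespace so purely cosmetic
    # differences do not trigger a re-ingest.
    normalized = "\n".join(line.rstrip() for line in markdown.splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def needs_reingest(url: str, markdown: str, seen: dict[str, str]) -> bool:
    """Return True (and record the new hash) if the page changed since last time."""
    digest = content_hash(markdown)
    if seen.get(url) == digest:
        return False  # unchanged: skip chunking and embedding
    seen[url] = digest
    return True
```

Calling `needs_reingest` before the embedding step means an unchanged page costs one hash computation instead of a full re-index.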

Reader is also useful as a guardrail. Instead of letting an agent freely browse and paste arbitrary HTML into the context window, you can enforce a policy: “All web content must enter the prompt through Reader, and must pass a length/token budget before the model sees it.” That makes costs and failure modes more predictable.
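One way to express that policy in code is a small budget gate that every Reader response passes through. The sketch below is an assumption-laden example: it measures characters as a rough stand-in for tokens, the 60,000 soft limit mirrors the example later on this page, and the hard limit and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BudgetResult:
    text: str
    truncated: bool


class OverBudgetError(ValueError):
    """Raised when content exceeds the hard limit and must not reach the model."""


def enforce_budget(
    markdown: str, soft_limit: int = 60_000, hard_limit: int = 200_000
) -> BudgetResult:
    """Reject very large pages outright; truncate moderately large ones."""
    if len(markdown) > hard_limit:
        raise OverBudgetError(f"{len(markdown)} chars exceeds hard limit {hard_limit}")
    if len(markdown) > soft_limit:
        # Prefer cutting at the last paragraph break before the limit,
        # falling back to a plain character cut.
        cut = markdown.rfind("\n\n", 0, soft_limit)
        if cut <= 0:
            cut = soft_limit
        return BudgetResult(markdown[:cut], truncated=True)
    return BudgetResult(markdown, truncated=False)
```

Returning a `truncated` flag lets the caller tell the model that it is seeing a partial page rather than silently dropping content.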

In practical systems, a lot of the benefit comes from where you insert Reader:

  • Before reasoning: normalize content first, then reason. This avoids burning tokens on “cleanup prompts” that try to summarize messy HTML.
  • Before indexing: indexing raw HTML tends to produce noisy embeddings. Markdown is usually cleaner, which improves retrieval quality and reduces the need for heavy post-processing.
  • Before tool fan-out: if an agent reads multiple sources, normalize each source with the same rules so synthesis is easier (consistent headings, consistent link formatting, fewer surprises).
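To illustrate the "before indexing" step, here is a rough sketch of heading-aware chunking applied to Reader's Markdown. It is one possible chunker, not something Reader provides; the function name and the 2,000-character default are assumptions.

```python
import re

# ATX headings: one to six '#' characters followed by whitespace.
HEADING = re.compile(r"^#{1,6}\s")


def chunk_by_heading(markdown: str, max_chars: int = 2000) -> list[str]:
    """Split Markdown into chunks at headings, also flushing when a chunk grows too big.

    Limitations of this sketch: a '#' line inside a fenced code block is
    treated as a heading, and a single line longer than max_chars is kept whole.
    """
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for line in markdown.splitlines():
        starts_section = bool(HEADING.match(line))
        too_big = size + len(line) + 1 > max_chars
        if current and (starts_section or too_big):
            chunks.append("\n".join(current).strip())
            current, size = [], 0
        current.append(line)
        size += len(line) + 1
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]
```

Because every source goes through the same function, chunks from different sites end up with comparable shapes, which is the consistency the list above is after.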

Reader can also be used as a debugging primitive. When a browse step fails, saving the returned Markdown snapshot next to the agent run gives you a stable artifact to inspect. This reduces the “it worked yesterday” problem caused by dynamic pages changing shape between runs, and it makes it easier to build regression tests for your browsing toolchain.

Finally, Reader’s header-based configuration is useful for gradual hardening. Start with the simplest call (URL prefix → Markdown). When you encounter a failure mode—timeouts, missing content, overly long pages—add a single knob to your wrapper and keep it behind a default-safe policy. Over time you get a small, composable “web ingestion API” that is easier to maintain than dozens of site-specific scrapers.

§03

How to use

  1. Pick the page you want to read (documentation, a GitHub issue, a blog post, etc.).
  2. Prefix the URL with https://r.jina.ai/.
  3. Fetch the result and consume it as Markdown (directly in a prompt, or through your RAG pipeline).

If you are wrapping Reader for an agent, it helps to expose a small set of parameters in your tool schema:

  • Output format: Markdown is the default choice for LLM ingestion.
  • Timeout / waiting: allow longer waits for SPAs or heavy pages.
  • Budget controls: apply a token/length cap before returning content to the model.
  • Scope controls: when possible, fetch less (a selector or narrowed context) to keep content small.
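The parameter set above could be modeled as a small options object. This is a sketch under assumptions: only the `X-Respond-With` header appears on this page, so waiting and selector controls are left to a generic `extra_headers` field whose header names you would take from the upstream docs. `ReaderOptions` and `build_request` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Optional

READER_PREFIX = "https://r.jina.ai/"


@dataclass(frozen=True)
class ReaderOptions:
    respond_with: str = "markdown"  # sent as the X-Respond-With header
    timeout_s: float = 30.0         # client-side timeout for the HTTP call
    max_chars: int = 60_000         # budget applied by the wrapper, not by Reader
    # Extra knobs (waiting, selectors): header names must come from upstream docs.
    extra_headers: dict[str, str] = field(default_factory=dict)


def build_request(
    target_url: str, opts: Optional[ReaderOptions] = None
) -> tuple[str, dict[str, str], float]:
    """Return (reader_url, headers, timeout) for the given target URL."""
    opts = opts or ReaderOptions()
    headers = {"X-Respond-With": opts.respond_with, **opts.extra_headers}
    return READER_PREFIX + target_url, headers, opts.timeout_s
```

Keeping the defaults in one frozen dataclass means the agent-facing tool schema and the HTTP call cannot drift apart.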

As a practical operating rule: treat Reader output as an artifact, not as transient prompt text. Persist the Markdown (or a hash of it) with your agent run, so you can reproduce decisions later and avoid re-fetching unchanged pages. When you need a “short” version for context windows, generate the summary from the stored Markdown rather than re-browsing the web, so the summary is tied to a stable source snapshot.

For a “high-trust” agent, consider adding two more safety layers:

  • Allow/deny lists for domains: many teams only allow an agent to read from documentation sites and trusted domains. Reader makes this easy, because every fetch is one URL.
  • Attribution in the prompt: store the original URL alongside the Markdown (or prepend a small header) so downstream reasoning can cite sources accurately and you can trace where claims came from.
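A domain allowlist like the one described above can be checked before any request is made. The sketch below is illustrative only; `is_allowed` and the example domains are assumptions. Note that the check applies to the target URL, not to the Reader URL it gets prefixed with.

```python
from urllib.parse import urlsplit


def is_allowed(url: str, allowed_domains: set[str]) -> bool:
    """Return True if the URL is http(s) and its host is an allowed domain or subdomain."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    host = (parts.hostname or "").lower()
    # Match the domain itself or a true subdomain; a plain suffix check
    # would wrongly accept hosts like "evil-example.com".
    return any(host == d or host.endswith("." + d) for d in allowed_domains)
```

For example, with `{"example.com"}` as the allowlist, `https://api.example.com/v1` passes while `https://example.com.evil.net/` does not.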

If you are using Reader in a production ingestion pipeline, decide up front how you will handle:

  • Rate limiting: back off when you hit limits; do not turn a transient error into an agent loop.
  • Retries: a second attempt with a longer timeout can turn a “thin” response into a usable snapshot.
  • Deduplication: hash the Markdown and skip re-indexing if it did not change.
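The rate-limiting and retry rules above might look like the following bounded loop. Everything here is a sketch: `RateLimited` stands in for however your HTTP layer reports a rate-limit response, the 200-character "thin" threshold is arbitrary, and `fetch` is any callable you supply that takes a timeout and returns text.

```python
import time
from typing import Callable


class RateLimited(Exception):
    """Hypothetical signal that the service asked the client to slow down."""


class FetchFailed(RuntimeError):
    """Raised after all attempts are used up, so the agent stops instead of looping."""


def fetch_with_retries(
    fetch: Callable[[float], str],
    timeouts: tuple[float, ...] = (15.0, 45.0),
    min_chars: int = 200,
    backoff_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each timeout in order; retry on rate limits or thin output."""
    for attempt, timeout in enumerate(timeouts):
        try:
            text = fetch(timeout)
        except RateLimited:
            # Back off exponentially, but do not sleep after the final attempt.
            if attempt < len(timeouts) - 1:
                sleep(backoff_s * (2 ** attempt))
            continue
        if len(text.strip()) >= min_chars:
            return text
        # Thin response: fall through and retry with the next, longer timeout.
    raise FetchFailed(f"no usable content after {len(timeouts)} attempts")
```

The fixed tuple of timeouts is what keeps a transient error from turning into an agent loop: once it is exhausted, the caller gets an exception rather than another try.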
§04

Example

# 1) Convert a page to markdown (URL prefix pattern)
curl 'https://r.jina.ai/https://example.com'

# 2) Ask for markdown explicitly (some clients prefer headers over defaults)
curl -H 'X-Respond-With: markdown' 'https://r.jina.ai/https://github.com/jina-ai/reader'

# 3) When you are cost-sensitive, keep a strict budget in your wrapper:
#    fetch -> measure length -> truncate or reject before sending to your LLM
# A common agent pattern: fetch -> write snapshot -> pass snapshot path to the model
URL='https://r.jina.ai/https://en.wikipedia.org/wiki/Artificial_intelligence'
OUT='/tmp/reader_snapshot.md'
curl -sS -H 'X-Respond-With: markdown' "$URL" > "$OUT"
wc -c "$OUT"

# The same fetch in Python, with a size cap and source attribution
import requests

target = "https://docs.example.com/your-page"
url = "https://r.jina.ai/" + target

resp = requests.get(url, headers={"X-Respond-With": "markdown"}, timeout=30)
resp.raise_for_status()
markdown = resp.text

# Example policy: keep the page under a size limit before injecting into prompts
if len(markdown) > 60_000:
    markdown = markdown[:60_000]

# Optional: attach attribution metadata for downstream prompts
attributed = f"Source: {target}\n\n---\n\n{markdown}"

When you integrate Reader into an agent framework, keep the interface narrow. The agent should not need to know about extraction internals. A good tool signature is fetch_markdown(url) -> markdown_text, plus a few optional knobs (timeout, max length). Everything else can remain an implementation detail.
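A narrow tool of that shape could be sketched as follows, using only the standard library. The `http_get` parameter is an assumption added so the function can be tested or swapped for another HTTP client; the default limits are illustrative.

```python
import urllib.request
from typing import Callable

READER_PREFIX = "https://r.jina.ai/"


def _http_get(url: str, headers: dict[str, str], timeout: float) -> str:
    """Default transport: plain urllib, raising on HTTP error statuses."""
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def fetch_markdown(
    url: str,
    timeout: float = 30.0,
    max_chars: int = 60_000,
    http_get: Callable[[str, dict[str, str], float], str] = _http_get,
) -> str:
    """Fetch a page through Reader and return at most max_chars of Markdown."""
    if not url.startswith(("http://", "https://")):
        raise ValueError(f"unsupported URL: {url!r}")
    text = http_get(READER_PREFIX + url, {"X-Respond-With": "markdown"}, timeout)
    return text[:max_chars]
```

The agent only ever sees `fetch_markdown(url)`; transport, headers, and truncation stay behind the function boundary.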

§06

Common pitfalls

  • Forgetting URL encoding when generating URLs programmatically. If the original URL contains query parameters, encode it correctly before prefixing it.
  • Assuming every site is static HTML. Some sites render content late; your wrapper should support longer waits/timeouts, and you should be ready to retry with a different strategy when you get thin output.
  • Over-feeding huge pages to a model. Put a hard budget on the Reader output: reject, truncate, or re-fetch with a narrower scope (target selector) before you pay for downstream tokens.
  • Caching without invalidation. Caching Markdown snapshots is powerful, but you need a simple invalidation policy (TTL or hash-based re-fetch) so you do not serve stale content forever.
  • Leaking secrets in logs. If your agent fetches internal URLs, treat URLs and fetched content as potentially sensitive; do not log full bodies by default.
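For the caching pitfall, a TTL is the simplest invalidation policy to start with. The class below is a minimal in-memory sketch; `SnapshotCache`, the one-hour default, and the injectable clock are all assumptions made for illustration.

```python
import time
from typing import Callable, Optional


class SnapshotCache:
    """In-memory cache of Markdown snapshots that expire after ttl_s seconds."""

    def __init__(self, ttl_s: float = 3600.0, clock: Callable[[], float] = time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries: dict[str, tuple[float, str]] = {}

    def get(self, url: str) -> Optional[str]:
        """Return the cached Markdown, or None if missing or expired."""
        entry = self._entries.get(url)
        if entry is None:
            return None
        stored_at, markdown = entry
        if self._clock() - stored_at > self._ttl_s:
            del self._entries[url]  # expired: force a re-fetch
            return None
        return markdown

    def put(self, url: str, markdown: str) -> None:
        """Store a snapshot with the current time."""
        self._entries[url] = (self._clock(), markdown)
```

A `None` from `get` is the signal to call Reader again and `put` the fresh snapshot.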

Frequently asked questions

What is Jina Reader?

Jina Reader is a web-to-Markdown conversion interface that you call over HTTP. You pass a target URL (often by prefixing it with https://r.jina.ai/) and receive a cleaned Markdown representation that is easier for LLMs to consume than raw HTML. It is useful when you need consistent text extraction for browsing agents, research pipelines, or RAG ingestion, because the downstream model can focus on content instead of page chrome and markup noise.

Is Jina Reader free to use?

The hosted endpoint at r.jina.ai is free to call, and the TokRepo workflow demonstrates calling it directly. The upstream project describes Reader as the open-source branch behind the public r.jina.ai and s.jina.ai endpoints. If you need stronger guarantees (higher rate limits, an SLA, or self-hosting), follow the upstream repository and documentation to understand deployment and terms. For compliance-sensitive workloads, treat the GitHub repository and the live API docs as the source of truth.

How do I use Jina Reader in an agent or RAG pipeline?

Use Reader as a deterministic pre-processing step: fetch Markdown first, then chunk and embed it (RAG) or pass it to your agent’s reasoning prompt. The simplest pattern is URL → Reader Markdown → chunking/cleanup → vector store or context window. This keeps prompts smaller and more consistent than injecting HTML. In tool-calling agents, you can wrap the Reader call as a single tool that returns Markdown text for any URL.

How can I control the output format (markdown vs html vs text)?

Reader supports request headers to select the output format and tune behavior. A practical approach is to start with Markdown output for LLM ingestion, then switch to raw HTML only when you need to debug what the page actually returned. The upstream docs list headers for choosing the output format and selecting an engine, among others. Keep these options in your wrapper so your agent can request stricter or more complete fetches when needed.

What are common reliability issues when fetching web pages?

The most common issues are pages that render content late (single-page apps), pages that block automated clients, and pages that are simply too large for your model budget. To mitigate this, add timeouts and retries, encode URLs correctly, and use tighter extraction (selectors, smaller scopes) when possible. When cost matters, enforce token or length budgets before passing the response into a model, and store hashes to avoid re-ingesting unchanged content.

Source and acknowledgments

Created by Jina AI. Licensed under Apache 2.0.

jina-ai/reader — 20k+ stars
