Jina Reader — AI-Friendly Web Content Extraction
Convert any URL to clean markdown for AI consumption. Free API at r.jina.ai strips ads, navigation, and clutter. Used by AI agents for web research and RAG.
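This asset is staged safely first. The copied instruction tells the agent to read the staged files and to confirm before activating scripts, MCP configuration, or global configuration.
npx -y tokrepo@latest install 9c6cbf5f-e46b-40d6-aaf6-d2a4d5a0e657 --target codex
Files are staged first; read the staged README and install plan before activating.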
What it is
Jina Reader is a web-to-Markdown conversion interface you call over plain HTTP. The core idea is simple: take a URL that a human would open in a browser, and return a cleaned, structured Markdown representation that is easier for LLMs to read than raw HTML.
The workflow pattern is intentionally low-friction. You typically prefix a target URL with https://r.jina.ai/ and fetch the response. Because the result is Markdown, you can directly place it into an agent’s context window, store it as an artifact, or chunk/embed it for retrieval (RAG) without writing custom HTML parsing logic.
TokRepo curates Jina Reader as an “agent ingestion primitive”: treat it like a normalization layer in front of your LLM. Your agent decides what to read; Reader handles the repetitive mechanics of extracting readable content and turning it into a format that plays nicely with downstream prompts.
If you already have a browsing tool, Reader can still fit into the stack. Many teams use it when they want a deterministic “snapshot” that can be cached, diffed, or re-used across multiple agent runs, instead of re-rendering the same page and paying parsing/token costs every time.
How it saves time or tokens
Agentic web research often burns time and tokens on the same three chores:
- Page rendering: modern sites can be JavaScript-heavy, and content may appear only after a client waits for hydration or dynamic rendering.
- Content extraction: even if the HTML arrives, the useful content is mixed with navigation, sidebars, cookie banners, and repeated chrome.
- Normalization: downstream steps (chunking, embedding, prompting) work best when the input is consistent: stable headings, predictable list styles, clean link formatting.
Reader helps by collapsing those chores into a single call: “URL → Markdown.” When you hand a model a compact Markdown page instead of a noisy DOM dump, you reduce prompt size and increase signal density. That typically means fewer “please ignore the header/footer” instructions, fewer retries, and less need for custom extraction prompts.
For RAG pipelines, Reader’s value is that it produces ingestion-friendly text. You can apply the same chunking rules to many sources, and you can store the Markdown as a durable artifact that survives site redesigns better than brittle CSS selectors. For long-running systems, caching is also a big win: if your pipeline can hash the Markdown and avoid re-ingesting unchanged pages, you save both network time and embedding/model costs.
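The hash-and-skip idea can be sketched in a few lines. This is a minimal illustration, not part of Reader itself: the function names `content_hash` and `needs_reingest` and the in-memory `seen` dictionary are assumptions, and a real pipeline would keep the hashes in a database or alongside the vector store.

```python
import hashlib


def content_hash(markdown: str) -> str:
    """Return a SHA-256 hex digest of the Markdown after light normalization."""
    # Normalize line endings and trailing whitespace so purely cosmetic
    # differences do not force a re-ingest.
    lines = markdown.replace("\r\n", "\n").split("\n")
    normalized = "\n".join(line.rstrip() for line in lines).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def needs_reingest(url: str, markdown: str, seen: dict[str, str]) -> bool:
    """Return True (and record the new hash) if the page changed since last time."""
    digest = content_hash(markdown)
    if seen.get(url) == digest:
        return False  # unchanged: skip chunking, embedding, and indexing
    seen[url] = digest
    return True
```

Calling `needs_reingest` before the embedding step means an unchanged page costs one fetch and one hash, with no model or embedding spend.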
Reader is also useful as a guardrail. Instead of letting an agent freely browse and paste arbitrary HTML into the context window, you can enforce a policy: “All web content must enter the prompt through Reader, and must pass a length/token budget before the model sees it.” That makes costs and failure modes more predictable.
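One way to express that policy is a small gate that every Reader response passes through before it reaches the model. The sketch below is an assumption about how such a gate might look: the name `enforce_budget`, the 60,000-character default, and the choice to cut at a paragraph break are illustrative, and a character count is only a rough stand-in for a real token count.

```python
def enforce_budget(markdown: str, max_chars: int = 60_000,
                   on_overflow: str = "truncate") -> str:
    """Return Markdown that fits the budget, or raise if the policy is 'reject'."""
    if len(markdown) <= max_chars:
        return markdown
    if on_overflow == "reject":
        raise ValueError(
            f"Reader output is {len(markdown)} chars; budget is {max_chars}"
        )
    # Prefer cutting at the last paragraph break inside the budget so the
    # model does not see a sentence chopped in half.
    cut = markdown.rfind("\n\n", 0, max_chars)
    if cut <= 0:
        cut = max_chars  # no paragraph break found: fall back to a hard cut
    return markdown[:cut]
```

Rejecting is the stricter option: it forces the caller to re-fetch with a narrower scope instead of silently dropping the tail of the page.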
In practical systems, a lot of the benefit comes from where you insert Reader:
- Before reasoning: normalize content first, then reason. This avoids burning tokens on “cleanup prompts” that try to summarize messy HTML.
- Before indexing: indexing raw HTML tends to produce noisy embeddings. Markdown is usually cleaner, which improves retrieval quality and reduces the need for heavy post-processing.
- Before tool fan-out: if an agent reads multiple sources, normalize each source with the same rules so synthesis is easier (consistent headings, consistent link formatting, fewer surprises).
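For the "before indexing" case, a heading-aware splitter is often enough to turn Reader Markdown into chunks. The following is a simplified sketch under stated assumptions: `chunk_by_heading` is a hypothetical helper, the 2,000-character limit is arbitrary, and the splitter does not understand fenced code, so a `#` comment inside a code block would be treated as a heading.

```python
import re

# ATX-style Markdown heading: one to six '#' characters, whitespace, then text.
HEADING = re.compile(r"^#{1,6}\s+\S")


def chunk_by_heading(markdown: str, max_chars: int = 2000) -> list[str]:
    """Split Markdown at headings, then hard-split any oversized section."""
    sections: list[list[str]] = [[]]
    for line in markdown.splitlines():
        # Start a new section at each heading, unless the current one is empty.
        if HEADING.match(line) and sections[-1]:
            sections.append([])
        sections[-1].append(line)

    chunks: list[str] = []
    for lines in sections:
        text = "\n".join(lines).strip()
        for start in range(0, len(text), max_chars):
            piece = text[start:start + max_chars].strip()
            if piece:
                chunks.append(piece)
    return chunks
```

Because every source passes through the same splitter, chunk boundaries follow the same rule regardless of which site the page came from.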
Reader can also be used as a debugging primitive. When a browse step fails, saving the returned Markdown snapshot next to the agent run gives you a stable artifact to inspect. This reduces the “it worked yesterday” problem caused by dynamic pages changing shape between runs, and it makes it easier to build regression tests for your browsing toolchain.
Finally, Reader’s header-based configuration is useful for gradual hardening. Start with the simplest call (URL prefix → Markdown). When you encounter a failure mode—timeouts, missing content, overly long pages—add a single knob to your wrapper and keep it behind a default-safe policy. Over time you get a small, composable “web ingestion API” that is easier to maintain than dozens of site-specific scrapers.
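A wrapper that grows one knob at a time might look like the sketch below. `X-Respond-With` is the header used in the Example section of this page; `X-Timeout` and `X-Target-Selector` are assumptions here and should be checked against the upstream Reader documentation before you rely on them. `ReaderOptions` and `build_request` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

READER_PREFIX = "https://r.jina.ai/"


@dataclass(frozen=True)
class ReaderOptions:
    respond_with: str = "markdown"         # header shown in the Example section
    timeout_s: Optional[int] = None        # assumption: sent as X-Timeout
    target_selector: Optional[str] = None  # assumption: sent as X-Target-Selector


def build_request(target_url: str,
                  opts: ReaderOptions = ReaderOptions()) -> tuple[str, dict[str, str]]:
    """Return the Reader URL and headers for one fetch; no network call is made."""
    headers = {"X-Respond-With": opts.respond_with}
    # Only send a knob when the caller asked for it, so the default request
    # stays identical to the simplest "URL prefix" call.
    if opts.timeout_s is not None:
        headers["X-Timeout"] = str(opts.timeout_s)
    if opts.target_selector:
        headers["X-Target-Selector"] = opts.target_selector
    return READER_PREFIX + target_url, headers
```

Keeping request construction separate from the HTTP call makes each new knob easy to unit-test without touching the network.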
How to use
- Pick the page you want to read (documentation, a GitHub issue, a blog post, etc.).
- Prefix the URL with https://r.jina.ai/.
- Fetch the result and consume it as Markdown (directly in a prompt, or through your RAG pipeline).
If you are wrapping Reader for an agent, it helps to expose a small set of parameters in your tool schema:
- Output format: Markdown is the default choice for LLM ingestion.
- Timeout / waiting: allow longer waits for SPAs or heavy pages.
- Budget controls: apply a token/length cap before returning content to the model.
- Scope controls: when possible, fetch less (a selector or narrowed context) to keep content small.
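The parameters above could be exposed to an agent roughly as follows. This is a sketch, not a schema from TokRepo or Jina: the tool name, field names, defaults, and limits are all illustrative assumptions, written in the JSON-Schema style that most tool-calling frameworks accept.

```python
FETCH_MARKDOWN_TOOL = {
    "name": "fetch_markdown",
    "description": "Fetch a web page through Reader and return it as Markdown.",
    "parameters": {
        "type": "object",
        "properties": {
            "url": {"type": "string", "description": "Page to read."},
            "timeout_s": {"type": "integer", "minimum": 1, "maximum": 120, "default": 30},
            "max_chars": {"type": "integer", "minimum": 1000, "maximum": 200000, "default": 60000},
            "selector": {"type": "string", "description": "Optional CSS selector to narrow scope."},
        },
        "required": ["url"],
        "additionalProperties": False,
    },
}


def normalize_args(args: dict) -> dict:
    """Apply defaults and clamp numeric knobs so the model cannot exceed limits."""
    props = FETCH_MARKDOWN_TOOL["parameters"]["properties"]
    unknown = set(args) - set(props)
    if unknown:
        raise ValueError(f"unknown arguments: {sorted(unknown)}")
    if "url" not in args:
        raise ValueError("url is required")
    out = {"url": args["url"], "selector": args.get("selector")}
    for key in ("timeout_s", "max_chars"):
        spec = props[key]
        value = int(args.get(key, spec["default"]))
        out[key] = max(spec["minimum"], min(spec["maximum"], value))
    return out
```

Clamping in the wrapper, not in the prompt, means a model that asks for a ten-minute timeout or an unlimited page still gets a bounded request.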
As a practical operating rule: treat Reader output as an artifact, not as transient prompt text. Persist the Markdown (or a hash of it) with your agent run, so you can reproduce decisions later and avoid re-fetching unchanged pages. When you need a “short” version for context windows, generate the summary from the stored Markdown rather than re-browsing the web, so the summary is tied to a stable source snapshot.
For a “high-trust” agent, consider adding two more safety layers:
- Allow/deny lists for domains: many teams only allow an agent to read from documentation sites and trusted domains. Reader makes this easy, because every fetch is one URL.
- Attribution in the prompt: store the original URL alongside the Markdown (or prepend a small header) so downstream reasoning can cite sources accurately and you can trace where claims came from.
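Both layers fit in a few lines around the fetch. The sketch below assumes a hypothetical allowlist (`ALLOWED_DOMAINS`) and hypothetical helper names; the attribution header format mirrors the one used in the Example section.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist; replace with the domains your agent may read.
ALLOWED_DOMAINS = {"docs.python.org", "github.com"}


def is_allowed(url: str, allowed: set[str] = ALLOWED_DOMAINS) -> bool:
    """Allow only http(s) URLs whose host is an allowed domain or a subdomain of one."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    host = (parts.hostname or "").lower()
    # Match on a dot boundary so "github.com.evil.example" is not accepted.
    return any(host == d or host.endswith("." + d) for d in allowed)


def with_attribution(url: str, markdown: str) -> str:
    """Prepend the original URL so downstream prompts can cite the source."""
    return f"Source: {url}\n\n---\n\n{markdown}"
```

Run `is_allowed` on the original target URL, before it is prefixed with the Reader endpoint, so the check applies to the page being read and not to r.jina.ai itself.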
If you are using Reader in a production ingestion pipeline, decide up front how you will handle:
- Rate limiting: back off when you hit limits; do not turn a transient error into an agent loop.
- Retries: a second attempt with a longer timeout can turn a “thin” response into a usable snapshot.
- Deduplication: hash the Markdown and skip re-indexing if it did not change.
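The rate-limiting and retry rules can be combined into one bounded loop. This is a sketch under assumptions: `fetch` is a caller-supplied function that takes a URL and a timeout and returns Markdown, `RateLimited` is a hypothetical exception that function raises on an HTTP 429, and the 200-character threshold for a "thin" response is arbitrary.

```python
import time
from typing import Callable, Sequence


class RateLimited(Exception):
    """Raised by the caller-supplied fetch function on a rate-limit response."""


def fetch_with_retry(fetch: Callable[[str, int], str], url: str,
                     timeouts: Sequence[int] = (15, 45), min_chars: int = 200,
                     sleep: Callable[[float], None] = time.sleep) -> str:
    """Try each timeout once; back off on rate limits, retry longer on thin output."""
    best = ""
    last_error = None
    for attempt, timeout in enumerate(timeouts):
        try:
            text = fetch(url, timeout)
        except RateLimited as exc:
            last_error = exc
            if attempt < len(timeouts) - 1:
                sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s, ...
            continue
        if len(text) >= min_chars:
            return text
        best = max(best, text, key=len)  # keep the longest thin response
    if best:
        return best
    if last_error is not None:
        raise last_error
    return best
```

The number of attempts is fixed by the length of `timeouts`, so a persistent failure surfaces as an error or a thin result instead of an open-ended agent loop.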
Example
```shell
# 1) Convert a page to markdown (URL prefix pattern)
curl 'https://r.jina.ai/https://example.com'

# 2) Ask for markdown explicitly (some clients prefer headers over defaults)
curl -H 'X-Respond-With: markdown' 'https://r.jina.ai/https://github.com/jina-ai/reader'

# 3) When you are cost-sensitive, keep a strict budget in your wrapper:
#    fetch -> measure length -> truncate or reject before sending to your LLM

# A common agent pattern: fetch -> write snapshot -> pass snapshot path to the model
URL='https://r.jina.ai/https://en.wikipedia.org/wiki/Artificial_intelligence'
OUT='/tmp/reader_snapshot.md'
curl -sS -H 'X-Respond-With: markdown' "$URL" > "$OUT"
wc -c "$OUT"
```

```python
import requests

target = "https://docs.example.com/your-page"
url = "https://r.jina.ai/" + target

resp = requests.get(url, headers={"X-Respond-With": "markdown"}, timeout=30)
resp.raise_for_status()
markdown = resp.text

# Example policy: keep the page under a size limit before injecting into prompts
if len(markdown) > 60_000:
    markdown = markdown[:60_000]

# Optional: attach attribution metadata for downstream prompts
attributed = f"Source: {target}\n\n---\n\n{markdown}"
```
When you integrate Reader into an agent framework, keep the interface narrow. The agent should not need to know about extraction internals. A good tool signature is fetch_markdown(url) -> markdown_text, plus a few optional knobs (timeout, max length). Everything else can remain an implementation detail.
Related on TokRepo
- Automation tools — Compose Reader into crawling jobs and RAG ingestion pipelines.
- AI tools for web-scraping — Complementary building blocks for web ingestion and extraction.
Common pitfalls
- Forgetting URL encoding when generating URLs programmatically. If the original URL contains query parameters, encode it correctly before prefixing it.
- Assuming every site is static HTML. Some sites render content late; your wrapper should support longer waits/timeouts, and you should be ready to retry with a different strategy when you get thin output.
- Over-feeding huge pages to a model. Put a hard budget on the Reader output: reject, truncate, or re-fetch with a narrower scope (target selector) before you pay for downstream tokens.
- Caching without invalidation. Caching Markdown snapshots is powerful, but you need a simple invalidation policy (TTL or hash-based re-fetch) so you do not serve stale content forever.
- Leaking secrets in logs. If your agent fetches internal URLs, treat URLs and fetched content as potentially sensitive; do not log full bodies by default.
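For the caching pitfall, even a very small TTL policy prevents stale snapshots from living forever. The sketch below is an in-memory illustration with hypothetical names (`SnapshotCache`, `ttl_s`); a production cache would persist entries and could combine the TTL with a content hash.

```python
import time
from typing import Callable, Optional


class SnapshotCache:
    """In-memory cache of Reader Markdown snapshots with a time-to-live."""

    def __init__(self, ttl_s: float = 3600.0,
                 clock: Callable[[], float] = time.time) -> None:
        self._ttl_s = ttl_s
        self._clock = clock  # injectable so expiry can be tested without waiting
        self._entries: dict[str, tuple[float, str]] = {}

    def get(self, url: str) -> Optional[str]:
        """Return the cached Markdown, or None if it is missing or expired."""
        entry = self._entries.get(url)
        if entry is None:
            return None
        stored_at, markdown = entry
        if self._clock() - stored_at > self._ttl_s:
            del self._entries[url]  # expired: force a fresh fetch
            return None
        return markdown

    def put(self, url: str, markdown: str) -> None:
        """Store a snapshot stamped with the current time."""
        self._entries[url] = (self._clock(), markdown)
```

A caller checks `get` first and only calls Reader on a miss, then stores the result with `put`. The cache never logs bodies, which also keeps it in line with the last pitfall above.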
FAQ
Jina Reader is a web-to-Markdown conversion interface that you call over HTTP. You pass a target URL (often by prefixing it with https://r.jina.ai/) and receive a cleaned Markdown representation that is easier for LLMs to consume than raw HTML. It is useful when you need consistent text extraction for browsing agents, research pipelines, or RAG ingestion, because the downstream model can focus on content instead of page chrome and markup noise.
The upstream project describes Reader as the open-source branch behind the public r.jina.ai and s.jina.ai endpoints, and the TokRepo workflow demonstrates calling the hosted endpoint directly. If you need stronger guarantees (rate limits, SLA, or self-hosting), follow the upstream repository and documentation to understand deployment and terms. For compliance-sensitive workloads, treat the GitHub repository and the live API docs as the source of truth.
Use Reader as a deterministic pre-processing step: fetch Markdown first, then chunk and embed it (RAG) or pass it to your agent’s reasoning prompt. The simplest pattern is URL → Reader Markdown → chunking/cleanup → vector store or context window. This keeps prompts smaller and more consistent than injecting HTML. In tool-calling agents, you can wrap the Reader call as a single tool that returns Markdown text for any URL.
Reader supports request headers to select output and tune behavior. A practical approach is to start with Markdown output for LLM ingestion, then switch to raw HTML only when you need to debug what the page actually returned. The upstream docs list headers for choosing the output format and selecting an engine, among others. Keep these options in your wrapper so your agent can request stricter or more complete fetches when needed.
The most common issues are pages that render content late (single-page apps), pages that block automated clients, and pages that are simply too large for your model budget. To mitigate this, add timeouts and retries, encode URLs correctly, and use tighter extraction (selectors, smaller scopes) when possible. When cost matters, enforce token or length budgets before passing the response into a model, and store hashes to avoid re-ingesting unchanged content.
Sources (3)
- GitHub: jina-ai/reader. Project homepage and canonical documentation for this workflow.
- Jina Reader README (Usage). Reader usage pattern: prefix any URL with r.jina.ai to fetch LLM-friendly output…
- Jina Reader docs. Reader supports request headers to control output format and behavior.
Source and credits
jina-ai/reader — 20k+ stars, Apache 2.0
Related assets
Jina Reader — Convert Any URL to LLM-Ready Text
Convert any URL to clean, LLM-friendly markdown with a simple prefix. Just prepend r.jina.ai/ to any URL. Handles JS-rendered pages, PDFs, and images. 10K+ stars.
Apify MCP Server — 8,000+ Web Scrapers for Agents
Apify MCP Server connects agents to Apify Actors via a hosted endpoint (mcp.apify.com) or local run, turning thousands of web scrapers into callable tools.
Crawl4AI — LLM-Friendly Web Crawling
Open-source web crawler optimized for AI and LLM use cases. Extracts clean markdown, handles JavaScript-rendered pages, and supports structured data extraction.
Notte — Browser Automation MCP for AI Agents
MCP server that turns web browsers into AI agent tools. Notte provides structured browser actions like click, type, navigate, and extract for LLM-driven automation.