Scripts · May 23, 2026 · 1 min read

Defuddle — Extract Main Content from Any Web Page as Markdown

TypeScript library that strips navigation, ads, and boilerplate from HTML pages, returning clean Markdown of the main content.

Native · 98/100 · Policy: Allowed
Agent entry point: any MCP/CLI agent
Type: Skill
Install: Single
Trust level: Established
Entry: Defuddle Overview
Universal CLI install command:
npx tokrepo install 48821de0-56e6-11f1-9bc6-00163e2b0d79

Introduction

Defuddle is a TypeScript library created by Steph Ango (kepano) of Obsidian, originally for the Obsidian Web Clipper. It extracts the main content from an HTML page and can convert it to clean Markdown. It removes navigation bars, sidebars, footers, ads, and other page chrome, leaving only the article or document body. Defuddle is useful for web clipping, building read-it-later apps, and preparing web content for LLM ingestion.

What Defuddle Does

  • Parses raw HTML and identifies the primary content region of a page
  • Strips boilerplate elements like headers, footers, cookie banners, and ads
  • Converts the remaining HTML to well-formatted Markdown
  • Extracts metadata including title, author, published date, and description
  • Handles math notation, code blocks, tables, and image references

Architecture Overview

Defuddle uses a scoring algorithm to evaluate DOM nodes based on text density, tag semantics, and structural patterns. It penalizes elements whose class names suggest navigation, sidebars, or footers and rewards article-like containers. After selecting the best content node, a Turndown-based converter transforms the HTML subtree to Markdown with custom rules for code, math, and tables. The library runs both in the browser and in Node.js.
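To make the scoring idea concrete, here is a minimal sketch of a content-scoring heuristic. It is not Defuddle's actual algorithm: the `Candidate` shape, the class-name patterns, and the weights are all hypothetical, chosen only to show how text density, tag semantics, and class names can combine into one score.

```typescript
// Toy model of a DOM element: only the parts this heuristic looks at.
interface Candidate {
  tag: string;       // e.g. "article", "div", "nav"
  className: string; // raw class attribute
  text: string;      // visible text inside the element
  linkText: string;  // the part of that text that sits inside <a> tags
}

// Hypothetical patterns; a real extractor would use much longer lists.
const NEGATIVE = /nav|sidebar|footer|banner|cookie|advert/i;
const POSITIVE = /article|content|post|entry|main/i;

function scoreCandidate(c: Candidate): number {
  const words = c.text.split(/\s+/).filter(Boolean).length;
  // Link density: share of the text that is link text (menus score high here).
  const linkDensity = c.text.length > 0 ? c.linkText.length / c.text.length : 1;

  let score = words * (1 - linkDensity); // reward dense, non-link text
  if (c.tag === "article" || c.tag === "main") score += 25; // semantic tags
  if (POSITIVE.test(c.className)) score += 10;
  if (NEGATIVE.test(c.className)) score -= 50;
  return score;
}

// Return the highest-scoring candidate, or undefined for an empty list.
function pickBest(candidates: Candidate[]): Candidate | undefined {
  return [...candidates].sort((a, b) => scoreCandidate(b) - scoreCandidate(a))[0];
}
```

With these made-up weights, a link-only `nav` block ends up with a negative score, while an `article` with a few sentences of plain text scores well above zero, so `pickBest` selects the article.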

Installation & Configuration

  • Install via npm: npm install defuddle
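  • Use the CLI for quick extraction: npx defuddle parse <url> --md
  • Import programmatically: in the browser, call new Defuddle(document).parse(); in Node.js, use the defuddle/node entry point
  • Access result.title, result.author, result.published, result.description, and the extracted body in result.content (HTML by default, Markdown when the markdown option is enabled)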
  • Configure output options to include or exclude images and links
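Once a result is available, a web clipper typically turns it into a note. The sketch below shows one way to do that with YAML front matter. The `ClipResult` interface is an assumption that mirrors only the metadata fields named above, with the Markdown body assumed to be in `content`. The `toNote` and `yamlString` helpers are hypothetical and not part of Defuddle.

```typescript
// Assumed result shape: a subset of the fields mentioned in the list above.
interface ClipResult {
  title: string;
  author?: string;
  published?: string;
  description?: string;
  content: string; // Markdown body (field name is an assumption of this sketch)
}

// JSON string syntax is valid YAML, so colons and quotes in titles stay safe.
function yamlString(value: string): string {
  return JSON.stringify(value);
}

// Build a Markdown note: front matter with the non-empty fields, then the body.
function toNote(result: ClipResult, sourceUrl: string): string {
  const fields: Array<[string, string | undefined]> = [
    ["title", result.title],
    ["author", result.author],
    ["published", result.published],
    ["description", result.description],
    ["source", sourceUrl],
  ];
  const frontMatter = fields
    .filter((pair): pair is [string, string] => Boolean(pair[1]))
    .map(([key, value]) => `${key}: ${yamlString(value)}`)
    .join("\n");
  return `---\n${frontMatter}\n---\n\n${result.content.trim()}\n`;
}
```

Missing metadata is skipped rather than written as empty keys, which keeps the front matter tidy for pages that expose only a title.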

Key Features

  • Produces LLM-ready Markdown from any web page
  • Handles complex layouts such as multi-column pages; on infinite-scroll pages it extracts whatever content is present in the HTML it receives
  • Preserves code blocks with language hints and math blocks with LaTeX notation
  • Lightweight with no headless browser dependency when given raw HTML
  • Built for the Obsidian Web Clipper, with a focus on note-clipping quality
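Preserving language hints usually means reading a class such as `language-ts` from the `<code>` element and emitting it on the Markdown fence. The following sketch illustrates that idea in isolation. It is not Defuddle's actual conversion rule, and both function names are hypothetical.

```typescript
// Pull a language name out of class names such as "language-ts" or "lang-python".
function languageHint(className: string): string {
  const match = className.match(/(?:^|\s)(?:language|lang)-([\w+#-]+)/);
  return match ? match[1] : "";
}

// Wrap code in a fence that is longer than any backtick run inside the code,
// so code that itself contains fences does not end the block early.
function toFencedBlock(code: string, className = ""): string {
  const runs = code.match(/`+/g) ?? [];
  const longest = runs.reduce((max, run) => Math.max(max, run.length), 0);
  const fence = "`".repeat(Math.max(3, longest + 1));
  const body = code.replace(/\n$/, ""); // drop one trailing newline, if any
  return `${fence}${languageHint(className)}\n${body}\n${fence}`;
}
```

The fence-length check matters for pages that document Markdown itself, where the code sample may already contain a three-backtick fence.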

Comparison with Similar Tools

  • Readability (Mozilla) — extracts content as HTML; Defuddle goes further to produce Markdown
  • Turndown — HTML-to-Markdown only; Defuddle adds content extraction and boilerplate removal
  • MarkItDown (Microsoft) — converts files and documents; Defuddle focuses on web page extraction
  • Jina Reader — cloud API for URL-to-text; Defuddle runs locally without API calls
  • Trafilatura — Python content extractor; Defuddle is TypeScript-native

FAQ

Q: Does Defuddle fetch the web page itself? A: The CLI fetches URLs, but the library accepts raw HTML strings, so you can bring your own fetcher.
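As a rough illustration of the bring-your-own-fetcher pattern, the sketch below keeps fetching and extraction behind two injected functions. The `FetchHtml` and `Extract` types, `clip`, and `withCache` are all hypothetical names. `Extract` stands in for whatever adapter you write around the library and does not claim to match its real signature.

```typescript
interface Extracted {
  title: string;
  content: string;
}

// Both dependencies are injected, so any HTTP client or extractor can be used.
type FetchHtml = (url: string) => Promise<string>;
type Extract = (html: string, url: string) => Promise<Extracted>;

// Fetch the page with the caller's fetcher, then hand the HTML to the extractor.
async function clip(url: string, fetchHtml: FetchHtml, extract: Extract): Promise<Extracted> {
  const html = await fetchHtml(url);
  if (html.trim() === "") throw new Error(`Empty response from ${url}`);
  return extract(html, url);
}

// Wrap any fetcher with an in-memory cache so repeated clips reuse the HTML.
function withCache(fetchHtml: FetchHtml): FetchHtml {
  const cache = new Map<string, Promise<string>>();
  return (url) => {
    let pending = cache.get(url);
    if (!pending) {
      pending = fetchHtml(url);
      cache.set(url, pending);
    }
    return pending;
  };
}
```

Because the fetcher is only a function from URL to HTML string, the same `clip` call works with a plain HTTP client, a cached client, or a headless browser that returns rendered HTML.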

Q: Can Defuddle handle JavaScript-rendered pages? A: Defuddle works on the HTML it receives. For JS-rendered pages, pass the rendered DOM from a headless browser.

Q: How does Defuddle compare to Obsidian Web Clipper? A: Defuddle is the open-source extraction engine. The Obsidian Web Clipper browser extension uses it internally.

Q: Is Defuddle suitable for bulk scraping? A: Defuddle handles content extraction only, not crawling, queueing, or rate limiting. Pair it with a crawler like Crawlee for bulk workflows.
