Defuddle — Extract Main Content from Any Web Page as Markdown

Introduction

Defuddle is a TypeScript library, created by Obsidian CEO Steph Ango (kepano) for the Obsidian Web Clipper, that extracts the main content from an HTML page and can convert it to clean Markdown. It removes navigation bars, sidebars, footers, ads, and other page chrome, leaving only the article or document body. Defuddle is useful for web clipping, building read-it-later apps, and preparing web content for LLM ingestion.

What Defuddle Does

Parses raw HTML and identifies the primary content region of a page
Strips boilerplate elements like headers, footers, cookie banners, and ads
Converts the remaining HTML to well-formatted Markdown
Extracts metadata including title, author, published date, and description
Handles math notation, code blocks, tables, and image references

Architecture Overview

Defuddle uses a scoring algorithm to evaluate DOM nodes based on text density, tag semantics, and structural patterns. It penalizes elements whose class names suggest navigation, sidebars, or footers, and rewards article-like containers. After selecting the best content node, a Turndown-based converter transforms that HTML subtree into Markdown, with custom rules for code, math, and tables. The library runs both in Node.js and in the browser.
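
To make the scoring idea concrete, here is a minimal sketch of a penalty-and-reward node scorer. It is not Defuddle's actual implementation: the ToyNode shape, the word lists, the bonus value, and the function names are all assumptions chosen for illustration, and a real extractor would walk live DOM elements rather than plain objects.

```typescript
// Toy stand-in for a DOM element (hypothetical shape, not a real DOM type).
interface ToyNode {
  tag: string;
  className: string;
  text: string;       // text owned directly by this node
  linkChars: number;  // how many of those characters sit inside links
  children: ToyNode[];
}

// Illustrative word lists; a real extractor would use far richer patterns.
const BOILERPLATE = new Set(["nav", "navbar", "sidebar", "footer"]);
const ARTICLE_LIKE = new Set(["article", "main", "content", "post"]);

// Split the tag and class names into lowercase tokens ("site-nav" -> "site", "nav").
function tokens(n: ToyNode): string[] {
  return [n.tag, ...n.className.split(/[\s_-]+/)]
    .map((t) => t.toLowerCase())
    .filter((t) => t.length > 0);
}

function matches(n: ToyNode, words: Set<string>): boolean {
  return tokens(n).some((t) => words.has(t));
}

// Count non-link text, ignoring any subtree that looks like boilerplate.
function usefulChars(n: ToyNode): number {
  if (matches(n, BOILERPLATE)) return 0;
  const own = Math.max(0, n.text.length - n.linkChars);
  return n.children.reduce((sum, c) => sum + usefulChars(c), own);
}

// Score = useful text, plus a bonus for article-like tags or class names.
function score(n: ToyNode): number {
  if (matches(n, BOILERPLATE)) return -Infinity;
  return usefulChars(n) + (matches(n, ARTICLE_LIKE) ? 50 : 0);
}

// Walk the whole tree and keep the highest-scoring node.
function pickMainContent(root: ToyNode): ToyNode {
  let best = root;
  let bestScore = score(root);
  const stack: ToyNode[] = [...root.children];
  while (stack.length > 0) {
    const n = stack.pop()!;
    const s = score(n);
    if (s > bestScore) {
      best = n;
      bestScore = s;
    }
    stack.push(...n.children);
  }
  return best;
}

const page: ToyNode = {
  tag: "body", className: "", text: "", linkChars: 0, children: [
    { tag: "nav", className: "site-nav", text: "Home About Contact", linkChars: 18, children: [] },
    { tag: "article", className: "post", text: "A long paragraph of real article text goes here.", linkChars: 0, children: [] },
    { tag: "footer", className: "", text: "Copyright notice", linkChars: 0, children: [] },
  ],
};

console.log(pickMainContent(page).tag); // prints "article"
```

The bonus matters because a wrapper such as body contains the same useful text as the article inside it; without a reward for article-like containers, the outermost element would tie with the real content node.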

Self-Hosting & Configuration

Install via npm: npm install defuddle
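Use the CLI for quick extraction: npx defuddle parse <url> --md (the --md flag requests Markdown output)
Import programmatically: in the browser, pass a Document to the Defuddle constructor and call parse(); in Node.js, pass an HTML string to the Defuddle function exported from defuddle/node
Access result.title, result.author, result.content (the extracted body, returned as Markdown when the markdown option is enabled), and other metadata fields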
Configure output options to include or exclude images and links
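
As a rough illustration of consuming the extraction result, the sketch below turns metadata and body into a clipped note with front matter. The ExtractionResult interface is an assumed subset of what the library returns, the Markdown body is assumed to arrive in a field called content (adjust if your version exposes it under another name), and toNote is a hypothetical helper. The commented-out call shows how the Node.js entry point is commonly invoked, but treat it as an assumption and check the package documentation for your version.

```typescript
// Assumed subset of the extraction result; the real library returns more
// fields, and the exact names should be checked against its type definitions.
interface ExtractionResult {
  title: string;
  author: string;
  published: string;
  content: string; // assumed to hold the Markdown body when Markdown output is on
}

// In a real project the result would come from the library, for example
// (assumed Node.js entry point and option name):
//   import { Defuddle } from "defuddle/node";
//   const result = await Defuddle(html, url, { markdown: true });

// toNote is a hypothetical helper: it renders front matter plus the body.
function toNote(result: ExtractionResult, sourceUrl: string): string {
  const fields: Array<[string, string]> = [
    ["title", result.title],
    ["author", result.author],
    ["published", result.published],
    ["source", sourceUrl],
  ];
  const frontMatter = fields
    .filter(([, value]) => value.trim().length > 0) // skip missing metadata
    .map(([key, value]) => `${key}: ${JSON.stringify(value)}`);
  return ["---", ...frontMatter, "---", "", result.content.trim(), ""].join("\n");
}

// Stub result standing in for real extractor output.
const sample: ExtractionResult = {
  title: "Hello",
  author: "",
  published: "2024-01-01",
  content: "# Hello\n\nBody text.\n",
};

const note = toNote(sample, "https://example.com/hello");
console.log(note);
```

Empty metadata fields are skipped rather than written as blank values, since pages often lack an author or a published date.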

Key Features

Produces LLM-ready Markdown from any web page
Handles complex layouts such as multi-column pages; on infinite-scroll pages it processes whatever content is present in the HTML it receives
Preserves code blocks with language hints and math blocks with LaTeX notation
Lightweight with no headless browser dependency when given raw HTML
Created for the Obsidian Web Clipper, with a focus on note-clipping quality

Comparison with Similar Tools

Readability (Mozilla) — extracts content as HTML; Defuddle goes further to produce Markdown
Turndown — HTML-to-Markdown only; Defuddle adds content extraction and boilerplate removal
MarkItDown (Microsoft) — converts files and documents; Defuddle focuses on web page extraction
Jina Reader — cloud API for URL-to-text; Defuddle runs locally without API calls
Trafilatura — Python content extractor; Defuddle is TypeScript-native

FAQ

Q: Does Defuddle fetch the web page itself? A: The CLI fetches URLs, but the library accepts raw HTML strings, so you can bring your own fetcher.
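
One way to picture the "bring your own fetcher" split is to inject both the fetcher and the extractor as functions. This is a sketch under assumptions: clipUrl, Fetcher, Extractor, and the two stand-ins below are hypothetical names, and the stub extractor is a trivial regex placeholder, not Defuddle. In a real setup the extractor argument would wrap the library call and the fetcher would wrap whatever HTTP client you prefer.

```typescript
// Both roles are injected, so any HTTP client and any extractor can be used.
type Fetcher = (url: string) => Promise<string>;
interface Extracted {
  title: string;
  content: string;
}
type Extractor = (html: string, url: string) => Promise<Extracted>;

// clipUrl is a hypothetical helper: fetch first, then extract.
async function clipUrl(url: string, fetchHtml: Fetcher, extract: Extractor): Promise<Extracted> {
  const html = await fetchHtml(url);
  if (html.trim().length === 0) {
    throw new Error(`Empty response body for ${url}`);
  }
  return extract(html, url);
}

// Stand-in fetcher returning canned HTML; swap in a real HTTP client here.
const cannedFetcher: Fetcher = async () =>
  "<html><head><title>Demo</title></head><body><article><p>Hi</p></article></body></html>";

// Stand-in extractor; a real one would delegate to the extraction library.
const stubExtractor: Extractor = async (html) => {
  const title = /<title>(.*?)<\/title>/.exec(html)?.[1] ?? "";
  const body = /<article>([\s\S]*?)<\/article>/.exec(html)?.[1] ?? "";
  return { title, content: body.replace(/<[^>]+>/g, "").trim() };
};

clipUrl("https://example.com/demo", cannedFetcher, stubExtractor)
  .then((page) => console.log(page.title, "|", page.content)); // prints "Demo | Hi"
```

Keeping the fetcher separate makes it straightforward to add custom headers, caching, or retries without touching the extraction step.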

Q: Can Defuddle handle JavaScript-rendered pages? A: Defuddle works on the HTML it receives. For JS-rendered pages, pass the rendered DOM from a headless browser.

Q: How does Defuddle compare to Obsidian Web Clipper? A: Defuddle is the open-source extraction engine. The Obsidian Web Clipper browser extension uses it internally.

Q: Is Defuddle suitable for bulk scraping? A: It handles content extraction only, not crawling or scheduling. Pair it with a crawler like Crawlee for bulk workflows.
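
For small batches that do not justify a full crawler, a bounded-concurrency loop is one possible approach. The sketch below is a generic pattern, not part of Defuddle or Crawlee: extractAll, Outcome, and demoWorker are hypothetical names, and the worker is a stand-in for "fetch a page, then run the extractor".

```typescript
type Outcome<T> =
  | { url: string; ok: true; value: T }
  | { url: string; ok: false; error: string };

// Process URLs with at most `concurrency` workers in flight; one failure
// does not abort the batch. extractAll is a hypothetical helper name.
async function extractAll<T>(
  urls: string[],
  worker: (url: string) => Promise<T>,
  concurrency: number,
): Promise<Outcome<T>[]> {
  const results: Outcome<T>[] = new Array(urls.length);
  let next = 0; // index of the next unclaimed URL

  async function runLane(): Promise<void> {
    while (next < urls.length) {
      const i = next++; // claimed synchronously, so no two lanes share an index
      const url = urls[i];
      try {
        results[i] = { url, ok: true, value: await worker(url) };
      } catch (err) {
        results[i] = { url, ok: false, error: err instanceof Error ? err.message : String(err) };
      }
    }
  }

  const lanes = Math.max(1, Math.min(concurrency, urls.length));
  await Promise.all(Array.from({ length: lanes }, () => runLane()));
  return results;
}

// Stand-in worker; in practice this would fetch a page and run the extractor.
const demoWorker = async (url: string): Promise<string> => {
  if (url.includes("broken")) throw new Error("fetch failed");
  return `markdown for ${url}`;
};

extractAll(["https://example.com/a", "https://example.com/broken"], demoWorker, 2)
  .then((outcomes) => console.log(outcomes.map((o) => o.ok))); // prints [ true, false ]
```

Results keep the input order, and each failure is recorded alongside its URL so a later pass can retry only the pages that failed. A dedicated crawler adds what this loop lacks, such as link discovery, rate limiting, and persistent queues.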
