Scripts · May 23, 2026 · 3 min read

Defuddle — Extract Main Content from Any Web Page as Markdown

TypeScript library that strips navigation, ads, and boilerplate from HTML pages, returning clean Markdown of the main content.

Agent-ready

This asset can be read and installed directly by agents.

TokRepo exposes a universal CLI command, an install contract, JSON metadata, an adapter-specific plan, and raw content so that agents can assess compatibility, risk, and next steps.

  • Native · 98/100
  • Policy: allow
  • Agent surface: any MCP/CLI agent
  • Type: Skill
  • Installation: Single
  • Trust: Established
  • Entry: Defuddle Overview
  • Universal CLI command: npx tokrepo install 48821de0-56e6-11f1-9bc6-00163e2b0d79

Introduction

Defuddle is a TypeScript library, created by Steph Ango (kepano, CEO of Obsidian) for the Obsidian Web Clipper, that extracts the main content from an HTML page and can convert it to clean Markdown. It removes navigation bars, sidebars, footers, ads, and other page chrome, leaving only the article or document body. This makes it useful for web clipping, building read-it-later apps, and preparing web content for LLM ingestion.

What Defuddle Does

  • Parses raw HTML and identifies the primary content region of a page
  • Strips boilerplate elements like headers, footers, cookie banners, and ads
  • Converts the remaining HTML to well-formatted Markdown
  • Extracts metadata including title, author, published date, and description
  • Handles math notation, code blocks, tables, and image references

Architecture Overview

Defuddle scores DOM nodes on text density, tag semantics, and structural patterns. It penalizes elements whose class names suggest navigation, sidebars, or footers, and rewards article-like containers. Once the highest-scoring content node is selected, a Turndown-based converter turns that HTML subtree into Markdown, using custom rules for code, math, and tables. The library is written in TypeScript and runs both in Node.js and in the browser.
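To make the scoring idea concrete, here is a minimal sketch of a content-scoring heuristic of the kind described above. It is an illustration only, not Defuddle's actual algorithm: the CandidateNode shape, the hint patterns, the weights, and the scoreNode and pickBest helpers are all hypothetical, and a real implementation would walk live DOM elements rather than plain objects.

```typescript
// Hypothetical, simplified node shape (assumption); a real extractor
// would inspect actual DOM elements instead.
interface CandidateNode {
  tag: string;
  className: string;
  textLength: number;     // characters of visible text inside the node
  linkTextLength: number; // how many of those characters sit inside links
}

// Illustrative class-name hints; the word lists are assumptions.
const NEGATIVE_HINTS = /\b(nav|navbar|sidebar|footer|menu|promo|ads?)\b/i;
const POSITIVE_HINTS = /\b(article|content|post|entry|main)\b/i;

function scoreNode(node: CandidateNode): number {
  // Text density: more text scores higher, capped so length alone cannot win.
  let score = Math.min(node.textLength / 100, 30);

  // Link-heavy blocks are usually navigation, so scale the text score down.
  const linkDensity =
    node.textLength === 0 ? 0 : node.linkTextLength / node.textLength;
  score *= 1 - linkDensity;

  // Tag semantics: reward article-like containers, penalize page chrome.
  if (node.tag === "article" || node.tag === "main") score += 25;
  if (node.tag === "nav" || node.tag === "footer" || node.tag === "aside") score -= 25;

  // Class-name patterns.
  if (POSITIVE_HINTS.test(node.className)) score += 15;
  if (NEGATIVE_HINTS.test(node.className)) score -= 15;

  return score;
}

// Return the highest-scoring candidate, or undefined for an empty list.
function pickBest(nodes: CandidateNode[]): CandidateNode | undefined {
  let best: CandidateNode | undefined;
  let bestScore = -Infinity;
  for (const node of nodes) {
    const score = scoreNode(node);
    if (score > bestScore) {
      best = node;
      bestScore = score;
    }
  }
  return best;
}

const candidates: CandidateNode[] = [
  { tag: "nav", className: "site-nav", textLength: 200, linkTextLength: 190 },
  { tag: "div", className: "post-content", textLength: 3000, linkTextLength: 150 },
  { tag: "footer", className: "footer", textLength: 300, linkTextLength: 100 },
];

console.log(pickBest(candidates)?.className); // prints "post-content"
```

The link-density factor is what keeps a long navigation menu from outscoring a short article: its text is almost entirely link text, so its density score collapses before the tag and class penalties are even applied.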

Installation & Configuration

  • Install via npm: npm install defuddle
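  • Use the CLI for quick extraction: npx defuddle parse <url> --markdown
  • Import it programmatically: in the browser, pass a Document to the constructor and call parse(); in Node.js, pass the HTML to the function exported from defuddle/node
  • Access result.title, result.author, result.content, and other metadata fields; result.content holds Markdown when the markdown option is enabled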
  • Configure output options to include or exclude images and links
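Once extraction has produced a result object, its metadata fields can be folded into a clipped note. The sketch below is a hedged illustration of that step and does not call Defuddle itself: the ExtractResult interface is an assumed shape (title, author, published, description, and a content field that already holds Markdown), and toNote and yamlString are hypothetical helper names. Check the library's own typings for the real field names before relying on them.

```typescript
// Assumed result shape for illustration only; the real typings may differ.
interface ExtractResult {
  title: string;
  author?: string;
  published?: string;
  description?: string;
  content: string; // Markdown body, assuming Markdown output was requested
}

// JSON string syntax is also valid YAML double-quoted string syntax,
// so JSON.stringify is a simple way to escape front-matter values.
function yamlString(value: string): string {
  return JSON.stringify(value);
}

// Build a Markdown note with YAML front matter from an extraction result.
function toNote(result: ExtractResult, sourceUrl: string): string {
  const fields: Array<[string, string | undefined]> = [
    ["title", result.title],
    ["author", result.author],
    ["published", result.published],
    ["description", result.description],
    ["source", sourceUrl],
  ];

  const frontMatter = fields
    // Skip metadata the extractor could not find.
    .filter((pair): pair is [string, string] => Boolean(pair[1]))
    .map(([key, value]) => `${key}: ${yamlString(value)}`)
    .join("\n");

  return `---\n${frontMatter}\n---\n\n${result.content.trim()}\n`;
}

const note = toNote(
  { title: "Hello", author: "Jane Doe", content: "# Hello\n\nBody text.\n" },
  "https://example.com/article",
);
console.log(note);
```

Keeping the note builder separate from the extractor means the same function works whether the result came from the CLI's JSON output, the Node.js entry point, or a browser extension.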

Key Features

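  • Produces LLM-ready Markdown from most article-style web pages
  • Handles complex layouts such as multi-column pages; content that is loaded lazily by JavaScript, as on infinite-scroll pages, must already be present in the HTML it receives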
  • Preserves code blocks with language hints and math blocks with LaTeX notation
  • Lightweight with no headless browser dependency when given raw HTML
  • Created by Steph Ango (kepano) of Obsidian for the Obsidian Web Clipper, with a focus on note-clipping quality

Comparison with Similar Tools

  • Readability (Mozilla) — extracts content as HTML; Defuddle goes further to produce Markdown
  • Turndown — HTML-to-Markdown only; Defuddle adds content extraction and boilerplate removal
  • MarkItDown (Microsoft) — converts files and documents; Defuddle focuses on web page extraction
  • Jina Reader — cloud API for URL-to-text; Defuddle runs locally without API calls
  • Trafilatura — Python content extractor; Defuddle is TypeScript-native

FAQ

Q: Does Defuddle fetch the web page itself? A: The CLI fetches URLs, but the library accepts raw HTML strings, so you can bring your own fetcher.

Q: Can Defuddle handle JavaScript-rendered pages? A: Defuddle works on the HTML it receives. For JS-rendered pages, pass the rendered DOM from a headless browser.

Q: How does Defuddle compare to Obsidian Web Clipper? A: Defuddle is the open-source extraction engine. The Obsidian Web Clipper browser extension uses it internally.

Q: Is Defuddle suitable for bulk scraping? A: It is designed for content extraction, not crawling. Pair it with a crawler like Crawlee for bulk workflows.
