Introduction
Defuddle is a TypeScript library created by the maker of Obsidian that extracts the main content from an HTML page and converts it to clean Markdown. It removes navigation bars, sidebars, footers, ads, and other chrome, leaving only the article or document body. Defuddle is useful for web clipping, building read-it-later apps, and preparing web content for LLM ingestion.
What Defuddle Does
- Parses raw HTML and identifies the primary content region of a page
- Strips boilerplate elements like headers, footers, cookie banners, and ads
- Converts the remaining HTML to well-formatted Markdown
- Extracts metadata including title, author, published date, and description
- Handles math notation, code blocks, tables, and image references
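To make the metadata step concrete, here is a minimal sketch of reading title, author, published date, and description from `<meta>` tags. It is not Defuddle's implementation: the `PageMetadata` shape, the `attr` and `extractMetadata` helpers, the choice of tag names, and the sample HTML are all assumptions made for illustration, and a regex scan is far less robust than a real DOM parser.

```typescript
// Hypothetical result shape, for illustration only.
interface PageMetadata {
  title?: string;
  author?: string;
  published?: string;
  description?: string;
}

// Read one quoted attribute value from a single tag string.
function attr(tag: string, name: string): string | undefined {
  const pattern = new RegExp(`\\b${name}\\s*=\\s*(?:"([^"]*)"|'([^']*)')`, "i");
  const match = pattern.exec(tag);
  return match ? (match[1] ?? match[2]) : undefined;
}

function extractMetadata(html: string): PageMetadata {
  // Collect <meta name|property="..." content="..."> pairs, keyed in lower case.
  const meta = new Map<string, string>();
  const tags: string[] = html.match(/<meta\b[^>]*>/gi) ?? [];
  for (const tag of tags) {
    const key = attr(tag, "name") ?? attr(tag, "property");
    const value = attr(tag, "content");
    if (key !== undefined && value !== undefined) {
      meta.set(key.toLowerCase(), value);
    }
  }
  // Fall back to the <title> element when no Open Graph title is present.
  const titleTag = /<title[^>]*>([^<]*)<\/title>/i.exec(html);
  return {
    title: meta.get("og:title") ?? titleTag?.[1].trim(),
    author: meta.get("author"),
    published: meta.get("article:published_time"),
    description: meta.get("description") ?? meta.get("og:description"),
  };
}

// Made-up sample page.
const sampleHtml = `
<html><head>
  <title> Example Post </title>
  <meta name="author" content="Jane Doe">
  <meta property="article:published_time" content="2024-01-15">
  <meta name="description" content="A short summary of the post.">
</head><body></body></html>`;

const metadata = extractMetadata(sampleHtml);
console.log(metadata);
```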
Architecture Overview
Defuddle uses a scoring algorithm to evaluate DOM nodes based on text density, tag semantics, and structural patterns. It penalizes elements with navigation, sidebar, or footer class names and rewards article-like containers. After selecting the best content node, a Turndown-based converter transforms the HTML subtree to Markdown with custom rules for code, math, and tables. The library runs in Node.js and in the browser via its TypeScript build.
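The scoring idea can be sketched on a toy tree. This is a simplified illustration, not Defuddle's algorithm: the `SketchNode` shape, the class-name patterns, and the numeric weights are assumptions chosen for the example. Each node is scored by the non-link text in its direct paragraph children, with a bonus for article-like tags and a penalty for navigation, sidebar, or footer class names.

```typescript
// Hypothetical, simplified stand-in for a DOM node.
interface SketchNode {
  tag: string;
  className?: string;
  text?: string;        // text content, for leaf nodes
  linkChars?: number;   // how many of those characters sit inside links
  children?: SketchNode[];
}

const PENALIZED_CLASS = /nav|sidebar|footer/i;      // assumed patterns
const ARTICLE_LIKE = new Set(["article", "main"]);  // assumed tags

function scoreNode(node: SketchNode): number {
  let score = 0;
  // Text density: count non-link characters in direct <p> children.
  for (const child of node.children ?? []) {
    if (child.tag !== "p") continue;
    const chars = (child.text ?? "").length;
    score += Math.max(0, chars - (child.linkChars ?? 0));
  }
  // Tag semantics and class-name patterns (weights are arbitrary).
  if (ARTICLE_LIKE.has(node.tag)) score += 25;
  if (PENALIZED_CLASS.test(node.className ?? "")) score -= 50;
  return score;
}

// Breadth-first walk that keeps the highest-scoring node.
function pickContentNode(root: SketchNode): SketchNode {
  let best = root;
  let bestScore = scoreNode(root);
  const queue: SketchNode[] = [...(root.children ?? [])];
  while (queue.length > 0) {
    const node = queue.shift() as SketchNode;
    const s = scoreNode(node);
    if (s > bestScore) {
      best = node;
      bestScore = s;
    }
    queue.push(...(node.children ?? []));
  }
  return best;
}

const page: SketchNode = {
  tag: "body",
  children: [
    {
      tag: "div",
      className: "site-nav",
      children: [{ tag: "p", text: "Home About Contact", linkChars: 18 }],
    },
    {
      tag: "article",
      children: [
        { tag: "p", text: "Defuddle keeps the main body of a page and drops the surrounding chrome." },
        { tag: "p", text: "The result is converted to Markdown." },
      ],
    },
    {
      tag: "div",
      className: "footer",
      children: [{ tag: "p", text: "Copyright notice and legal links", linkChars: 10 }],
    },
  ],
};

const chosen = pickContentNode(page);
console.log(chosen.tag); // prints "article"
```

In this sketch the `body` node scores zero because it has no direct paragraphs, so the walk settles on the `article` container instead of the page root.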
Self-Hosting & Configuration
- Install via npm:
  npm install defuddle
- Use the CLI for quick extraction:
  npx defuddle <url>
- Import programmatically and pass an HTML string to the constructor
- Access result.title, result.author, result.markdown, and other metadata fields
- Configure output options to include or exclude images and links
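As an illustration of consuming those fields, the sketch below turns an extraction result into a clipped note with front matter. It does not call Defuddle: the `ExtractionResult` interface only mirrors the field names listed above and is an assumption about the result shape, and `toNote` and the sample values are hypothetical.

```typescript
// Assumed shape, mirroring the fields named in the list above.
interface ExtractionResult {
  title: string;
  author: string;
  markdown: string;
}

// Build a note: YAML-style front matter followed by the Markdown body.
function toNote(result: ExtractionResult): string {
  const frontMatter = [
    "---",
    // JSON.stringify quotes and escapes the values so colons and quotes stay safe.
    `title: ${JSON.stringify(result.title)}`,
    `author: ${JSON.stringify(result.author)}`,
    "---",
  ].join("\n");
  return `${frontMatter}\n\n${result.markdown.trim()}\n`;
}

// Made-up sample data standing in for a real extraction result.
const sample: ExtractionResult = {
  title: "Example: a post",
  author: "Jane Doe",
  markdown: "# Example\n\nBody text.\n\n",
};

const note = toNote(sample);
console.log(note);
```

Keeping the formatting step separate from extraction means the same result object can feed a note, a database row, or an LLM prompt without re-parsing the page.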
Key Features
- Produces LLM-ready Markdown from any web page
- Handles complex layouts such as multi-column pages, and infinite-scroll pages once their content is present in the supplied HTML
- Preserves code blocks with language hints and math blocks with LaTeX notation
- Lightweight with no headless browser dependency when given raw HTML
- Created by the Obsidian team with a focus on note-clipping quality
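Preserving a code block with its language hint can be sketched as follows. This is an illustrative approach, not Defuddle's converter rule: the helper names are hypothetical, and reading the hint from a `language-*` or `lang-*` class is an assumption about how the source page marks up its code. The fence is made one backtick longer than any backtick run inside the code so the block cannot be closed early.

```typescript
// Pull a language hint out of a class attribute such as "language-ts hljs".
function languageHint(className: string): string {
  const match = /(?:^|\s)(?:language|lang)-([\w+#-]+)/.exec(className);
  return match ? match[1] : "";
}

// Choose a fence longer than the longest backtick run in the code (minimum 3).
function fenceFor(code: string): string {
  const runs: string[] = code.match(/`+/g) ?? [];
  const longest = runs.reduce((max, run) => Math.max(max, run.length), 0);
  return "`".repeat(Math.max(3, longest + 1));
}

// Render the text of a <pre><code class="..."> element as a fenced Markdown block.
function codeToMarkdown(className: string, code: string): string {
  const fence = fenceFor(code);
  const body = code.replace(/\n+$/, ""); // drop trailing blank lines
  return `${fence}${languageHint(className)}\n${body}\n${fence}`;
}

const fenced = codeToMarkdown("language-ts hljs", "const x: number = 1;\n");
console.log(fenced);
```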
Comparison with Similar Tools
- Readability (Mozilla) — extracts content as HTML; Defuddle goes further to produce Markdown
- Turndown — HTML-to-Markdown only; Defuddle adds content extraction and boilerplate removal
- MarkItDown (Microsoft) — converts files and documents; Defuddle focuses on web page extraction
- Jina Reader — cloud API for URL-to-text; Defuddle runs locally without API calls
- Trafilatura — Python content extractor; Defuddle is TypeScript-native
FAQ
Q: Does Defuddle fetch the web page itself?
A: The CLI fetches URLs, but the library accepts raw HTML strings, so you can bring your own fetcher.
Q: Can Defuddle handle JavaScript-rendered pages?
A: Defuddle works on the HTML it receives. For JS-rendered pages, pass the rendered DOM from a headless browser.
Q: How does Defuddle compare to Obsidian Web Clipper?
A: Defuddle is the open-source extraction engine. The Obsidian Web Clipper browser extension uses it internally.
Q: Is Defuddle suitable for bulk scraping?
A: It is designed for content extraction. Pair it with a crawler like Crawlee for bulk workflows.