Introduction
ArchiveBox preserves web content you care about before it disappears. It takes URLs from bookmarks, browser history, RSS feeds, or plain text and saves them in multiple formats — HTML, PDF, screenshot, WARC — so you always have a local copy, even when the original goes offline.
What ArchiveBox Does
- Archives web pages in multiple formats: HTML, PDF, screenshot, WARC, media, and Git repos
- Accepts input from bookmarks exports, browser history, RSS feeds, and plain URL lists
- Provides a web UI for browsing, searching, and managing your archive
- Extracts and saves embedded media including images, videos, audio, and documents
- Schedules automatic archiving of RSS feeds and bookmark sources on a cron interval
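As a rough illustration of the plain-URL-list input mentioned above, the following Python sketch pulls URLs out of free-form text and drops repeats. It is not ArchiveBox's own parser; the function name extract_urls and the regular expression are assumptions made for this example.

```python
import re

# Hypothetical pattern for this sketch: matches http(s) URLs up to
# whitespace or a common closing delimiter.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_urls(text: str) -> list[str]:
    """Return the unique URLs found in text, in first-seen order."""
    seen = set()
    urls = []
    for match in URL_PATTERN.findall(text):
        url = match.rstrip(".,;")  # trim trailing sentence punctuation
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


if __name__ == "__main__":
    sample = (
        "Read https://example.com/a, then https://example.com/b "
        "and https://example.com/a."
    )
    for url in extract_urls(sample):
        print(url)
```

A list cleaned this way could then be handed to the archivebox add command, one URL per line.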
Architecture Overview
ArchiveBox is a Python application built on Django, with a SQLite database by default. It orchestrates a suite of external tools (wget, headless Chrome, youtube-dl or yt-dlp depending on the release, readability, mercury-parser) to capture each page in several formats. Each snapshot is stored as a directory of files with a JSON index, so archives stay portable and readable without ArchiveBox itself.
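Because each snapshot is a directory with a JSON index, it can be inspected with ordinary scripting. The sketch below assumes a layout of one index.json per snapshot folder and assumes the index carries url and title keys; both should be checked against your own archive, and the helper name load_snapshot_indexes is hypothetical.

```python
import json
import tempfile
from pathlib import Path


def load_snapshot_indexes(archive_dir: Path) -> list[dict]:
    """Load every index.json found one level below archive_dir."""
    indexes = []
    for index_path in sorted(archive_dir.glob("*/index.json")):
        with index_path.open(encoding="utf-8") as fh:
            indexes.append(json.load(fh))
    return indexes


if __name__ == "__main__":
    # Build a throwaway snapshot folder so the example runs on its own.
    with tempfile.TemporaryDirectory() as tmp:
        snapshot = Path(tmp) / "1700000000.0"  # placeholder folder name
        snapshot.mkdir()
        (snapshot / "index.json").write_text(
            json.dumps({"url": "https://example.com", "title": "Example"}),
            encoding="utf-8",
        )
        for entry in load_snapshot_indexes(Path(tmp)):
            # "url" and "title" are assumed keys; .get() tolerates their absence.
            print(entry.get("url"), "-", entry.get("title"))
```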
Self-Hosting & Configuration
- Deploy via Docker Compose, pip install, or Homebrew on macOS and Linux
- Configure output formats, archiving depth, and tool preferences via ArchiveBox.conf
- Set up scheduled imports from RSS feeds, Pinboard, Pocket, or browser bookmark exports
- Use the default SQLite database; other backends such as PostgreSQL are not officially supported, so check the current documentation before planning around one
- Serve archives publicly or restrict access with Django authentication
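To make the configuration bullet above more concrete, here is a minimal sketch that reads an INI-style config and reports which output formats are switched on. The section name and the SAVE_* keys in the sample are illustrative assumptions, as is the helper name enabled_formats; consult the configuration reference for the names your version actually uses.

```python
import configparser

# Illustrative file contents; section and key names are assumptions
# for this sketch, not a verified ArchiveBox.conf.
SAMPLE_CONF = """
[ARCHIVE_METHOD_TOGGLES]
SAVE_SCREENSHOT = False
SAVE_WARC = False
SAVE_PDF = True
"""


def enabled_formats(conf_text: str, section: str = "ARCHIVE_METHOD_TOGGLES") -> list[str]:
    """Return the SAVE_* keys in the given section that are switched on."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep key case instead of lowercasing
    parser.read_string(conf_text)
    if not parser.has_section(section):
        return []
    return [
        key
        for key in parser.options(section)
        if key.startswith("SAVE_") and parser.getboolean(section, key)
    ]


if __name__ == "__main__":
    print(enabled_formats(SAMPLE_CONF))
```

Reading the file with a general-purpose parser like this is only for inspection; changing settings is better done through ArchiveBox's own configuration commands.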
Key Features
- Multi-format preservation ensures content survives even if one format fails
- Full-text search across all archived page content and metadata
- Browser extension and bookmarklet for one-click archiving of any page
- Portable archive format — each snapshot is a standalone folder of standard files
- Deduplication and incremental archiving to save storage on repeated URLs
Comparison with Similar Tools
- Wallabag — Read-it-later app focused on article reading, not full multi-format archiving
- SingleFile — Browser extension that saves single pages, but lacks batch processing and scheduling
- HTTrack — Classic website copier for mirroring entire sites, but no PDF/screenshot/WARC support
- Webrecorder/Conifer — WARC-focused archiving with replay, but requires more technical setup
- Pocket — Cloud-based bookmarking without self-hosted option or multi-format local storage
FAQ
Q: How much storage does ArchiveBox use per page? A: A typical page with all formats enabled uses 5-50 MB. You can disable formats like screenshots or WARC to reduce storage significantly.
Q: Can I archive pages behind login walls? A: Yes. You can configure browser cookies or use a logged-in Chrome profile to archive authenticated content.
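Q: Does ArchiveBox respect robots.txt? A: Not for wget fetches by default. ArchiveBox's default wget arguments include the option -e robots=off, which tells wget to ignore robots.txt. You can change the wget arguments in configuration if you want robots.txt honored; check the defaults in your installed version before relying on either behavior.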
Q: Can I export my archive to another tool? A: Yes. Archives are stored as standard files (HTML, PDF, PNG, WARC) in plain directories that any tool or file browser can access directly.
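Since snapshots are plain directories, the per-page storage figure from the FAQ can be checked against a real archive with standard file tools. The sketch below sums file sizes per snapshot folder; the helper name snapshot_sizes_mb and the file names in the demo are placeholders, and the layout of one folder per snapshot is an assumption.

```python
import os
import tempfile
from pathlib import Path


def snapshot_sizes_mb(archive_dir: Path) -> dict[str, float]:
    """Map each snapshot folder name to the total size of its files in MB."""
    sizes = {}
    for snapshot in sorted(p for p in archive_dir.iterdir() if p.is_dir()):
        total_bytes = 0
        for root, _dirs, files in os.walk(snapshot):
            for name in files:
                total_bytes += (Path(root) / name).stat().st_size
        sizes[snapshot.name] = total_bytes / 1_000_000  # decimal megabytes
    return sizes


if __name__ == "__main__":
    # Create a fake snapshot with two placeholder files so the demo is self-contained.
    with tempfile.TemporaryDirectory() as tmp:
        snap = Path(tmp) / "1700000000.0"
        snap.mkdir()
        (snap / "output.pdf").write_bytes(b"\0" * 2_000_000)
        (snap / "screenshot.png").write_bytes(b"\0" * 500_000)
        for name, size in snapshot_sizes_mb(Path(tmp)).items():
            print(f"{name}: {size:.1f} MB")
```

Running this before and after disabling a format gives a direct measure of how much space that format costs.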