ArchiveBox — Self-Hosted Web Archiving Platform
ArchiveBox is an open-source self-hosted web archiver that saves URLs as local HTML, PDF, screenshots, WARC, and more. Feed it bookmarks, browser history, or RSS feeds and it preserves everything for offline access.
What it is
ArchiveBox is an open-source self-hosted web archiver that preserves web content before it disappears. It takes URLs from bookmarks, browser history, RSS feeds, or plain text and saves them in multiple formats: HTML, PDF, screenshot, WARC, media files, and Git repos.
ArchiveBox is built for anyone who wants a personal internet archive. It runs as a Python/Django application with a web UI for browsing, searching, and managing your archive. Each snapshot is stored as a directory of files with a JSON index, making archives portable and tool-independent.
How it saves time or tokens
ArchiveBox automates the tedious work of saving web pages manually. Instead of right-clicking 'Save as' on each page, you feed it a list of URLs (from bookmarks export, browser history, or RSS feeds) and it captures everything in multiple formats simultaneously. Scheduled archiving via cron means your RSS feeds and bookmark sources are preserved automatically. The multi-format approach ensures you always have a readable copy even when the original format fails to render.
How to use
- Start ArchiveBox with Docker Compose:
curl -fsSL https://docker-compose.archivebox.io -o docker-compose.yml
docker compose up -d
- Add URLs to archive:
docker compose exec archivebox archivebox add 'https://example.com'
docker compose exec archivebox archivebox add --depth=1 'https://news.ycombinator.com'
- Browse your archive through the web UI at http://localhost:8000, search by title, URL, or content.
Example
Scheduled archiving of an RSS feed with cron:
# Add to crontab: archive new items from HN RSS every hour
0 * * * * cd /path/to/archivebox && \
docker compose exec -T archivebox archivebox add \
--depth=0 \
'https://hnrss.org/newest?points=100'
# Archive all links from your Pinboard bookmarks
docker compose exec archivebox archivebox add \
'https://feeds.pinboard.in/json/v1/posts/all?auth_token=user:TOKEN'
# Import browser history
docker compose exec archivebox archivebox add \
--parser=pocket_html < pocket_export.html
Related on TokRepo
- Self-hosted tools — More self-hostable tools for data preservation and infrastructure.
- Automation tools — Browse automation frameworks for scheduled data tasks.
Common pitfalls
- Running ArchiveBox without sufficient disk space causes archives to fail silently. Monitor disk usage, especially when archiving media-heavy sites with --depth=1.
- The default SQLite database works for small archives but slows down with tens of thousands of snapshots. Consider switching to PostgreSQL for larger installations.
- Archiving dynamic JavaScript-heavy sites without Chrome headless installed produces incomplete snapshots. Ensure Chromium is available in your Docker or host environment.
Frequently Asked Questions
ArchiveBox saves pages as static HTML (via wget), PDF (via Chrome headless), screenshot PNG, WARC (web archive), plain text, Git repos, audio and video files, and readability-extracted article text. Each URL gets saved in all available formats simultaneously.
Yes. Use the --depth=1 flag to follow all links on a page and archive them too. Be cautious with depth values above 1, as the number of pages grows exponentially.
Each snapshot is stored as a directory containing the archived files and a JSON index. This makes archives portable and tool-independent. You can browse archives offline by opening the HTML files directly.
Yes. The web UI provides full-text search across all archived content. You can search by URL, page title, tags, or content within the archived pages.
Yes. Set up a cron job to run archivebox add with RSS feed URLs or bookmark export URLs on a schedule. ArchiveBox deduplicates URLs automatically, so repeated runs only archive new content.
Citations (3)
- ArchiveBox GitHub— ArchiveBox is an open-source web archiving tool
- ArchiveBox Documentation— ArchiveBox setup and configuration guide
- IIPC WARC Specification— WARC web archive format specification
Related on TokRepo
Discussion
Related Assets
NAPI-RS — Build Node.js Native Addons in Rust
Write high-performance Node.js native modules in Rust with automatic TypeScript type generation and cross-platform prebuilt binaries.
Mamba — Fast Cross-Platform Package Manager
A drop-in conda replacement written in C++ that resolves environments in seconds instead of minutes.
Plasmo — The Browser Extension Framework
Build, test, and publish browser extensions for Chrome, Firefox, and Edge using React or Vue with hot-reload and automatic manifest generation.