Scripts · Apr 16, 2026 · 3 min read

ArchiveBox — Self-Hosted Web Archiving Platform

ArchiveBox is an open-source self-hosted web archiver that saves URLs as local HTML, PDF, screenshots, WARC, and more. Feed it bookmarks, browser history, or RSS feeds and it preserves everything for offline access.

TL;DR
ArchiveBox preserves web pages as HTML, PDF, screenshots, and WARC for offline access.
§01

What it is

ArchiveBox is an open-source self-hosted web archiver that preserves web content before it disappears. It takes URLs from bookmarks, browser history, RSS feeds, or plain text and saves them in multiple formats: HTML, PDF, screenshot, WARC, media files, and Git repos.

ArchiveBox is built for anyone who wants a personal internet archive. It runs as a Python/Django application with a web UI for browsing, searching, and managing your archive. Each snapshot is stored as a directory of files with a JSON index, making archives portable and tool-independent.
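
As a rough illustration of that on-disk layout, the shell sketch below lists snapshot directories by looking for their JSON index. It assumes the data directory contains an archive/ folder with one subdirectory per snapshot, each holding an index.json file. The function name list_snapshots is hypothetical, and the exact layout may differ between ArchiveBox versions.

```shell
# Print every snapshot directory under DATA_DIR/archive that has an index.json.
# Assumed layout: DATA_DIR/archive/<snapshot-id>/index.json (verify on your install).
list_snapshots() {
  data_dir=$1
  for index in "$data_dir"/archive/*/index.json; do
    # An unmatched glob stays literal, so confirm the file really exists.
    if [ -f "$index" ]; then
      dirname "$index"
    fi
  done
}
```

Running `list_snapshots ./data | wc -l` would then give a quick snapshot count without starting the web UI.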

§02

How it saves time or tokens

ArchiveBox automates the tedious work of saving web pages by hand. Instead of right-clicking 'Save as' on each page, you feed it a list of URLs (from a bookmarks export, browser history, or RSS feeds) and it captures each one in several formats in a single run. Scheduled archiving via cron keeps your RSS feeds and bookmark sources preserved automatically. Because every page is stored in multiple formats, you still have a readable copy when one of them fails to render.
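
One way to build such a URL list is sketched below: a small shell helper that pulls the unique http(s) links out of any text file. The name extract_urls and the file notes.txt are hypothetical, and the pattern is deliberately simple, so trailing punctuation attached to a link is kept as part of the URL.

```shell
# Print the unique http(s) URLs found in a text file, one per line.
extract_urls() {
  # -o prints only the matching part; the bracket expression stops at
  # whitespace, double quotes, and angle brackets.
  grep -Eo 'https?://[^[:space:]"<>]+' "$1" | LC_ALL=C sort -u
}

# Example (not run here): pipe the list to a running container on stdin.
# extract_urls notes.txt | docker compose exec -T archivebox archivebox add
```

The commented example relies on `archivebox add` reading URLs from standard input, the same mechanism as importing a file with `<`.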

§03

How to use

  1. Start ArchiveBox with Docker Compose:
     curl -fsSL https://docker-compose.archivebox.io -o docker-compose.yml
     docker compose up -d
  2. Add URLs to archive:
     docker compose exec archivebox archivebox add 'https://example.com'
     docker compose exec archivebox archivebox add --depth=1 'https://news.ycombinator.com'
  3. Browse your archive through the web UI at http://localhost:8000 and search by title, URL, or content.
§04

Example

Scheduled archiving of an RSS feed with cron:

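# Add to crontab: archive new items from the HN RSS feed every hour.
# A crontab entry must stay on one line; cron does not honor backslash continuations.
# --depth=1 follows the links inside the feed; --depth=0 would snapshot only the feed URL itself.
0 * * * * cd /path/to/archivebox && docker compose exec -T archivebox archivebox add --depth=1 'https://hnrss.org/newest?points=100'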

# Archive all links from your Pinboard bookmarks (--depth=1 follows the links in the feed)
docker compose exec archivebox archivebox add \
  --depth=1 \
  'https://feeds.pinboard.in/json/v1/posts/all?auth_token=user:TOKEN'

# Import a Pocket HTML export (-T disables the TTY so stdin redirection works)
docker compose exec -T archivebox archivebox add \
  --parser=pocket_html < pocket_export.html
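
When several feeds need the same treatment, a small wrapper script can keep the crontab entry short. The sketch below is one possible shape, not an official ArchiveBox tool: the function archive_feeds, the ARCHIVEBOX_CMD override, and the feeds file format (one URL per line, # for comments) are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical wrapper: archive every feed URL listed in a text file.
# ARCHIVEBOX_CMD is an assumed override hook, so the loop can be dry-run with `echo`.
ARCHIVEBOX_CMD="${ARCHIVEBOX_CMD:-docker compose exec -T archivebox archivebox}"

archive_feeds() {
  feeds_file=$1
  # Drop blank lines and comment lines, then handle one feed per line.
  grep -Ev '^[[:space:]]*(#|$)' "$feeds_file" | while IFS= read -r feed; do
    # Unquoted on purpose: the command string must split into separate words.
    # stdin comes from /dev/null so the command cannot swallow the feed list.
    $ARCHIVEBOX_CMD add --depth=1 "$feed" < /dev/null
  done
}
```

Setting `ARCHIVEBOX_CMD=echo` before calling `archive_feeds feeds.txt` prints the commands instead of running them, which is a cheap way to check the list before handing the script to cron.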
§05

Common pitfalls

  • Running ArchiveBox without sufficient disk space causes archives to fail silently. Monitor disk usage, especially when archiving media-heavy sites with --depth=1.
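  • The default SQLite database works for small archives but can slow down with tens of thousands of snapshots. ArchiveBox is designed around SQLite, so confirm in the project documentation that an alternative backend such as PostgreSQL is supported before planning a migration.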
  • Archiving dynamic, JavaScript-heavy sites without headless Chrome installed produces incomplete snapshots. Ensure Chromium is available in your Docker or host environment.
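
For the disk-space pitfall, a guard along these lines could run before each archiving job. It is a minimal sketch: the function name has_free_space and the 5120 MB threshold in the usage comment are arbitrary illustrations, and the column parsing assumes a mount whose device name contains no spaces.

```shell
# Succeed (exit 0) only when the filesystem holding DIR has at least MIN_MB free.
has_free_space() {
  dir=$1
  min_mb=$2
  # df -P prints one POSIX-format line per filesystem; with -k, column 4 is available KB.
  avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
  [ "$((avail_kb / 1024))" -ge "$min_mb" ]
}

# Example (not run here): skip the job when less than about 5 GB is free.
# has_free_space ./data 5120 && docker compose exec -T archivebox archivebox add 'https://example.com'
```

Because the function only returns an exit status, it composes with `&&` in a crontab line or a wrapper script without extra plumbing.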

Frequently Asked Questions

What archive formats does ArchiveBox support?

ArchiveBox saves pages as static HTML (via wget), PDF (via headless Chrome), PNG screenshots, WARC (web archive), plain text, Git repos, audio and video files, and readability-extracted article text. Each URL is saved in every available format in the same run.

Can ArchiveBox archive entire websites?

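Only partially. The --depth=1 flag archives the page itself plus every page it links to, one hop out; it does not crawl a whole site recursively. Even at depth 1 the snapshot count can grow quickly on link-heavy pages, so watch disk usage. Check archivebox add --help for the depth values your version accepts.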

How does ArchiveBox store data?

Each snapshot is stored as a directory containing the archived files and a JSON index. This makes archives portable and tool-independent. You can browse archives offline by opening the HTML files directly.

Can I search my ArchiveBox archive?

Yes. The web UI provides full-text search across all archived content. You can search by URL, page title, tags, or content within the archived pages.

Does ArchiveBox support scheduled archiving?

Yes. Set up a cron job to run archivebox add with RSS feed URLs or bookmark export URLs on a schedule. ArchiveBox deduplicates URLs automatically, so repeated runs only archive new content.
