Scripts · Apr 16, 2026 · 3 min read

ArchiveBox — Self-Hosted Web Archiving Platform

ArchiveBox is an open-source, self-hosted web archiver that saves URLs as local HTML, PDF, screenshots, WARC, and more. Feed it bookmarks, browser history, or RSS feeds and it preserves each page for offline access.

Introduction

ArchiveBox preserves web content you care about before it disappears. It takes URLs from bookmarks, browser history, RSS feeds, or plain text and saves them in multiple formats — HTML, PDF, screenshot, WARC — so you always have a local copy, even when the original goes offline.

What ArchiveBox Does

  • Archives web pages in multiple formats: HTML, PDF, screenshot, WARC, media, and Git repos
  • Accepts input from bookmarks exports, browser history, RSS feeds, and plain URL lists
  • Provides a web UI for browsing, searching, and managing your archive
  • Extracts and saves embedded media including images, videos, audio, and documents
  • Schedules automatic archiving of RSS feeds and bookmark sources on a cron interval
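
As a rough illustration of the "plain URL lists" input path, the sketch below pulls links out of arbitrary text. It is not ArchiveBox's own parser: the function name `extract_urls`, the regular expression, and the trailing-punctuation rule are all assumptions made for this example.

```python
import re

# Simple pattern: http(s) scheme, then everything up to whitespace, quotes, or angle brackets.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_urls(text: str) -> list[str]:
    """Return the URLs found in free-form text, in order of appearance."""
    urls = []
    for match in URL_PATTERN.findall(text):
        # Drop punctuation that usually belongs to the sentence, not the URL.
        urls.append(match.rstrip(".,;:!?)"))
    return urls


if __name__ == "__main__":
    sample = 'Read https://example.com/post, then <a href="http://example.org/feed.xml">feed</a>.'
    for url in extract_urls(sample):
        print(url)
```

A real importer also has to handle format-specific exports (bookmark HTML, RSS XML), which this sketch deliberately ignores.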

Architecture Overview

ArchiveBox is a Python application built on Django, with SQLite as its default database. It orchestrates a suite of external tools (wget, headless Chrome, yt-dlp or youtube-dl in older releases, readability, and mercury-parser) to capture each page in several formats. Each snapshot is stored as a directory of files with a JSON index, making archives portable and tool-independent.
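
The directory-per-snapshot idea can be sketched in a few lines. This is a minimal illustration, not ArchiveBox's actual code or schema: the function names, the `index.json` fields, and the output file names are assumptions chosen for the example.

```python
import json
import tempfile
from pathlib import Path


def write_snapshot(root: Path, timestamp: str, url: str, outputs: dict[str, bytes]) -> Path:
    """Write one snapshot as a standalone folder: output files plus a JSON index.

    Hypothetical layout for illustration; not ArchiveBox's real schema.
    """
    snapshot_dir = root / timestamp
    snapshot_dir.mkdir(parents=True, exist_ok=True)

    # Each capture format becomes an ordinary file inside the snapshot folder.
    for filename, data in outputs.items():
        (snapshot_dir / filename).write_bytes(data)

    # The index records what was captured, so the folder describes itself.
    index = {"url": url, "timestamp": timestamp, "files": sorted(outputs)}
    (snapshot_dir / "index.json").write_text(json.dumps(index, indent=2))
    return snapshot_dir


def read_snapshot(snapshot_dir: Path) -> dict:
    """Load a snapshot's index using only the standard library."""
    return json.loads((snapshot_dir / "index.json").read_text())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        folder = write_snapshot(
            Path(tmp), "1713225600", "https://example.com",
            {"output.html": b"<html></html>", "output.pdf": b"%PDF-1.4"},
        )
        print(read_snapshot(folder)["files"])
```

Because the index is plain JSON next to plain files, any script or file browser can read a snapshot without the archiving tool installed, which is the portability point made above.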

Self-Hosting & Configuration

  • Deploy via Docker Compose, pip install, or Homebrew on macOS and Linux
  • Configure output formats, archiving depth, and tool preferences via ArchiveBox.conf
  • Set up scheduled imports from RSS feeds, Pinboard, Pocket, or browser bookmark exports
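  • SQLite is the default database; check the current ArchiveBox documentation before planning a larger collection around PostgreSQL, since support for other backends is not guaranteed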
  • Serve archives publicly or restrict access with Django authentication
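
To make the configuration step concrete, here is a small sketch that reads format toggles from INI-style text using Python's standard `configparser`. The section name and the sample values are assumptions for illustration, and the helper `enabled_methods` is hypothetical; verify option names such as `SAVE_WARC` against the documentation for your installed version.

```python
import configparser

# Sample config text. The INI-style layout, the section name, and the values
# are assumptions for illustration only.
SAMPLE_CONF = """
[ARCHIVE_METHOD_TOGGLES]
SAVE_WARC = False
SAVE_SCREENSHOT = False
SAVE_PDF = True
"""


def enabled_methods(conf_text: str, section: str = "ARCHIVE_METHOD_TOGGLES") -> list[str]:
    """Return the option names in `section` whose value parses as true."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep option names upper-case instead of lower-casing them
    parser.read_string(conf_text)
    return sorted(key for key in parser[section] if parser.getboolean(section, key))


if __name__ == "__main__":
    print(enabled_methods(SAMPLE_CONF))
```

Turning off the heavier formats this way is the same lever the FAQ mentions for reducing storage.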

Key Features

  • Multi-format preservation ensures content survives even if one format fails
  • Full-text search across all archived page content and metadata
  • Browser extension and bookmarklet for one-click archiving of any page
  • Portable archive format — each snapshot is a standalone folder of standard files
  • Deduplication and incremental archiving to save storage on repeated URLs
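
One way to picture the deduplication step is to compare incoming URLs against those already archived after light normalization. The sketch below is an assumption about how such a check could work, not a description of ArchiveBox's actual rules; `normalize` and `new_urls` are hypothetical helpers.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url: str) -> str:
    """Lower-case scheme and host and drop the fragment so trivially different URLs compare equal."""
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, ""))


def new_urls(incoming: list[str], already_archived: set[str]) -> list[str]:
    """Return incoming URLs that are not yet archived, keeping their original order."""
    seen = {normalize(url) for url in already_archived}
    fresh = []
    for url in incoming:
        key = normalize(url)
        if key not in seen:
            seen.add(key)  # also filters repeats within the same batch
            fresh.append(url)
    return fresh


if __name__ == "__main__":
    batch = ["https://Example.com/a#top", "https://example.com/a", "https://example.com/b"]
    print(new_urls(batch, {"https://example.com/b"}))
```

How aggressive the normalization should be is a design choice: stripping query strings, for instance, would merge pages that are genuinely different on many sites, so this sketch leaves them intact.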

Comparison with Similar Tools

  • Wallabag — Read-it-later app focused on article reading, not full multi-format archiving
  • SingleFile — Browser extension that saves single pages, but lacks batch processing and scheduling
  • HTTrack — Classic website copier for mirroring entire sites, but no PDF/screenshot/WARC support
  • Webrecorder/Conifer — WARC-focused archiving with replay, but requires more technical setup
  • Pocket — Cloud-based bookmarking service without a self-hosted option or multi-format local storage

FAQ

Q: How much storage does ArchiveBox use per page? A: A typical page with all formats enabled uses 5-50 MB. You can disable formats like screenshots or WARC to reduce storage significantly.
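
The 5-50 MB range turns into a quick back-of-the-envelope estimate. The helper below is a hypothetical sketch that simply multiplies the page count by that range, using decimal units (1 GB = 1000 MB); real usage depends heavily on which formats are enabled and how media-heavy the pages are.

```python
def estimate_storage_gb(pages: int, mb_low: float = 5.0, mb_high: float = 50.0) -> tuple[float, float]:
    """Return a (low, high) storage estimate in GB for the given number of pages."""
    return pages * mb_low / 1000, pages * mb_high / 1000


if __name__ == "__main__":
    low, high = estimate_storage_gb(1000)
    print(f"1000 pages: roughly {low:.0f} to {high:.0f} GB")
```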

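Q: Can I archive pages behind login walls? A: Yes. You can supply a browser cookies file (the COOKIES_FILE option) or point ArchiveBox at a logged-in Chrome profile (CHROME_USER_DATA_DIR) to archive authenticated content. Keep such archives private, since the saved pages may contain account-specific data.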

Q: Does ArchiveBox respect robots.txt? A: Do not assume so. ArchiveBox's default wget arguments have included robots=off, and the Chrome-based extractors do not consult robots.txt, so review the wget arguments in your configuration if you need fetches to honor it.

Q: Can I export my archive to another tool? A: Yes. Archives are stored as standard files (HTML, PDF, PNG, WARC) in plain directories that any tool or file browser can access directly.
