ArchiveBox — Self-Hosted Web Archiving Platform

Introduction

ArchiveBox preserves web content you care about before it disappears. It takes URLs from bookmarks, browser history, RSS feeds, or plain text and saves them in multiple formats — HTML, PDF, screenshot, WARC — so you always have a local copy, even when the original goes offline.

What ArchiveBox Does

Archives web pages in multiple formats: HTML, PDF, screenshot, WARC, media, and Git repos
Accepts input from bookmarks exports, browser history, RSS feeds, and plain URL lists
Provides a web UI for browsing, searching, and managing your archive
Extracts and saves embedded media including images, videos, audio, and documents
Schedules automatic archiving of RSS feeds and bookmark sources on a cron interval
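As a rough illustration of the "plain URL lists" input path, the sketch below pulls http(s) links out of free-form text and drops repeats. The regular expression and the extract_urls helper are hypothetical and written for this example; ArchiveBox's own importers are more thorough.

```python
import re

# Hypothetical pattern for illustration only; real importers handle more cases.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_urls(text: str) -> list[str]:
    """Return the unique http(s) URLs found in text, in first-seen order."""
    seen = set()
    urls = []
    for match in URL_PATTERN.finditer(text):
        # Trim punctuation that usually belongs to the sentence, not the URL.
        url = match.group(0).rstrip(".,;:!?)")
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


sample = """
Read later: https://example.com/article, and also
https://example.org/feed.xml (RSS). Duplicate: https://example.com/article
"""
print(extract_urls(sample))
```

A list produced this way could be saved to a text file and handed to the archiver as one batch.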

Architecture Overview

ArchiveBox is a Python application built on Django and backed by a SQLite database. It orchestrates a suite of external tools (wget, headless Chrome, yt-dlp as the successor to youtube-dl, readability, mercury-parser) to capture each page in several formats. Each snapshot is stored as a directory of files with a JSON index, making archives portable and tool-independent.
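The "directory of files with a JSON index" idea can be sketched with the standard library alone. The folder name, file names, and index fields below are assumptions chosen for illustration and are not ArchiveBox's actual schema; the point is only that such a snapshot can be read back without any special tooling.

```python
import json
import tempfile
from pathlib import Path


def write_snapshot(root: Path, snapshot_id: str, url: str,
                   outputs: dict[str, bytes]) -> Path:
    """Create root/<snapshot_id>/ holding the captured files plus index.json."""
    folder = root / snapshot_id
    folder.mkdir(parents=True, exist_ok=True)
    for filename, data in outputs.items():
        (folder / filename).write_bytes(data)
    # Hypothetical index layout: just enough to describe the folder contents.
    index = {"id": snapshot_id, "url": url, "files": sorted(outputs)}
    (folder / "index.json").write_text(json.dumps(index, indent=2), encoding="utf-8")
    return folder


def read_snapshot(folder: Path) -> dict:
    """Load the JSON index of a snapshot folder."""
    return json.loads((folder / "index.json").read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    folder = write_snapshot(
        Path(tmp), "1700000000", "https://example.com",
        {"page.html": b"<html></html>", "page.pdf": b"%PDF-"},
    )
    snapshot = read_snapshot(folder)
    print(snapshot["url"], snapshot["files"])
```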

Self-Hosting & Configuration

Deploy via Docker Compose, pip install, or Homebrew on macOS and Linux
Configure output formats, archiving depth, and tool preferences via ArchiveBox.conf
Set up scheduled imports from RSS feeds, Pinboard, Pocket, or browser bookmark exports
Keep the index in the bundled SQLite database, the only officially supported backend
Serve archives publicly or restrict access with Django authentication
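ArchiveBox.conf is an INI-style file, so Python's configparser can inspect one. The section names and keys in this sample are assumptions for illustration; check them against the ArchiveBox configuration documentation before relying on them.

```python
import configparser

# Section and key names here are assumed for the example, not authoritative.
SAMPLE_CONF = """
[ARCHIVE_METHOD_TOGGLES]
SAVE_PDF = True
SAVE_SCREENSHOT = False
SAVE_WARC = False

[ARCHIVE_METHOD_OPTIONS]
TIMEOUT = 120
"""


def load_conf(text: str) -> configparser.ConfigParser:
    """Parse INI-style config text while keeping key names case-sensitive."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # the default would lowercase every key
    parser.read_string(text)
    return parser


conf = load_conf(SAMPLE_CONF)
toggles = conf["ARCHIVE_METHOD_TOGGLES"]
enabled = [key for key in toggles if toggles.getboolean(key)]
timeout = conf.getint("ARCHIVE_METHOD_OPTIONS", "TIMEOUT")
print(enabled, timeout)
```

Turning off the heavier formats in a file like this is the usual lever for trading completeness against disk usage.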

Key Features

Multi-format preservation ensures content survives even if one format fails
Full-text search across all archived page content and metadata
Browser extension and bookmarklet for one-click archiving of any page
Portable archive format — each snapshot is a standalone folder of standard files
Deduplication and incremental archiving to save storage on repeated URLs
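One way to picture deduplication on repeated URLs is to compare a normalized form of each incoming link against what is already stored. The normalization rules and helper names below are hypothetical and are not taken from ArchiveBox's code; they show the general technique only.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url: str) -> str:
    """Hypothetical rule set: lowercase scheme and host, drop fragment and trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def new_urls(incoming: list[str], already_archived: set[str]) -> list[str]:
    """Return the incoming URLs whose normalized form has not been seen yet."""
    seen = set(already_archived)
    fresh = []
    for url in incoming:
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            fresh.append(url)
    return fresh


archived = {normalize("https://example.com/post")}
batch = [
    "https://Example.com/post/",
    "https://example.com/post#comments",
    "https://example.com/other",
]
print(new_urls(batch, archived))
```

Only the last link in the batch is treated as new, because the first two normalize to an address that is already archived.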

Comparison with Similar Tools

Wallabag — Read-it-later app focused on article reading, not full multi-format archiving
SingleFile — Browser extension that saves single pages, but lacks batch processing and scheduling
HTTrack — Classic website copier for mirroring entire sites, but no PDF/screenshot/WARC support
Webrecorder/Conifer — WARC-focused archiving with replay, but requires more technical setup
Pocket — Cloud-based bookmarking without self-hosted option or multi-format local storage

FAQ

Q: How much storage does ArchiveBox use per page? A: A typical page with all formats enabled uses 5-50 MB. You can disable formats like screenshots or WARC to reduce storage significantly.

Q: Can I archive pages behind login walls? A: Yes. You can configure browser cookies or use a logged-in Chrome profile to archive authenticated content.

Q: Does ArchiveBox respect robots.txt? A: Not by default. The stock wget arguments include -e robots=off, so wget fetches ignore robots.txt; you can change the wget arguments in configuration if you want it honored.

Q: Can I export my archive to another tool? A: Yes. Archives are stored as standard files (HTML, PDF, PNG, WARC) in plain directories that any tool or file browser can access directly.
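To put the 5-50 MB per page figure from the storage question into perspective, a back-of-the-envelope estimate can be computed as follows. The function name and the 10,000-page collection are illustrative assumptions; actual usage depends heavily on which formats are enabled and how media-heavy the pages are.

```python
def estimate_storage_gb(pages: int,
                        mb_per_page_low: float = 5.0,
                        mb_per_page_high: float = 50.0) -> tuple[float, float]:
    """Return a (low, high) estimate in GB, counting 1024 MB per GB."""
    low = pages * mb_per_page_low / 1024
    high = pages * mb_per_page_high / 1024
    return low, high


low, high = estimate_storage_gb(10_000)
print(f"{low:.1f} to {high:.1f} GB")  # prints "48.8 to 488.3 GB"
```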
