# ArchiveBox — Self-Hosted Web Archiving Platform

> ArchiveBox is an open-source self-hosted web archiver that saves URLs as local HTML, PDF, screenshots, WARC, and more. Feed it bookmarks, browser history, or RSS feeds and it preserves everything for offline access.

## Quick Use
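```bash
# Docker Compose (recommended)
curl -fsSL https://docker-compose.archivebox.io -o docker-compose.yml

# First run only: initialize the data folder and create an admin user
docker compose run archivebox init --setup

# Start the web UI in the background
docker compose up -d

# Add URLs to archive (one-off containers via `run`, which goes through the image entrypoint)
docker compose run archivebox add "https://example.com"
docker compose run archivebox add --depth=1 "https://news.ycombinator.com"
```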

## Introduction
ArchiveBox preserves web content you care about before it disappears. It takes URLs from bookmarks, browser history, RSS feeds, or plain text and saves them in multiple formats — HTML, PDF, screenshot, WARC — so you always have a local copy, even when the original goes offline.

## What ArchiveBox Does
- Archives web pages in multiple formats: HTML, PDF, screenshot, WARC, media, and Git repos
- Accepts input from bookmarks exports, browser history, RSS feeds, and plain URL lists
- Provides a web UI for browsing, searching, and managing your archive
- Extracts and saves embedded media including images, videos, audio, and documents
- Schedules automatic archiving of RSS feeds and bookmark sources on a cron interval
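
As a rough sketch of the list-based input and scheduling described above, the commands below build a plain URL list, clean it, and hand it to ArchiveBox. They assume the `archivebox` CLI is installed locally and that you run them from inside an initialized data directory; `urls.txt`, `urls.clean.txt`, and the feed URL are placeholder names, not anything ArchiveBox requires.

```bash
# Build a plain-text URL list, one URL per line (urls.txt is a placeholder name)
cat > urls.txt <<'EOF'
https://example.com
https://example.org/some/article
https://example.com
EOF

# Drop duplicate and blank lines before importing
sort -u urls.txt | grep -v '^[[:space:]]*$' > urls.clean.txt

# Import the list via stdin, then schedule a daily re-import of a feed.
# Skipped entirely when the archivebox CLI is not on PATH.
if command -v archivebox >/dev/null 2>&1; then
    archivebox add < urls.clean.txt
    archivebox schedule --every=day --depth=1 'https://example.com/feed.rss'
fi
```

Under Docker Compose the same subcommands would be prefixed with `docker compose run archivebox`; piping a file over stdin may additionally need the `-T` flag on `run`.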

## Architecture Overview
ArchiveBox is a Python application built on Django, with SQLite as its default database. It orchestrates a suite of external tools (wget, headless Chrome, yt-dlp or youtube-dl depending on the release, readability, and mercury-parser) to capture each URL in several formats. Each snapshot is stored as a directory of files with a JSON index, which keeps archives portable and readable without ArchiveBox itself.
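
To illustrate the on-disk layout, here is a minimal sketch that finds snapshot folders by looking for their per-snapshot `index.json`. It assumes snapshots live under `archive/<timestamp>/` inside the data directory, which may differ between releases; `list_snapshots` is a hypothetical helper, not part of ArchiveBox.

```bash
# list_snapshots DATA_DIR
# Prints every snapshot directory under DATA_DIR/archive that contains an index.json.
list_snapshots() {
    for index in "${1:-.}"/archive/*/index.json; do
        # An unmatched glob stays literal, so confirm the file really exists
        [ -f "$index" ] || continue
        dirname "$index"
    done
}

# Example: count the snapshots in the current data directory
list_snapshots . | wc -l
```

Because this only reads plain files, it works on a copied or backed-up archive without ArchiveBox installed.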

## Self-Hosting & Configuration
- Deploy via Docker Compose, pip install, or Homebrew on macOS and Linux
- Configure output formats, archiving depth, and tool preferences via ArchiveBox.conf
- Set up scheduled imports from RSS feeds, Pinboard, Pocket, or browser bookmark exports
- SQLite is the default database; check the official documentation before planning a deployment around PostgreSQL or another backend
- Serve archives publicly or restrict access with Django authentication
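
The configuration items above can be sketched with the `archivebox config --set` subcommand, which writes to `ArchiveBox.conf`. The option names used here (`SAVE_WARC`, `SAVE_MEDIA`, `PUBLIC_INDEX`, `PUBLIC_SNAPSHOTS`, `PUBLIC_ADD_VIEW`) are taken from the 0.7.x releases and may be named differently in other versions; `set_option` and `DRY_RUN` are hypothetical helpers added so the changes can be previewed first.

```bash
# Preview by default; set DRY_RUN=0 to actually apply the changes
DRY_RUN="${DRY_RUN:-1}"

# set_option KEY=VALUE
# Hypothetical wrapper: prints the command in dry-run mode, otherwise runs it.
set_option() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "archivebox config --set $1"
    else
        archivebox config --set "$1"
    fi
}

# Turn off two of the heavier output formats
set_option SAVE_WARC=False
set_option SAVE_MEDIA=False

# Require login for the index, snapshot pages, and the add form
set_option PUBLIC_INDEX=False
set_option PUBLIC_SNAPSHOTS=False
set_option PUBLIC_ADD_VIEW=False
```

With access restricted, an account for the web UI can be created through Django's management command, `archivebox manage createsuperuser`.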

## Key Features
- Multi-format preservation ensures content survives even if one format fails
- Full-text search across all archived page content and metadata
- Browser extension and bookmarklet for one-click archiving of any page
- Portable archive format — each snapshot is a standalone folder of standard files
- Deduplication and incremental archiving to save storage on repeated URLs

## Comparison with Similar Tools
- **Wallabag** — Read-it-later app focused on article reading, not full multi-format archiving
- **SingleFile** — Browser extension that saves single pages, but lacks batch processing and scheduling
- **HTTrack** — Classic website copier for mirroring entire sites, but no PDF/screenshot/WARC support
- **Webrecorder/Conifer** — WARC-focused archiving with replay, but requires more technical setup
- **Pocket** — Cloud-based bookmarking without self-hosted option or multi-format local storage

## FAQ
**Q: How much storage does ArchiveBox use per page?**
A: A typical page with all formats enabled uses 5-50 MB. You can disable formats like screenshots or WARC to reduce storage significantly.
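
To see where the space actually goes in your own archive, a small sketch like the following ranks snapshot folders by size. It assumes snapshots sit under `archive/` in the data directory; `snapshot_sizes` is a hypothetical helper built only on `du`, `sort`, and `head`.

```bash
# snapshot_sizes DATA_DIR [N]
# Prints the N largest snapshot folders (default 10), sizes in KiB, largest first.
snapshot_sizes() {
    du -sk "${1:-.}"/archive/*/ 2>/dev/null | sort -rn | head -n "${2:-10}"
}

# Example usage (path is a placeholder):
#   snapshot_sizes ~/archivebox/data 5
```

Listing the files inside the largest folders shows which output formats dominate and are worth disabling.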

**Q: Can I archive pages behind login walls?**
A: Yes. Point the `COOKIES_FILE` option at an exported cookies.txt file, or set `CHROME_USER_DATA_DIR` to a logged-in Chrome profile, to archive authenticated content.

**Q: Does ArchiveBox respect robots.txt?**
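A: Not by default. ArchiveBox is aimed at personal archiving of URLs you hand it rather than open-ended crawling, and its stock wget arguments include `-e robots=off`. If you want robots.txt honored, adjust the wget arguments in configuration.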

**Q: Can I export my archive to another tool?**
A: Yes. Archives are stored as standard files (HTML, PDF, PNG, WARC) in plain directories that any tool or file browser can access directly.

## Sources
- https://github.com/ArchiveBox/ArchiveBox
- https://archivebox.io

---
Source: https://tokrepo.com/en/workflows/358da384-39db-11f1-9bc6-00163e2b0d79
Author: Script Depot