Skills · Apr 16, 2026 · 3 min read

ArchiveBox — Self-Hosted Web Archiving Platform

ArchiveBox is an open-source self-hosted web archiver that saves URLs as local HTML, PDF, screenshots, WARC, and more. Feed it bookmarks, browser history, or RSS feeds and it preserves everything for offline access.

Script Depot · Community

Agent-ready

Install with prior review

This asset requires review. The copied prompt asks for a dry run, shows the writes it would make, and continues only after confirmation.

Needs Confirmation · 64/100 · Policy: confirm

Agent surface

Any MCP/CLI agent

Type

Skill

Installation

Single

Trust

Trust: Established

Entry point

ArchiveBox Overview

Review-first command

npx -y tokrepo@latest install 358da384-39db-11f1-9bc6-00163e2b0d79 --target codex

Do a dry run first, confirm the writes, then run this command.

TL;DR

ArchiveBox preserves web pages as HTML, PDF, screenshots, and WARC for offline access.

§01

What it is

ArchiveBox is an open-source self-hosted web archiver that preserves web content before it disappears. It takes URLs from bookmarks, browser history, RSS feeds, or plain text and saves them in multiple formats: HTML, PDF, screenshot, WARC, media files, and Git repos.

ArchiveBox is built for anyone who wants a personal internet archive. It runs as a Python/Django application with a web UI for browsing, searching, and managing your archive. Each snapshot is stored as a directory of files with a JSON index, making archives portable and tool-independent.
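Because each snapshot is an ordinary folder, plain shell tools can inspect an archive even when ArchiveBox itself is not running. The sketch below is illustrative only: the function name list_snapshots is hypothetical, and the layout it assumes (one subfolder per snapshot, each holding an index.json) should be checked against your own data directory.

```shell
#!/bin/sh
# list_snapshots: print every snapshot folder under an archive directory
# that contains a JSON index.
# Assumption (not confirmed by this page): snapshots live one level below
# the directory you pass in, as <archive_dir>/<snapshot_id>/index.json.
list_snapshots() {
    archive_dir=$1
    for index in "$archive_dir"/*/index.json; do
        # When the glob matches nothing the literal pattern comes back; skip it.
        [ -f "$index" ] || continue
        dirname "$index"
    done
}
```

For example, `list_snapshots ./data/archive | wc -l` would count the snapshots found under a directory named ./data/archive (a placeholder path).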

§02

How it saves time or tokens

ArchiveBox automates the tedious work of saving web pages by hand. Instead of choosing 'Save as' on each page, you feed it a list of URLs (from a bookmarks export, browser history, or RSS feeds) and it captures each one in several formats in a single run. Scheduled archiving via cron keeps your RSS feeds and bookmark sources preserved automatically. Because each page is saved in several formats, you usually still have a readable copy when one of them fails to capture or render.
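Exported URL lists are often messy: blank lines, notes, duplicates, Windows line endings. A small filter can tidy a list before it is handed to ArchiveBox. This is a sketch under stated assumptions: clean_url_list is a hypothetical helper, and it keeps only lines that begin with http:// or https://. The cleaned list can then be piped to archivebox add, which reads URLs from standard input.

```shell
#!/bin/sh
# clean_url_list: read text on stdin and print a tidy URL list on stdout.
#  - strips carriage returns left by Windows-style exports
#  - keeps only lines that start with http:// or https://
#  - removes duplicate lines (output is sorted in byte order)
clean_url_list() {
    tr -d '\r' | grep -E '^https?://' | LC_ALL=C sort -u
}

# Hypothetical usage (bookmarks.txt and urls.txt are placeholder names):
#   clean_url_list < bookmarks.txt > urls.txt
```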

§03

How to use

Start ArchiveBox with Docker Compose:

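curl -fsSL https://docker-compose.archivebox.io -o docker-compose.yml
# First run only: initialize the collection and create an admin user
docker compose run archivebox init --setup
docker compose up -d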

Add URLs to archive:

docker compose exec --user=archivebox archivebox archivebox add 'https://example.com'
docker compose exec --user=archivebox archivebox archivebox add --depth=1 'https://news.ycombinator.com'

Browse your archive through the web UI at http://localhost:8000, where you can search by title, URL, or content.

§04

Example

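Scheduled archiving of an RSS feed with cron, plus two one-off imports:

# Add to crontab: archive new items from the HN RSS feed every hour.
# The entry must stay on one line (cron does not join backslash-continued lines).
# --depth=1 makes ArchiveBox fetch the feed and archive the links inside it.
0 * * * * cd /path/to/archivebox && docker compose exec -T --user=archivebox archivebox archivebox add --depth=1 'https://hnrss.org/newest?points=100'

# Archive all links from your Pinboard bookmarks
docker compose exec --user=archivebox archivebox archivebox add \
  --depth=1 \
  'https://feeds.pinboard.in/json/v1/posts/all?auth_token=user:TOKEN'

# Import a Pocket HTML export (-T lets the file be piped in on stdin)
docker compose exec -T --user=archivebox archivebox archivebox add \
  --parser=pocket_html < pocket_export.html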
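Hand-written crontab entries are easy to get wrong: cron reads each entry as a single line and treats an unescaped % in the command as a newline, which matters for URLs containing percent-encoded characters. The helper below is a hedged sketch, not part of ArchiveBox: cron_entry is a hypothetical function that prints one well-formed entry, and the docker compose command it embeds (including --user=archivebox and --depth=1) is an assumption you should adapt to your own setup.

```shell
#!/bin/sh
# cron_entry: print a single-line crontab entry that archives one feed URL.
# Arguments: $1 = schedule (the five cron fields, quoted as one string)
#            $2 = directory containing docker-compose.yml
#            $3 = feed URL
# Any % in the URL is escaped as \% because cron would otherwise
# treat it as a newline.
cron_entry() {
    schedule=$1
    project_dir=$2
    url=$(printf '%s' "$3" | sed 's/%/\\%/g')
    printf "%s cd %s && docker compose exec -T --user=archivebox archivebox archivebox add --depth=1 '%s'\n" \
        "$schedule" "$project_dir" "$url"
}

# Hypothetical usage: print the entry, review it, then paste it into `crontab -e`.
#   cron_entry '0 * * * *' /path/to/archivebox 'https://hnrss.org/newest?points=100'
```

Printing the entry for review, rather than installing it automatically, keeps the step easy to audit.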

§05

Related on TokRepo

Self-hosted tools — More self-hostable tools for data preservation and infrastructure.
Automation tools — Browse automation frameworks for scheduled data tasks.

§06

Common pitfalls

Running ArchiveBox without enough free disk space can cause snapshots to fail or be left incomplete, sometimes without an obvious error. Monitor disk usage, especially when archiving media-heavy sites with --depth=1.
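The SQLite index works well for small archives but slows down with tens of thousands of snapshots. ArchiveBox is built around SQLite and does not document PostgreSQL as a supported backend, so for larger installations keep the data directory on fast local storage and check the official documentation before attempting a different database.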
Archiving dynamic JavaScript-heavy sites without Chrome headless installed produces incomplete snapshots. Ensure Chromium is available in your Docker or host environment.
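The disk-space pitfall can be guarded against with a check that runs before each archiving job. This is a minimal sketch using POSIX df; free_kb and has_free_space are hypothetical helper names, and the threshold is yours to choose.

```shell
#!/bin/sh
# free_kb: print the free space, in 1024-byte blocks, on the filesystem
# that holds the given path. `df -Pk` prints one header row and one data
# row; the fourth column of the data row is the available space.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# has_free_space: succeed only if at least $2 kilobytes are free at path $1.
has_free_space() {
    path=$1
    min_kb=$2
    avail=$(free_kb "$path")
    [ -n "$avail" ] && [ "$avail" -ge "$min_kb" ]
}

# Hypothetical usage: refuse to start when less than 5 GiB (5242880 KiB)
# is free under ./data (a placeholder path).
#   has_free_space ./data 5242880 || { echo "low disk space" >&2; exit 1; }
```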

Frequently asked questions

What archive formats does ArchiveBox support?

ArchiveBox saves pages as static HTML (via wget), PDF (via Chrome headless), screenshot PNG, WARC (web archive), plain text, Git repos, audio and video files, and readability-extracted article text. Each URL gets saved in all available formats simultaneously.

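Can ArchiveBox archive entire websites?

Only partly. The --depth=1 flag archives the page itself plus every page it links to. As far as the add command is documented, depth is limited to 0 or 1, so ArchiveBox does not crawl a site recursively. To capture a whole site, feed it a full URL list, for example one built from a sitemap or an external crawler. Even one hop can pull in hundreds of pages, so start with a small test.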

How does ArchiveBox store data?

Each snapshot is stored as a directory containing the archived files and a JSON index. This makes archives portable and tool-independent. You can browse archives offline by opening the HTML files directly.

Can I search my ArchiveBox archive?

Yes. The web UI provides full-text search across all archived content. You can search by URL, page title, tags, or content within the archived pages.

Does ArchiveBox support scheduled archiving?

Yes. Set up a cron job to run archivebox add with RSS feed URLs or bookmark export URLs on a schedule. ArchiveBox deduplicates URLs automatically, so repeated runs only archive new content.
