Cette page est affichée en anglais. Une traduction française est en cours.
SkillsApr 16, 2026·3 min de lecture

ArchiveBox — Self-Hosted Web Archiving Platform

ArchiveBox is an open-source self-hosted web archiver that saves URLs as local HTML, PDF, screenshots, WARC, and more. Feed it bookmarks, browser history, or RSS feeds and it preserves everything for offline access.

Prêt pour agents

Installation avec revue préalable

Cet actif nécessite une revue. Le prompt copié demande un dry-run, affiche les écritures, puis continue seulement après confirmation.

Needs Confirmation · 64/100Policy : confirmer
Surface agent
Tout agent MCP/CLI
Type
Skill
Installation
Single
Confiance
Confiance : Established
Point d'entrée
ArchiveBox Overview
Commande avec revue préalable
npx -y tokrepo@latest install 358da384-39db-11f1-9bc6-00163e2b0d79 --target codex

Dry-run d'abord, confirmez les écritures, puis lancez cette commande.

TL;DR
ArchiveBox preserves web pages as HTML, PDF, screenshots, and WARC for offline access.
§01

What it is

ArchiveBox is an open-source self-hosted web archiver that preserves web content before it disappears. It takes URLs from bookmarks, browser history, RSS feeds, or plain text and saves them in multiple formats: HTML, PDF, screenshot, WARC, media files, and Git repos.

ArchiveBox is built for anyone who wants a personal internet archive. It runs as a Python/Django application with a web UI for browsing, searching, and managing your archive. Each snapshot is stored as a directory of files with a JSON index, making archives portable and tool-independent.

§02

How it saves time or tokens

ArchiveBox automates the tedious work of saving web pages manually. Instead of right-clicking 'Save as' on each page, you feed it a list of URLs (from bookmarks export, browser history, or RSS feeds) and it captures everything in multiple formats simultaneously. Scheduled archiving via cron means your RSS feeds and bookmark sources are preserved automatically. The multi-format approach ensures you always have a readable copy even when the original format fails to render.

§03

How to use

  1. Start ArchiveBox with Docker Compose:
curl -fsSL https://docker-compose.archivebox.io -o docker-compose.yml
docker compose up -d
  1. Add URLs to archive:
docker compose exec archivebox archivebox add 'https://example.com'
docker compose exec archivebox archivebox add --depth=1 'https://news.ycombinator.com'
  1. Browse your archive through the web UI at http://localhost:8000, search by title, URL, or content.
§04

Example

Scheduled archiving of an RSS feed with cron:

# Add to crontab: archive new items from HN RSS every hour
0 * * * * cd /path/to/archivebox && \
  docker compose exec -T archivebox archivebox add \
  --depth=0 \
  'https://hnrss.org/newest?points=100'

# Archive all links from your Pinboard bookmarks
docker compose exec archivebox archivebox add \
  'https://feeds.pinboard.in/json/v1/posts/all?auth_token=user:TOKEN'

# Import browser history
docker compose exec archivebox archivebox add \
  --parser=pocket_html < pocket_export.html
§05

Related on TokRepo

§06

Common pitfalls

  • Running ArchiveBox without sufficient disk space causes archives to fail silently. Monitor disk usage, especially when archiving media-heavy sites with --depth=1.
  • The default SQLite database works for small archives but slows down with tens of thousands of snapshots. Consider switching to PostgreSQL for larger installations.
  • Archiving dynamic JavaScript-heavy sites without Chrome headless installed produces incomplete snapshots. Ensure Chromium is available in your Docker or host environment.

Questions fréquentes

What archive formats does ArchiveBox support?+

ArchiveBox saves pages as static HTML (via wget), PDF (via Chrome headless), screenshot PNG, WARC (web archive), plain text, Git repos, audio and video files, and readability-extracted article text. Each URL gets saved in all available formats simultaneously.

Can ArchiveBox archive entire websites?+

Yes. Use the --depth=1 flag to follow all links on a page and archive them too. Be cautious with depth values above 1, as the number of pages grows exponentially.

How does ArchiveBox store data?+

Each snapshot is stored as a directory containing the archived files and a JSON index. This makes archives portable and tool-independent. You can browse archives offline by opening the HTML files directly.

Can I search my ArchiveBox archive?+

Yes. The web UI provides full-text search across all archived content. You can search by URL, page title, tags, or content within the archived pages.

Does ArchiveBox support scheduled archiving?+

Yes. Set up a cron job to run archivebox add with RSS feed URLs or bookmark export URLs on a schedule. ArchiveBox deduplicates URLs automatically, so repeated runs only archive new content.

Sources citées (3)

Fil de discussion

Connectez-vous pour rejoindre la discussion.
Aucun commentaire pour l'instant. Soyez le premier à partager votre avis.

Actifs similaires