Skills · Apr 16, 2026 · 3 min read

ArchiveBox — Self-Hosted Web Archiving Platform

ArchiveBox is an open-source self-hosted web archiver that saves URLs as local HTML, PDF, screenshots, WARC, and more. Feed it bookmarks, browser history, or RSS feeds and it preserves everything for offline access.

Agent-ready

Install with prior review

This asset requires review. The copied prompt requests a dry run, shows the writes, and continues only after confirmation.

Needs Confirmation · 64/100 · Policy: confirm
Agent surface: Any MCP/CLI agent
Type: Skill
Installation: Single
Trust: Established
Entry: ArchiveBox Overview

Command (review first):
npx -y tokrepo@latest install 358da384-39db-11f1-9bc6-00163e2b0d79 --target codex

Run a dry run first, confirm the writes, and then run this command.

TL;DR
ArchiveBox preserves web pages as HTML, PDF, screenshots, and WARC for offline access.
§01

What it is

ArchiveBox is an open-source self-hosted web archiver that preserves web content before it disappears. It takes URLs from bookmarks, browser history, RSS feeds, or plain text and saves them in multiple formats: HTML, PDF, screenshot, WARC, media files, and Git repos.

ArchiveBox is built for anyone who wants a personal internet archive. It runs as a Python/Django application with a web UI for browsing, searching, and managing your archive. Each snapshot is stored as a directory of files with a JSON index, making archives portable and tool-independent.
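
Because each snapshot is a plain directory with a JSON index, the archive can be inspected with ordinary shell tools. The following is a minimal sketch, assuming snapshots sit under `archive/<timestamp>/` inside the data directory with an `index.json` in each; verify that layout against your own data folder. The function name `list_snapshots` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print every snapshot directory that has a JSON index.
# Assumes the layout DATA_DIR/archive/<timestamp>/index.json.
list_snapshots() {
  data_dir=${1:-.}
  for index in "$data_dir"/archive/*/index.json; do
    # Skip the unexpanded glob pattern when no snapshots exist.
    [ -f "$index" ] || continue
    dirname "$index"
  done
}
```

For example, `list_snapshots /path/to/archivebox/data | wc -l` would count the snapshots that have an index, without going through the web UI or the database.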

§02

How it saves time or tokens

ArchiveBox automates the tedious work of saving web pages by hand. Instead of right-clicking 'Save as' on each page, you feed it a list of URLs (from a bookmarks export, browser history, or RSS feeds) and it captures each one in multiple formats in a single run. Scheduled archiving via cron means your RSS feeds and bookmark sources are preserved automatically. Because every page is stored in several formats, a readable copy usually survives even when one format fails to render.
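
As a rough illustration of the "feed it a list of URLs" step, the sketch below pulls every http(s) link out of an arbitrary text or HTML file and deduplicates the result. The helper name `extract_urls` and the file name `bookmarks.html` are hypothetical, and the regular expression is deliberately simple rather than a full URL parser.

```shell
#!/bin/sh
# Hypothetical helper: print each http(s) URL found in file $1, one per line,
# sorted and deduplicated. The pattern stops at whitespace, quotes, and angle brackets.
extract_urls() {
  grep -oE 'https?://[^[:space:]"<>]+' "$1" | LC_ALL=C sort -u
}

# Possible usage (assumes the Docker Compose setup from this page;
# -T disables the TTY so the list can be piped in on stdin):
#   extract_urls bookmarks.html | docker compose exec -T archivebox archivebox add
```

ArchiveBox can parse many export formats on its own, so a pre-filter like this is only useful when you want to review or trim the list before submitting it.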

§03

How to use

  1. Start ArchiveBox with Docker Compose:
     curl -fsSL https://docker-compose.archivebox.io -o docker-compose.yml
     docker compose up -d
  2. Add URLs to archive:
     docker compose exec archivebox archivebox add 'https://example.com'
     docker compose exec archivebox archivebox add --depth=1 'https://news.ycombinator.com'
  3. Browse your archive through the web UI at http://localhost:8000 and search by title, URL, or content.
§04

Example

Scheduled archiving of an RSS feed with cron:

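# Add to crontab: archive new items from the HN RSS feed every hour.
# A crontab entry must stay on one line; cron does not join backslash-continued lines.
# --depth=1 asks ArchiveBox to archive the links listed in the feed;
# with --depth=0 only the feed URL itself would be saved.
0 * * * * cd /path/to/archivebox && docker compose exec -T archivebox archivebox add --depth=1 'https://hnrss.org/newest?points=100'

# Archive all links from your Pinboard bookmarks (TOKEN is a placeholder)
docker compose exec archivebox archivebox add \
  --depth=1 \
  'https://feeds.pinboard.in/json/v1/posts/all?auth_token=user:TOKEN'

# Import a Pocket HTML export; -T disables the TTY so stdin can be redirected
docker compose exec -T archivebox archivebox add \
  --parser=pocket_html < pocket_export.html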
§06

Common pitfalls

  • Running ArchiveBox without enough disk space can cause archive jobs to fail, sometimes without an obvious error. Monitor disk usage, especially when archiving media-heavy sites with --depth=1.
  • The default SQLite database works well for small archives but slows down with tens of thousands of snapshots. Before planning a move to PostgreSQL for a larger installation, check the ArchiveBox documentation to confirm that your version supports it.
  • Archiving JavaScript-heavy sites without headless Chrome installed produces incomplete snapshots. Ensure Chromium is available in your Docker or host environment.
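
The disk-space pitfall can be guarded against with a small pre-flight check. This is a sketch only: the function name `has_free_space` and the 5 GiB threshold in the usage line are assumptions, not ArchiveBox settings.

```shell
#!/bin/sh
# Hypothetical guard: succeed only if the filesystem holding path $1
# has at least $2 KiB available.
has_free_space() {
  path=$1
  min_kb=$2
  # df -Pk prints POSIX-format output in 1024-byte blocks;
  # the fourth column of the second line is the available space.
  avail_kb=$(df -Pk "$path" | awk 'NR == 2 { print $4 }')
  [ "$avail_kb" -ge "$min_kb" ]
}

# Possible usage: skip the archive run when less than about 5 GiB is free.
#   has_free_space /path/to/archivebox 5242880 && \
#     docker compose exec archivebox archivebox add --depth=1 'https://example.com'
```

Running the check immediately before each scheduled job keeps a full disk from turning into a batch of incomplete snapshots.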

Frequently asked questions

What archive formats does ArchiveBox support?

ArchiveBox saves pages as static HTML (via wget), PDF (via Chrome headless), screenshot PNG, WARC (web archive), plain text, Git repos, audio and video files, and readability-extracted article text. Each URL gets saved in all available formats simultaneously.

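Can ArchiveBox archive entire websites?

Partly. The --depth=1 flag archives a page plus every page it links to, which can already mean hundreds of snapshots for a link-heavy page. Going deeper than one hop multiplies the page count at every level, and released versions of ArchiveBox may accept only --depth=0 or --depth=1, so a full-site crawl generally needs a separate crawler to produce the URL list first.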

How does ArchiveBox store data?

Each snapshot is stored as a directory containing the archived files and a JSON index. This makes archives portable and tool-independent. You can browse archives offline by opening the HTML files directly.

Can I search my ArchiveBox archive?

Yes. The web UI provides full-text search across all archived content. You can search by URL, page title, tags, or content within the archived pages.

Does ArchiveBox support scheduled archiving?

Yes. Set up a cron job to run archivebox add with RSS feed URLs or bookmark export URLs on a schedule. ArchiveBox deduplicates URLs automatically, so repeated runs only archive new content.
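
ArchiveBox handles deduplication itself, so nothing extra is required. Purely to illustrate why repeated runs stay cheap, the sketch below shows the same idea in shell: keep a list of URLs already seen and pass along only the ones that are new. The helper `new_urls` and both file arguments are hypothetical, and this is not how ArchiveBox implements it internally.

```shell
#!/bin/sh
# Hypothetical helper: print the lines of candidate file $1 that do not
# appear in seen-list file $2, dropping repeats within the candidates too.
new_urls() {
  candidates=$1
  seen=$2
  # Create an empty seen-list on the first run.
  [ -f "$seen" ] || : > "$seen"
  # Load the seen-list into a lookup table, then print each candidate
  # only the first time it is encountered.
  awk 'FILENAME == ARGV[1] { seen[$0] = 1; next } !seen[$0]++' "$seen" "$candidates"
}
```

A pre-filter like this could be useful when a feed is very large and you want to avoid resubmitting thousands of known URLs on every cron run.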
