Skills · Apr 10, 2026 · 3 min read

Paperless-ngx — Self-Hosted Document Management with OCR

Paperless-ngx is an open-source document management system that ingests scanned paper and digital documents, then OCRs, indexes, and archives them for full-text search.

Script Depot · Community

Agent ready

Safe staging for this asset

This asset is staged first. The copied prompt tells the agent to inspect the staged files and ask before activating scripts, MCP config, or global config.

Stage only · 29/100 · Policy: stage

Agent surface: Any MCP/CLI agent
Kind: Skill
Install: Stage only
Trust: Established
Entrypoint: step-1.md

Safe staging command

npx -y tokrepo@latest install de0041a5-34b7-11f1-9bc6-00163e2b0d79 --target codex

Stages files first; activation requires review of the staged README and plan.

TL;DR

Paperless-ngx OCRs, indexes, and archives your scanned and digital documents for full-text search, all self-hosted.

§01

What it is

Paperless-ngx is an open-source document management system that digitizes, OCRs, indexes, and archives both physical and digital documents. Drop a scanned document or PDF into the system, and it automatically extracts text via OCR, applies tags, assigns correspondents, and makes everything searchable. It runs self-hosted via Docker.

This tool is for individuals and organizations who want to go paperless without relying on cloud document services. It is also useful for developers building document processing pipelines.

§02

How it saves time or tokens

Paperless-ngx automates the tedious process of organizing documents. The OCR engine extracts text from scanned images, making them searchable without manual data entry. Machine learning-based auto-tagging learns your categorization patterns and applies them to new documents. For AI workflows, the full-text search API provides document retrieval that can feed into RAG systems.
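As a rough illustration of the RAG hand-off, the sketch below turns search hits into a single context string for a prompt. It assumes each hit is a dict with `title` and `content` keys, as in the document objects returned by the REST API; `build_context` and `max_chars` are hypothetical names introduced only for this example.

```python
def build_context(hits, max_chars=2000):
    """Concatenate document snippets into one context string for an LLM prompt.

    `hits` is assumed to be a list of dicts with `title` and `content` keys.
    """
    parts = []
    used = 0
    for hit in hits:
        snippet = f"## {hit['title']}\n{hit['content'].strip()}\n"
        if used + len(snippet) > max_chars:
            # Truncate the last snippet so the context stays within budget.
            snippet = snippet[: max_chars - used]
        if not snippet:
            break
        parts.append(snippet)
        used += len(snippet)
    return "".join(parts)


# Hand-written sample hits (hypothetical data, not real API output).
hits = [
    {"title": "Invoice 2026-01", "content": "Total due: 120 EUR"},
    {"title": "Invoice 2026-02", "content": "Total due: 80 EUR"},
]
print(build_context(hits, max_chars=60))
```

A character budget is the simplest possible limit; a real pipeline would more likely count tokens and rank snippets before trimming.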

§03

How to use

1. Deploy Paperless-ngx via Docker Compose.
2. Configure consumption directories for document ingestion.
3. Drop documents into the consumption folder.
4. Search and manage documents through the web UI.

# Clone the repository and enter the Compose templates directory
git clone https://github.com/paperless-ngx/paperless-ngx.git
cd paperless-ngx/docker/compose

# Pick a Compose template (SQLite shown; PostgreSQL and MariaDB variants sit alongside it)
cp docker-compose.sqlite.yml docker-compose.yml

# docker-compose.env ships in this directory; review and adjust it before starting

# Start the stack
docker compose up -d

# Create admin user
docker compose exec webserver python3 manage.py createsuperuser

# Access at http://localhost:8000
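The ingestion steps (configure a consumption directory, then drop files into it) can also be scripted. The sketch below copies a file into a consumption folder on the host. It assumes that folder is the host directory mounted into the container as the consume directory; `drop_into_consume` is a hypothetical helper, and the temporary paths exist only for the demonstration.

```python
import shutil
import tempfile
from pathlib import Path


def drop_into_consume(src, consume_dir):
    """Copy `src` into the consumption folder and return the new path.

    Refuses to overwrite an existing file of the same name.
    """
    src = Path(src)
    consume_dir = Path(consume_dir)
    if not consume_dir.is_dir():
        raise NotADirectoryError(consume_dir)
    target = consume_dir / src.name
    if target.exists():
        raise FileExistsError(target)
    shutil.copy2(src, target)  # copy2 also preserves file timestamps
    return target


# Demonstration in a throwaway directory instead of a real consume folder.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    consume = root / "consume"
    consume.mkdir()
    scan = root / "receipt.pdf"
    scan.write_bytes(b"%PDF-1.4 demo bytes")
    print(drop_into_consume(scan, consume).name)
```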

§04

Example

API usage for document search:

import requests

api_url = 'http://localhost:8000/api'
headers = {'Authorization': 'Token your-api-token'}

# Full-text search via the `query` parameter
response = requests.get(
    f'{api_url}/documents/',
    headers=headers,
    params={'query': 'invoice 2026'},
    timeout=10,
)
response.raise_for_status()  # fail on HTTP errors instead of a KeyError below

for doc in response.json()['results']:
    print(f"{doc['title']} - {doc['created']}")
    # `tags` holds numeric tag IDs, not tag names
    print(f"Tag IDs: {doc['tags']}")
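The `tags` field in each search result holds numeric IDs rather than names. A minimal sketch of resolving them follows, assuming the `/tags/` endpoint returns the usual paginated shape with `results` and `next` keys. It uses the standard library instead of `requests` so it runs without extra packages; `fetch_tag_names` and `tag_labels` are hypothetical helper names.

```python
import json
import urllib.request


def fetch_tag_names(api_url, token):
    """Return {tag_id: tag_name} by walking the paginated /tags/ endpoint."""
    names = {}
    url = f"{api_url}/tags/"
    while url:
        request = urllib.request.Request(
            url, headers={"Authorization": f"Token {token}"}
        )
        with urllib.request.urlopen(request, timeout=10) as response:
            page = json.load(response)
        names.update({tag["id"]: tag["name"] for tag in page["results"]})
        url = page.get("next")  # None on the last page
    return names


def tag_labels(doc, names):
    """Translate a document's tag IDs into names, keeping unknown IDs visible."""
    return [names.get(tag_id, f"#{tag_id}") for tag_id in doc["tags"]]


# Offline demonstration with a hand-written mapping (hypothetical data).
print(tag_labels({"tags": [1, 3]}, {1: "invoice", 2: "tax"}))
```

Against a live instance, `fetch_tag_names(api_url, token)` would supply the mapping passed as `names`.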

§05

Related on TokRepo

Self-hosted solutions — More self-hosted tools
Document tools — Document processing and management

§06

Common pitfalls

OCR quality depends on scan quality. Ensure documents are scanned at 300+ DPI for reliable text extraction.
The initial setup requires Docker and basic Docker Compose knowledge. The configuration file has many options.
Storage grows with your document collection. Plan disk space for both originals and generated thumbnails.
Auto-tagging requires training data. As a rule of thumb, expect to tag roughly 10 documents per tag by hand before the ML classifier starts making useful suggestions.
Paperless-ngx does not handle encrypted PDFs. Decrypt them before ingestion.
Review the official documentation before deploying to production to ensure compatibility with your specific environment and requirements.
Start with default settings and customize incrementally. Changing too many configuration options at once makes debugging harder.
Keep your installation updated to the latest stable version. Security patches and bug fixes are released regularly.
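To make the 300 DPI guideline from the first pitfall concrete, the sketch below estimates a scan's effective resolution from its pixel dimensions and the paper size. The A4 defaults and the names `effective_dpi` and `is_ocr_ready` are assumptions for illustration; nothing here is part of Paperless-ngx.

```python
MM_PER_INCH = 25.4


def effective_dpi(pixel_width, pixel_height, paper_width_mm=210, paper_height_mm=297):
    """Return the lower of the horizontal and vertical DPI of a scanned page.

    Paper size defaults to A4 (210 mm x 297 mm).
    """
    horizontal = pixel_width / (paper_width_mm / MM_PER_INCH)
    vertical = pixel_height / (paper_height_mm / MM_PER_INCH)
    return min(horizontal, vertical)


def is_ocr_ready(pixel_width, pixel_height, minimum_dpi=300):
    """Check a scan against the minimum DPI, rounding away sub-pixel error."""
    return round(effective_dpi(pixel_width, pixel_height)) >= minimum_dpi


print(is_ocr_ready(1240, 1754))  # an A4 page scanned at about 150 DPI
print(is_ocr_ready(2480, 3508))  # an A4 page scanned at about 300 DPI
```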

Frequently Asked Questions

What OCR engine does Paperless-ngx use?

Paperless-ngx runs OCR through OCRmyPDF, which uses the Tesseract engine by default; Tesseract supports over 100 languages. It also supports alternative OCR backends. The OCR runs automatically on ingested documents.

Can I use Paperless-ngx on a Raspberry Pi?

Yes. Paperless-ngx has ARM Docker images that run on Raspberry Pi 4 and later. Performance will be slower than on a full server, especially for OCR processing, but it works for personal document management.

Does it support mobile access?

The web UI is responsive and works on mobile browsers. There are also community-maintained mobile apps for Android and iOS that connect to your Paperless-ngx instance.

How does auto-tagging work?

Paperless-ngx uses a machine learning classifier that learns from your tagging patterns. After you manually tag enough documents, it suggests tags for new documents. The more you use it, the more accurate suggestions become.

Can I import existing digital documents?

Yes. Drop PDF, PNG, JPG, TIFF, and other document formats into the consumption directory. Paperless-ngx processes them automatically. You can also use the web UI or API to upload documents directly.
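For the API upload route, here is a minimal sketch that builds a multipart request with the standard library. It assumes the upload endpoint is `/api/documents/post_document/` with the file in a `document` form field and an optional `title` field; check the API documentation for your version. `build_upload_request` is a hypothetical helper, and the demo only builds the request without sending it.

```python
import urllib.request
import uuid


def build_upload_request(api_url, token, filename, data, title=None):
    """Build (but do not send) a multipart POST for the document upload endpoint."""
    boundary = uuid.uuid4().hex
    lines = []
    if title is not None:
        lines += [
            f"--{boundary}".encode(),
            b'Content-Disposition: form-data; name="title"',
            b"",
            title.encode("utf-8"),
        ]
    lines += [
        f"--{boundary}".encode(),
        f'Content-Disposition: form-data; name="document"; filename="{filename}"'.encode(),
        b"Content-Type: application/octet-stream",
        b"",
        data,
        f"--{boundary}--".encode(),
        b"",
    ]
    return urllib.request.Request(
        f"{api_url}/documents/post_document/",  # assumed endpoint path
        data=b"\r\n".join(lines),
        method="POST",
        headers={
            "Authorization": f"Token {token}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


request = build_upload_request(
    "http://localhost:8000/api", "your-api-token", "scan.pdf",
    b"%PDF-1.4 demo bytes", title="Scan",
)
print(request.get_method(), request.full_url)
# To send it against a running instance: urllib.request.urlopen(request, timeout=30)
```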

