Paperless-ngx — Self-Hosted Document Management with OCR

What Paperless-ngx Does

Document Ingestion: Drop PDFs, images, or Office docs into the consume folder and Paperless processes them automatically
OCR: Tesseract-powered OCR extracts text from scanned documents and images (100+ languages)
Full-Text Search: Search across all document content, not just filenames
Auto-Tagging: Machine learning-powered automatic classification with tags, document types, and correspondents
Email Consumption: Automatically import documents from email attachments
Scanner Integration: Works with any scanner that can output to a folder or email
Mobile Scanning: Upload from phone camera or scanning apps
File Naming: Automatic, template-based file renaming and organization on disk
Multi-user: Role-based access with per-user document visibility

Architecture

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Web UI      │────▶│  Paperless   │────▶│  PostgreSQL  │
│  (Angular)   │     │  Server      │     │  + Redis     │
└──────────────┘     │  (Django)    │     └──────────────┘
                     └──────┬───────┘
                            │
         ┌──────────────────┼──────────────────┐
         │                  │                  │
  ┌──────┴──┐        ┌─────┴───┐       ┌──────┴──┐
  │Consume  │        │Tesseract│       │ Gotenberg│
  │ Folder  │        │  (OCR)  │       │(Convert) │
  └─────────┘        └─────────┘       └──────────┘

Self-Hosting

Docker Compose (Recommended)

services:
  paperless-webserver:
    image: ghcr.io/paperless-ngx/paperless-ngx:latest
    ports:
      - "8000:8000"
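    environment:
      PAPERLESS_REDIS: redis://redis:6379
      PAPERLESS_DBHOST: db
      PAPERLESS_DBNAME: paperless
      PAPERLESS_DBUSER: paperless
      PAPERLESS_DBPASS: paperless
      # Replace with a long random string before exposing the instance
      PAPERLESS_SECRET_KEY: your-very-long-secret-key
      # chi_sim is not bundled in the image; PAPERLESS_OCR_LANGUAGES installs it at startup
      PAPERLESS_OCR_LANGUAGES: chi-sim
      PAPERLESS_OCR_LANGUAGE: eng+chi_sim
      PAPERLESS_TIME_ZONE: Asia/Shanghai
      # Without these three settings the gotenberg and tika containers are never used
      PAPERLESS_TIKA_ENABLED: 1
      PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000
      PAPERLESS_TIKA_ENDPOINT: http://tika:9998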
    volumes:
      - data:/usr/src/paperless/data
      - media:/usr/src/paperless/media
      - consume:/usr/src/paperless/consume
      - export:/usr/src/paperless/export
    depends_on:
      - db
      - redis
      - gotenberg
      - tika

  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_USER: paperless
      POSTGRES_PASSWORD: paperless
      POSTGRES_DB: paperless
    volumes:
      - pgdata:/var/lib/postgresql/data

  redis:
    image: redis:7-alpine

  gotenberg:
    image: gotenberg/gotenberg:8
    command: gotenberg --chromium-disable-javascript=true

  tika:
    image: apache/tika:latest

volumes:
  data:
  media:
  pgdata:
  consume:
  export:
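
The PAPERLESS_SECRET_KEY placeholder in the compose file should be swapped for a long random value. A minimal sketch for generating one with Python's standard library; the function name and the 64-byte default are illustrative choices, not requirements taken from the Paperless-ngx documentation:

```python
import secrets


def make_secret_key(num_bytes: int = 64) -> str:
    """Return a URL-safe random string to paste into PAPERLESS_SECRET_KEY."""
    # token_urlsafe reads from the OS random source and Base64-encodes the
    # bytes, so the result only contains letters, digits, '-' and '_'.
    return secrets.token_urlsafe(num_bytes)


if __name__ == "__main__":
    print(make_secret_key())
```

Quoting the value in the YAML file avoids surprises if the generated string happens to start with a dash.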

Workflow

1. Ingest Documents

Methods:
├── Drop files into /consume folder
├── Upload via web UI (drag & drop)
├── Email (IMAP polling)
├── Mobile app (scan & upload)
└── API upload
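
The first method in the list, dropping a file into the consume folder, can be scripted. The sketch below is one way to do it, assuming the consume directory is reachable from the host as a normal path (with the named Docker volume shown earlier you would need a bind mount instead). The helper name `drop_into_consume` is hypothetical:

```python
import shutil
from pathlib import Path


def drop_into_consume(src: str, consume_dir: str) -> Path:
    """Copy src into consume_dir without overwriting an existing file."""
    source = Path(src)
    target_dir = Path(consume_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    target = target_dir / source.name
    counter = 1
    # If the name is taken, append -1, -2, ... before the extension.
    while target.exists():
        target = target_dir / f"{source.stem}-{counter}{source.suffix}"
        counter += 1

    # copy2 keeps the file's timestamps along with its contents.
    shutil.copy2(source, target)
    return target
```

Calling `drop_into_consume("invoice.pdf", "/path/to/consume")` returns the path the file was written to; the consumer picks it up from there.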

2. Automatic Processing

Document dropped
  → Detect file type (PDF, image, Office, etc.)
  → Convert Office documents to PDF if needed (Tika + Gotenberg)
  → OCR text extraction (Tesseract), archived as PDF/A
  → Full-text indexing
  → ML-based auto-tagging
  → Auto-assign correspondent & document type
  → Rename and archive file
  → Thumbnail generation
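
The processing order above can be modeled as a short list of stage names. This is a toy model of the list, not Paperless-ngx's internal code: the stage names and the set of Office extensions are assumptions made for illustration.

```python
from pathlib import Path

# Assumed set of Office-style extensions that need conversion first.
OFFICE_EXTENSIONS = {
    ".doc", ".docx", ".odt",
    ".xls", ".xlsx", ".ods",
    ".ppt", ".pptx", ".odp",
}


def plan_stages(filename: str) -> list[str]:
    """Return the ordered stage names a file would pass through."""
    suffix = Path(filename).suffix.lower()
    stages = ["detect_type"]
    # Only Office-style files get the extra conversion step.
    if suffix in OFFICE_EXTENSIONS:
        stages.append("convert_to_pdf")
    stages += [
        "ocr",
        "index",
        "auto_tag",
        "assign_metadata",
        "rename_and_archive",
        "thumbnail",
    ]
    return stages
```

A scanned image skips straight from type detection to OCR, while a `.docx` file picks up the conversion step in between.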

3. Search & Organize

Search: "invoice 2024 electricity"
  → Results with highlighted matching text
  → Filter by date range, tags, correspondent
  → Sort by relevance, date, or title
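
The same search and filters can be expressed as query parameters on the documents endpoint. The sketch below only builds the URL. The filter names (`tags__id__all`, `correspondent__id`, `created__date__gt`, `ordering`) are my assumption of the REST API's filter fields and should be checked against the API documentation for your version; `build_search_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def build_search_url(base_url, query, tag_ids=None, correspondent_id=None,
                     created_after=None, ordering=None):
    """Build a /api/documents/ URL for a full-text query plus optional filters."""
    params = [("query", query)]
    if tag_ids:
        # Assumed filter: documents that carry all of the given tag IDs.
        params.append(("tags__id__all", ",".join(str(t) for t in tag_ids)))
    if correspondent_id is not None:
        params.append(("correspondent__id", str(correspondent_id)))
    if created_after:
        # Assumed filter: ISO date string such as "2024-01-01".
        params.append(("created__date__gt", created_after))
    if ordering:
        params.append(("ordering", ordering))
    # urlencode escapes spaces as '+' and commas as %2C.
    return f"{base_url.rstrip('/')}/api/documents/?{urlencode(params)}"
```

For example, `build_search_url("http://localhost:8000", "invoice 2024 electricity")` yields the plain full-text query, and the optional arguments narrow it down.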

Key Features

Auto-Classification

Paperless learns from your manual tagging and starts auto-classifying:

After you tag 10+ electricity bills:
  → New electricity bills auto-tagged as "Bills" + "Electricity"
  → Correspondent auto-set to "Power Company"
  → Document type auto-set to "Invoice"
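
To make the idea of learning from manual tags concrete, here is a deliberately tiny word-overlap suggester. It is a stand-in for illustration only and does not reproduce the classifier Paperless-ngx ships; the class and method names are hypothetical.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class ToyTagSuggester:
    def __init__(self):
        # Maps each tag to the word frequencies seen in documents with that tag.
        self.word_counts = defaultdict(Counter)

    def learn(self, text, tag):
        """Record the words of a manually tagged document."""
        self.word_counts[tag].update(tokenize(text))

    def suggest(self, text):
        """Return the tag whose learned words overlap most, or None."""
        words = set(tokenize(text))
        best_tag, best_score = None, 0
        for tag, counts in self.word_counts.items():
            # Counter returns 0 for unseen words, so unknown words add nothing.
            score = sum(counts[w] for w in words)
            if score > best_score:
                best_tag, best_score = tag, score
        return best_tag
```

After a few calls to `learn` with tagged electricity bills, `suggest` starts returning that tag for new text that shares their vocabulary. Raw overlap counts favor tags with more training text, which is one reason a real classifier needs more than this.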

File Naming Templates

# Template: {created_year}/{correspondent}/{title}
# Result:
2024/
├── Amazon/
│   ├── Order-123-receipt.pdf
│   └── Order-456-receipt.pdf
├── City Power/
│   ├── January-2024-bill.pdf
│   └── February-2024-bill.pdf
└── Insurance Co/
    └── Policy-renewal-2024.pdf
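
The template is configured through the PAPERLESS_FILENAME_FORMAT setting. The sketch below shows how placeholders map onto a path, using plain Python string formatting as a stand-in; Paperless-ngx's own renderer also sanitizes names and handles missing values, which this hypothetical helper does not.

```python
from pathlib import PurePosixPath


def render_archive_path(template, *, created_year, correspondent, title,
                        extension=".pdf"):
    """Fill the template placeholders and return the relative archive path."""
    relative = template.format(
        created_year=created_year,
        correspondent=correspondent,
        title=title,
    )
    # The file extension is appended after the template is rendered.
    return PurePosixPath(relative + extension)


if __name__ == "__main__":
    path = render_archive_path(
        "{created_year}/{correspondent}/{title}",
        created_year=2024,
        correspondent="Amazon",
        title="Order-123-receipt",
    )
    print(path)  # 2024/Amazon/Order-123-receipt.pdf
```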

API

# Upload document (repeat the tags field once per tag ID)
curl -X POST http://localhost:8000/api/documents/post_document/ \
  -H "Authorization: Token YOUR_TOKEN" \
  -F "document=@invoice.pdf" \
  -F "tags=1" \
  -F "tags=2" \
  -F "correspondent=3"

# Search documents
curl "http://localhost:8000/api/documents/?query=invoice+2024" \
  -H "Authorization: Token YOUR_TOKEN"
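
The search call can also be made from a script. This is a minimal standard-library sketch, assuming the response is a paginated JSON object with a `results` list whose items have a `title` field; the helper names are hypothetical, and the token and host are the same placeholders as in the curl commands.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_search_request(base_url, token, query):
    """Create an authenticated GET request for a full-text search."""
    url = f"{base_url.rstrip('/')}/api/documents/?{urlencode({'query': query})}"
    headers = {
        "Authorization": f"Token {token}",
        "Accept": "application/json",
    }
    return Request(url, headers=headers)


def extract_titles(response_body):
    """Pull document titles out of a JSON response body."""
    payload = json.loads(response_body)
    # Assumed shape: {"count": ..., "results": [{"title": ...}, ...]}
    return [doc["title"] for doc in payload.get("results", [])]


def search_titles(base_url, token, query):
    """Send the request and return the titles of matching documents."""
    with urlopen(build_search_request(base_url, token, query)) as response:
        return extract_titles(response.read().decode("utf-8"))
```

Only `search_titles` touches the network; the other two functions can be exercised without a running server.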

Paperless-ngx vs Alternatives

Feature	Paperless-ngx	Docspell	Mayan EDMS	Teedy
Open Source	Yes (GPL-3.0)	Yes (AGPL)	Yes (Apache)	Yes (GPL)
GitHub Stars	38K	1.5K	500	2K
OCR	Tesseract (100+ lang)	Tesseract	Tesseract	Tesseract
Auto-tagging	ML-based	Rule-based	Manual	Tags
Email intake	Yes	Yes	Yes	No
Mobile app	Community apps	No	No	No
Full-text search	Yes (Whoosh)	Yes (Solr)	Yes	Yes

FAQ

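Q: Does it support Chinese OCR? A: Yes. Set PAPERLESS_OCR_LANGUAGE=eng+chi_sim to recognize both English and Simplified Chinese. With the Docker image, also set PAPERLESS_OCR_LANGUAGES=chi-sim so the Simplified Chinese language data is installed. OCR quality depends on scan quality; 300 DPI or higher works best.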

Q: How much storage do I need? A: It depends on document count and size. Each document generates the original file + archived PDF + thumbnail, roughly 1.5–2× the original size. About 1,000 typical documents need 2–5 GB of storage.

Q: Can multiple users share it? A: Yes. It supports multiple users, each with distinct permissions and document visibility. Administrators can set global tags and document types.
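
The storage answer above reduces to simple arithmetic: document count times average original size times the 1.5–2× overhead. The sketch below works that out; the average sizes used in the comments (2.0 and 2.5 MB per document) are my assumptions, picked because they land inside the 2–5 GB range quoted for 1,000 documents.

```python
def estimate_storage_gb(doc_count, avg_original_mb, overhead_factor):
    """Estimate total storage in GB for originals, archives and thumbnails."""
    total_mb = doc_count * avg_original_mb * overhead_factor
    return total_mb / 1024


# 1,000 documents at 2.0 MB each with 1.5x overhead: about 2.9 GB
low = estimate_storage_gb(1000, 2.0, 1.5)

# 1,000 documents at 2.5 MB each with 2.0x overhead: about 4.9 GB
high = estimate_storage_gb(1000, 2.5, 2.0)
```

Database and search index size are not included, so treat the result as a floor when sizing a disk.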

Sources & Credits

GitHub: paperless-ngx/paperless-ngx — 38K+ ⭐ | GPL-3.0
Website: docs.paperless-ngx.com