# Paperless-ngx — Self-Hosted Document Management with OCR

> Paperless-ngx is an open-source document management system that scans, OCRs, indexes, and archives all your physical and digital documents for full-text search.

## Quick Use

Save the following as a shell script and run it:

```bash
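# Paperless-ngx needs a Redis broker, so start one on a shared network first
docker network create paperless-net
docker run -d --name paperless-redis --network paperless-net redis:7-alpine

docker run -d --name paperless --network paperless-net \
  -p 8000:8000 \
  -v paperless-data:/usr/src/paperless/data \
  -v paperless-media:/usr/src/paperless/media \
  -v paperless-consume:/usr/src/paperless/consume \
  -e PAPERLESS_REDIS=redis://paperless-redis:6379 \
  -e PAPERLESS_SECRET_KEY=your-secret-key \
  ghcr.io/paperless-ngx/paperless-ngx:latest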
```

Open `http://localhost:8000`, create an admin account, then upload documents in the web UI or drop them into the consume folder.

## Intro

**Paperless-ngx** is a community-supported document management system that transforms your physical and digital documents into a searchable online archive. It automatically OCRs, indexes, tags, and categorizes every document you feed it — making decades of paperwork instantly searchable.

With 38K+ GitHub stars and a GPL-3.0 license, Paperless-ngx is one of the most popular self-hosted document management systems. Because it runs on hardware you control, your documents stay private and remain yours.

## What Paperless-ngx Does

- **Document Ingestion**: Drop PDFs, images, or Office docs into a folder — Paperless processes them automatically
- **OCR**: Tesseract-powered OCR extracts text from scanned documents and images (100+ languages)
- **Full-Text Search**: Search across all document content, not just filenames
- **Auto-Tagging**: Machine learning-powered automatic classification with tags, document types, and correspondents
- **Email Consumption**: Automatically import documents from email attachments
- **Scanner Integration**: Works with any scanner that can output to a folder or email
- **Mobile Scanning**: Upload from phone camera or scanning apps
- **File Naming**: Automatic, template-based file renaming and organization on disk
- **Multi-user**: Role-based access with per-user document visibility

## Architecture

```
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Web UI      │────▶│  Paperless   │────▶│  PostgreSQL  │
│  (Angular)   │     │  Server      │     │  + Redis     │
└──────────────┘     │  (Django)    │     └──────────────┘
                     └──────┬───────┘
                            │
         ┌──────────────────┼──────────────────┐
         │                  │                  │
  ┌──────┴──┐          ┌────┴────┐        ┌────┴────┐
  │Consume  │          │Tesseract│        │Gotenberg│
  │ Folder  │          │  (OCR)  │        │(Convert)│
  └─────────┘          └─────────┘        └─────────┘
```

## Self-Hosting

### Docker Compose (Recommended)

```yaml
services:
  paperless-webserver:
    image: ghcr.io/paperless-ngx/paperless-ngx:latest
    ports:
      - "8000:8000"
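    environment:
      PAPERLESS_REDIS: redis://redis:6379
      PAPERLESS_DBHOST: db
      PAPERLESS_DBNAME: paperless
      PAPERLESS_DBUSER: paperless
      PAPERLESS_DBPASS: paperless
      PAPERLESS_SECRET_KEY: your-very-long-secret-key
      # chi-sim is not bundled in the image: install it, then enable it for OCR
      PAPERLESS_OCR_LANGUAGES: chi-sim
      PAPERLESS_OCR_LANGUAGE: eng+chi_sim
      PAPERLESS_TIME_ZONE: Asia/Shanghai
      # Route Office documents through the tika and gotenberg services
      PAPERLESS_TIKA_ENABLED: 1
      PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000
      PAPERLESS_TIKA_ENDPOINT: http://tika:9998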
    volumes:
      - data:/usr/src/paperless/data
      - media:/usr/src/paperless/media
      - consume:/usr/src/paperless/consume
      - export:/usr/src/paperless/export
    depends_on:
      - db
      - redis
      - gotenberg
      - tika

  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_USER: paperless
      POSTGRES_PASSWORD: paperless
      POSTGRES_DB: paperless
    volumes:
      - pgdata:/var/lib/postgresql/data

  redis:
    image: redis:7-alpine

  gotenberg:
    image: gotenberg/gotenberg:8
    command: gotenberg --chromium-disable-javascript=true

  tika:
    image: apache/tika:latest

volumes:
  data:
  media:
  pgdata:
  consume:
  export:
```
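The placeholder `your-very-long-secret-key` should be replaced before the first start, and the stack needs an initial admin user. The following is a minimal sketch, assuming the Compose file above is saved as `docker-compose.yml` in the current directory and that `head`, `od`, and `tr` are available. The function names `gen_secret_key` and `start_paperless` are illustrative and not part of Paperless-ngx.

```bash
# Print a random 64-character hex string to use as PAPERLESS_SECRET_KEY.
gen_secret_key() {
  head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
}

# Start the stack in the background, then create the first admin user.
# "paperless-webserver" is the service name from the Compose file above;
# createsuperuser prompts interactively for a username and password.
start_paperless() {
  docker compose up -d &&
    docker compose run --rm paperless-webserver createsuperuser
}
```

Paste the output of `gen_secret_key` into the `PAPERLESS_SECRET_KEY` entry, then call `start_paperless` once.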

## Workflow

### 1. Ingest Documents

```
Methods:
├── Drop files into /consume folder
├── Upload via web UI (drag & drop)
├── Email (IMAP polling)
├── Mobile app (scan & upload)
└── API upload
```
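When the consume folder is a named Docker volume, as in the Compose file above, there is no host directory to drop files into. One possible workaround is to copy files into the running container. This is a sketch under assumptions: the helper names are hypothetical, the extension allow-list is illustrative rather than the full set Paperless-ngx accepts, and it must be run from the directory holding the Compose file.

```bash
# Container path of the consume folder, as mounted in the Compose file.
CONSUME_DIR="/usr/src/paperless/consume"

# Succeed if the file extension is on this sketch's illustrative allow-list.
is_supported() {
  case "${1##*.}" in
    pdf|PDF|png|PNG|jpg|JPG|jpeg|JPEG|tif|TIF|tiff|TIFF) return 0 ;;
    *) return 1 ;;
  esac
}

# Copy one local file into the running container's consume folder.
consume_file() {
  if ! is_supported "$1"; then
    echo "skipping unsupported file: $1" >&2
    return 1
  fi
  docker compose cp "$1" "paperless-webserver:${CONSUME_DIR}/"
}
```

For example, `consume_file invoice.pdf` hands the file to the consumer. Mounting a host directory at the consume path instead of a named volume avoids the copy step entirely.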

### 2. Automatic Processing

```
Document dropped
  → Detect file type (PDF, image, Office, etc.)
  → Convert Office documents to PDF (Tika + Gotenberg)
  → OCR text extraction and PDF/A archive copy (Tesseract via OCRmyPDF)
  → Full-text indexing
  → ML-based auto-tagging
  → Auto-assign correspondent & document type
  → Rename and archive file
  → Thumbnail generation
```

### 3. Search & Organize

```
Search: "invoice 2024 electricity"
  → Results with highlighted matching text
  → Filter by date range, tags, correspondent
  → Sort by relevance, date, or title
```
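The same search can be scripted against the REST API. The sketch below builds the request URL from a free-text query and an optional tag filter. The helper names and the `PAPERLESS_TOKEN` variable are hypothetical, the `tags__id__all` filter parameter is an assumption that should be checked against the API reference for your version, and only spaces are encoded, so queries with other special characters need proper URL encoding.

```bash
# Build a search URL for the documents endpoint.
#   $1 base URL, $2 free-text query, $3 optional tag ID to require
build_search_url() (
  query=$(printf '%s' "$2" | tr ' ' '+')   # encodes spaces only
  url="$1/api/documents/?query=$query"
  if [ -n "${3:-}" ]; then
    url="$url&tags__id__all=$3"
  fi
  printf '%s\n' "$url"
)

# Run the search with token authentication; the response is JSON.
search_documents() {
  curl -s -H "Authorization: Token $PAPERLESS_TOKEN" "$(build_search_url "$@")"
}
```

A call such as `search_documents http://localhost:8000 "invoice 2024 electricity"` then returns the matching documents as JSON.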

## Key Features

### Auto-Classification

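For tags, correspondents, and document types whose matching algorithm is set to Auto, Paperless trains a classifier on the documents you have already labeled and applies it to new ones: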

```
After you tag 10+ electricity bills:
  → New electricity bills auto-tagged as "Bills" + "Electricity"
  → Correspondent auto-set to "Power Company"
  → Document type auto-set to "Invoice"
```
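Auto-classification only applies to tags that opt in to it. As a sketch, a tag can be created through the API with its matching algorithm set to Auto, and the classifier retrained on demand. The assumptions here are that `6` is the numeric code for Auto, that `PAPERLESS_TOKEN` holds a valid API token, and that the tag name contains no characters needing JSON escaping. The helper names are hypothetical.

```bash
# JSON body for a new tag that uses the Auto matching algorithm.
# 6 is assumed to be the API's numeric code for "Auto".
auto_tag_payload() {
  printf '{"name": "%s", "matching_algorithm": 6}\n' "$1"
}

# Create the tag through the REST API.
create_auto_tag() {
  curl -s -X POST "http://localhost:8000/api/tags/" \
    -H "Authorization: Token $PAPERLESS_TOKEN" \
    -H "Content-Type: application/json" \
    -d "$(auto_tag_payload "$1")"
}

# Retrain the classifier now instead of waiting for the scheduled task.
retrain_classifier() {
  docker compose exec paperless-webserver document_create_classifier
}
```

After tagging a batch of documents by hand, `retrain_classifier` makes the new examples take effect immediately.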

### File Naming Templates

```
# Template: {created_year}/{correspondent}/{title}
# Result:
2024/
├── Amazon/
│   ├── Order-123-receipt.pdf
│   └── Order-456-receipt.pdf
├── City Power/
│   ├── January-2024-bill.pdf
│   └── February-2024-bill.pdf
└── Insurance Co/
    └── Policy-renewal-2024.pdf
```
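The template itself goes into the `PAPERLESS_FILENAME_FORMAT` setting. To make the substitution concrete, the sketch below mimics it for the three placeholders used above. It is an illustration only: `render_name` is a hypothetical helper, and Paperless-ngx performs its own expansion and filename sanitizing, which this does not reproduce.

```bash
# Expand {created_year}, {correspondent}, and {title} in a template.
# Values must not contain "|" or "&", which are special to sed here.
render_name() (
  printf '%s\n' "$1" |
    sed -e "s|{created_year}|$2|g" \
        -e "s|{correspondent}|$3|g" \
        -e "s|{title}|$4|g" \
        -e 's|$|.pdf|'
)

render_name '{created_year}/{correspondent}/{title}' 2024 Amazon Order-123-receipt
# prints: 2024/Amazon/Order-123-receipt.pdf
```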

### API

```bash
# Upload a document (repeat the tags field once per tag ID)
curl -X POST http://localhost:8000/api/documents/post_document/ \
  -H "Authorization: Token YOUR_TOKEN" \
  -F "document=@invoice.pdf" \
  -F "tags=1" \
  -F "tags=2" \
  -F "correspondent=3"

# Search documents
curl "http://localhost:8000/api/documents/?query=invoice+2024" \
  -H "Authorization: Token YOUR_TOKEN"
```
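The `YOUR_TOKEN` value can be created in the web UI or requested from the token endpoint. The following is a rough sketch, assuming the server is reachable at `http://localhost:8000` and that the response is a small JSON object with a `token` field. `extract_token` and `get_token` are hypothetical helpers, and a real script would more likely parse the JSON with a tool such as `jq`.

```bash
# Pull the value of "token" out of a small JSON response body.
extract_token() {
  printf '%s' "$1" |
    sed -n 's/.*"token"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Request a token with username ($1) and password ($2), then print it.
get_token() {
  response=$(curl -s -X POST "http://localhost:8000/api/token/" \
    -d "username=$1" -d "password=$2")
  extract_token "$response"
}
```

The printed token goes into the `Authorization: Token ...` header of later requests.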

## Paperless-ngx vs Alternatives

| Feature | Paperless-ngx | Docspell | Mayan EDMS | Teedy |
|---------|--------------|----------|------------|-------|
| Open Source | Yes (GPL-3.0) | Yes (AGPL) | Yes (Apache) | Yes (GPL) |
| GitHub Stars | 38K | 1.5K | 500 | 2K |
| OCR | Tesseract (100+ lang) | Tesseract | Tesseract | Tesseract |
| Auto-tagging | ML-based | Rule-based | Manual | Tags |
| Email intake | Yes | Yes | Yes | No |
| Mobile app | Community apps | No | No | No |
| Full-text search | Yes (Whoosh) | Yes (Solr) | Yes | Yes |

## FAQ

**Q: Does it support Chinese OCR?**
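A: Yes. Set `PAPERLESS_OCR_LANGUAGE=eng+chi_sim` to recognize both English and Simplified Chinese. The Docker image does not ship the Simplified Chinese language pack by default, so also set `PAPERLESS_OCR_LANGUAGES=chi-sim` to install it. OCR quality depends on scan quality; 300 DPI or higher works best.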

**Q: How much storage do I need?**
A: It depends on document count and size. Each document generates the original file + archived PDF + thumbnail, roughly 1.5–2× the original size. About 1,000 typical documents need 2–5 GB of storage.
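
The estimate can be reproduced with integer arithmetic. This is a back-of-the-envelope sketch; the average document sizes used in the note below are illustrative assumptions, not measurements, and `estimate_storage_mb` is a hypothetical helper.

```bash
# Estimate total storage in MB.
#   $1 number of documents
#   $2 average original size in KB
#   $3 overhead factor in percent (150 = 1.5x, 200 = 2x)
estimate_storage_mb() {
  echo $(( $1 * $2 * $3 / 100 / 1024 ))
}
```

With 1,000 documents averaging 1,500 KB at 1.5×, `estimate_storage_mb 1000 1500 150` prints `2197` (about 2.1 GB). With 2,500 KB originals at 2×, `estimate_storage_mb 1000 2500 200` prints `4882` (about 4.8 GB). Those two cases bracket the 2–5 GB range quoted above.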

**Q: Can multiple users share it?**
A: Yes. It supports multiple users, each with distinct permissions and document visibility. Administrators can set global tags and document types.

## Sources & Credits

- GitHub: [paperless-ngx/paperless-ngx](https://github.com/paperless-ngx/paperless-ngx) — 38K+ ⭐ | GPL-3.0
- Website: [docs.paperless-ngx.com](https://docs.paperless-ngx.com)

---
Source: https://tokrepo.com/en/workflows/de0041a5-34b7-11f1-9bc6-00163e2b0d79
Author: Script Depot