What Paperless-ngx Does
- Document Ingestion: Drop PDFs, images, or Office docs into a folder — Paperless processes them automatically
- OCR: Tesseract-powered OCR extracts text from scanned documents and images (100+ languages)
- Full-Text Search: Search across all document content, not just filenames
- Auto-Tagging: Machine learning-powered automatic classification with tags, document types, and correspondents
- Email Consumption: Automatically import documents from email attachments
- Scanner Integration: Works with any scanner that can output to a folder or email
- Mobile Scanning: Upload from phone camera or scanning apps
- File Naming: Automatic, template-based file renaming and organization on disk
- Multi-user: Role-based access with per-user document visibility
Architecture
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│    Web UI    │────▶│  Paperless   │────▶│  PostgreSQL  │
│  (Angular)   │     │    Server    │     │   + Redis    │
└──────────────┘     │   (Django)   │     └──────────────┘
                     └──────┬───────┘
                            │
         ┌──────────────────┼──────────────────┐
         │                  │                  │
  ┌──────┴──┐         ┌─────┴───┐       ┌──────┴──┐
  │ Consume │         │Tesseract│       │Gotenberg│
  │ Folder  │         │  (OCR)  │       │(Convert)│
  └─────────┘         └─────────┘       └─────────┘
Self-Hosting
Docker Compose (Recommended)
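The compose file in this section ships with a placeholder for PAPERLESS_SECRET_KEY. One way to produce a real value is sketched here with Python's standard `secrets` module. The helper name `generate_secret_key` and the 48-byte length are illustrative choices, not Paperless-ngx requirements.

```python
import secrets


def generate_secret_key(num_bytes: int = 48) -> str:
    """Return a URL-safe random string built from num_bytes of randomness."""
    return secrets.token_urlsafe(num_bytes)


# Paste the printed line into the environment block of the compose file.
print(f"PAPERLESS_SECRET_KEY: {generate_secret_key()}")
```

Generate the key once and keep it stable afterwards.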
services:
  paperless-webserver:
    image: ghcr.io/paperless-ngx/paperless-ngx:latest
    ports:
      - "8000:8000"
    environment:
      PAPERLESS_REDIS: redis://redis:6379
      PAPERLESS_DBHOST: db
      PAPERLESS_DBNAME: paperless
      PAPERLESS_DBUSER: paperless
      PAPERLESS_DBPASS: paperless  # change for anything beyond local testing
      PAPERLESS_SECRET_KEY: your-very-long-secret-key  # replace with a long random string
      # chi_sim is not bundled with the image; PAPERLESS_OCR_LANGUAGES installs it
      PAPERLESS_OCR_LANGUAGES: chi-sim
      PAPERLESS_OCR_LANGUAGE: eng+chi_sim
      PAPERLESS_TIME_ZONE: Asia/Shanghai
      # Enable Office document support through the tika and gotenberg services
      PAPERLESS_TIKA_ENABLED: 1
      PAPERLESS_TIKA_ENDPOINT: http://tika:9998
      PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000
    volumes:
      - data:/usr/src/paperless/data
      - media:/usr/src/paperless/media
      # Use a bind mount (e.g. ./consume) here to drop files in from the host
      - consume:/usr/src/paperless/consume
      - export:/usr/src/paperless/export
    depends_on:
      - db
      - redis
      - gotenberg
      - tika
  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_USER: paperless
      POSTGRES_PASSWORD: paperless
      POSTGRES_DB: paperless
    volumes:
      - pgdata:/var/lib/postgresql/data
  redis:
    image: redis:7-alpine
  gotenberg:
    image: gotenberg/gotenberg:8
    command: gotenberg --chromium-disable-javascript=true
  tika:
    image: apache/tika:latest
volumes:
  data:
  media:
  pgdata:
  consume:
  export:
Workflow
1. Ingest Documents
Methods:
├── Drop files into /consume folder
├── Upload via web UI (drag & drop)
├── Email (IMAP polling)
├── Mobile app (scan & upload)
└── API upload
2. Automatic Processing
Document dropped
→ Detect file type (PDF, image, Office, etc.)
→ Convert Office documents to PDF (Gotenberg, with Tika for text extraction)
→ OCR text extraction and PDF/A archive creation (Tesseract)
→ Full-text indexing
→ ML-based auto-tagging
→ Auto-assign correspondent & document type
→ Rename and archive file
→ Thumbnail generation
3. Search & Organize
Search: "invoice 2024 electricity"
→ Results with highlighted matching text
→ Filter by date range, tags, correspondent
→ Sort by relevance, date, or title
Key Features
Auto-Classification
Paperless learns from your manual tagging and starts auto-classifying:
After you tag 10+ electricity bills:
→ New electricity bills auto-tagged as "Bills" + "Electricity"
→ Correspondent auto-set to "Power Company"
→ Document type auto-set to "Invoice"
File Naming Templates
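A template such as `{created_year}/{correspondent}/{title}` reads much like a Python format string, so its expansion can be sketched with plain `str.format`. This is only an illustration, not Paperless-ngx's own renderer: the `render_path` helper, the separator-replacement step, and the fixed `.pdf` suffix are all assumptions.

```python
import re

TEMPLATE = "{created_year}/{correspondent}/{title}"


def render_path(template: str, **fields: str) -> str:
    """Fill the placeholders, replacing path separators inside values."""
    # Hypothetical sanitising step: stop a "/" in a title from creating a folder.
    safe = {key: re.sub(r"[/\\]", "-", str(value)) for key, value in fields.items()}
    return template.format(**safe) + ".pdf"


path = render_path(
    TEMPLATE,
    created_year="2024",
    correspondent="City Power",
    title="January-2024-bill",
)
print(path)  # 2024/City Power/January-2024-bill.pdf
```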
# Template: {created_year}/{correspondent}/{title}
# Result:
2024/
├── Amazon/
│ ├── Order-123-receipt.pdf
│ └── Order-456-receipt.pdf
├── City Power/
│ ├── January-2024-bill.pdf
│ └── February-2024-bill.pdf
└── Insurance Co/
    └── Policy-renewal-2024.pdf
API
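The search call in this section can also be made from a script. The sketch below uses only the Python standard library. It assumes the same host, port, and placeholder token as the curl examples, and that the response is a paginated JSON object whose `results` entries carry a `title` field. The helper names are illustrative.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed: host and port from the compose example
TOKEN = "YOUR_TOKEN"                # placeholder, as in the curl examples


def build_search_request(query: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated full-text search request."""
    params = urllib.parse.urlencode({"query": query})  # spaces become "+"
    return urllib.request.Request(
        f"{BASE_URL}/api/documents/?{params}",
        headers={"Authorization": f"Token {TOKEN}"},
    )


def search_titles(query: str) -> list[str]:
    """Send the request and return the titles on the first page of results."""
    with urllib.request.urlopen(build_search_request(query)) as response:
        payload = json.load(response)
    # Assumption: results are paginated under a "results" key.
    return [doc["title"] for doc in payload["results"]]


request = build_search_request("invoice 2024")
print(request.full_url)  # http://localhost:8000/api/documents/?query=invoice+2024
```

Calling `search_titles("invoice 2024")` requires a running instance and a valid token; building the request does not.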
# Upload a document (repeat the tags field once per tag ID)
curl -X POST http://localhost:8000/api/documents/post_document/ \
  -H "Authorization: Token YOUR_TOKEN" \
  -F "document=@invoice.pdf" \
  -F "tags=1" \
  -F "tags=2" \
  -F "correspondent=3"
# Search documents
curl "http://localhost:8000/api/documents/?query=invoice+2024" \
  -H "Authorization: Token YOUR_TOKEN"
Paperless-ngx vs Alternatives
| Feature | Paperless-ngx | Docspell | Mayan EDMS | Teedy |
|---|---|---|---|---|
| Open Source | Yes (GPL-3.0) | Yes (AGPL) | Yes (Apache) | Yes (GPL) |
| GitHub Stars | 38K | 1.5K | 500 | 2K |
| OCR | Tesseract (100+ lang) | Tesseract | Tesseract | Tesseract |
| Auto-tagging | ML-based | Rule-based | Manual | Tags |
| Email intake | Yes | Yes | Yes | No |
| Mobile app | Community apps | No | No | No |
| Full-text search | Yes (Whoosh) | Yes (Solr) | Yes | Yes |
FAQ
Q: Does it support Chinese OCR?
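A: Yes. Set PAPERLESS_OCR_LANGUAGE=eng+chi_sim to recognize both English and Simplified Chinese. The Docker image does not bundle Chinese language data by default, so also set PAPERLESS_OCR_LANGUAGES=chi-sim to install it. OCR quality depends on scan quality; 300 DPI or higher works best.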
Q: How much storage do I need?
A: It depends on document count and size. Each document generates the original file + archived PDF + thumbnail, roughly 1.5–2× the original size. About 1,000 typical documents need 2–5 GB of storage.
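The rule of thumb in this answer can be turned into a quick estimate. The sketch below assumes an average original size of 2 MB per document, a hypothetical figure chosen for illustration, and applies the 1.5–2× overhead range.

```python
def estimate_storage_gb(doc_count: int, avg_original_mb: float, overhead: float) -> float:
    """Estimate total storage in GB (1 GB = 1000 MB)."""
    return doc_count * avg_original_mb * overhead / 1000


# 1,000 documents at an assumed 2 MB each, with 1.5x and 2x overhead.
low = estimate_storage_gb(1000, avg_original_mb=2.0, overhead=1.5)
high = estimate_storage_gb(1000, avg_original_mb=2.0, overhead=2.0)
print(f"{low:.1f}-{high:.1f} GB")  # 3.0-4.0 GB
```

Substitute the average size of your own scans to get a figure for your archive.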
Q: Can multiple users share it?
A: Yes. It supports multiple users, each with distinct permissions and document visibility. Administrators can set global tags and document types.
Sources & Credits
- GitHub: paperless-ngx/paperless-ngx — 38K+ ⭐ | GPL-3.0
- Website: docs.paperless-ngx.com