Paperless-ngx — Self-Hosted Document Management with OCR
Paperless-ngx is an open-source document management system that scans, OCRs, indexes, and archives all your physical and digital documents for full-text search.
What it is
Paperless-ngx is an open-source document management system that digitizes, OCRs, indexes, and archives both physical and digital documents. Drop a scanned document or PDF into the system, and it automatically extracts text via OCR, applies tags, assigns correspondents, and makes everything searchable. It runs self-hosted via Docker.
This tool is for individuals and organizations who want to go paperless without relying on cloud document services. It is also useful for developers building document processing pipelines.
How it saves time or tokens
Paperless-ngx automates the tedious process of organizing documents. The OCR engine extracts text from scanned images, making them searchable without manual data entry. Machine learning-based auto-tagging learns your categorization patterns and applies them to new documents. For AI workflows, the full-text search API provides document retrieval that can feed into RAG systems.
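As a rough illustration of the RAG idea above, the sketch below packs search results into a single context string for a prompt. It assumes each result is a dict carrying `id`, `title`, and `content` (the extracted text), as in the JSON returned by the documents endpoint. The function name `build_context` and the character budget are hypothetical choices, not part of Paperless-ngx.

```python
def build_context(results, max_chars=2000):
    """Concatenate document snippets into one string under a character budget.

    The budget counts headers and snippets; separators between entries are not counted.
    """
    parts = []
    used = 0
    for doc in results:
        header = f"[{doc['id']}] {doc['title']}\n"
        remaining = max_chars - used - len(header)
        if remaining <= 0:
            break
        # Collapse whitespace: OCR output often contains stray line breaks.
        text = " ".join(doc.get("content", "").split())
        snippet = text[:remaining]
        parts.append(header + snippet)
        used += len(header) + len(snippet)
    return "\n\n".join(parts)
```

Keeping the document ID in each header lets a downstream answer point back to the source document in the web UI.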
How to use
- Deploy Paperless-ngx via Docker Compose.
- Configure consumption directories for document ingestion.
- Drop documents into the consumption folder.
- Search and manage documents through the web UI.
# Clone and start with Docker Compose
git clone https://github.com/paperless-ngx/paperless-ngx.git
cd paperless-ngx/docker/compose
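# Choose a compose file (PostgreSQL variant shown) and save it as docker-compose.yml
cp docker-compose.postgres.yml docker-compose.yml
# Review the settings in docker-compose.env before starting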
# Start the stack
docker compose up -d
# Create admin user
docker compose exec webserver python3 manage.py createsuperuser
# Access at http://localhost:8000
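The "drop documents into the consumption folder" step can also be scripted. The following is a minimal sketch, assuming the consumption directory is a host folder mounted into the container (for example `./consume` next to the compose file; check your own volume mapping). The helper name `drop_into_consume` and the extension list are illustrative assumptions, and the consumer accepts more formats than the subset listed here.

```python
import shutil
from pathlib import Path

# Assumed subset of accepted formats; see the official docs for the full list.
SUPPORTED = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def drop_into_consume(source_dir, consume_dir):
    """Copy supported files from source_dir into consume_dir; return the copied names."""
    source, target = Path(source_dir), Path(consume_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(source.iterdir()):
        if path.is_file() and path.suffix.lower() in SUPPORTED:
            shutil.copy2(path, target / path.name)
            copied.append(path.name)
    return copied
```

Calling `drop_into_consume("scans", "consume")` copies the files and leaves the originals in place, so nothing is lost if ingestion fails.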
Example
API usage for document search:
import requests

api_url = 'http://localhost:8000/api'
headers = {'Authorization': 'Token your-api-token'}

# Full-text search across all documents
response = requests.get(
    f'{api_url}/documents/',
    headers=headers,
    params={'query': 'invoice 2026'}
)
response.raise_for_status()  # stop early on auth or server errors

for doc in response.json()['results']:
    print(f"{doc['title']} - {doc['created_date']}")
    # Tags are returned as numeric IDs, not names
    print(f"Tags: {doc['tags']}")
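The example above reads a single page of results. The documents endpoint returns paginated JSON with a `next` URL, so a small generator can walk every page. This is a sketch rather than an official client: `iter_documents` and the injected `fetch_page` callable are hypothetical names, chosen so the paging logic can be exercised without a running server.

```python
def iter_documents(fetch_page, first_url):
    """Yield every document across pages by following each page's `next` link.

    fetch_page is any callable that takes a URL and returns the decoded JSON page.
    """
    url = first_url
    while url:
        page = fetch_page(url)
        yield from page["results"]
        url = page.get("next")  # None on the last page ends the loop
```

With the `requests` setup from the example, `fetch_page` could be `lambda url: requests.get(url, headers=headers).json()`.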
Related on TokRepo
- Self-hosted solutions — More self-hosted tools
- Document tools — Document processing and management
Common pitfalls
- OCR quality depends on scan quality. Ensure documents are scanned at 300+ DPI for reliable text extraction.
- The initial setup requires Docker and basic Docker Compose knowledge. The configuration file has many options.
- Storage grows with your document collection. Plan disk space for both originals and generated thumbnails.
- Auto-tagging requires training data. The ML classifier needs at least 10 documents per tag before it starts making useful suggestions.
- Paperless-ngx does not handle encrypted PDFs. Decrypt them before ingestion.
- Review the official documentation before deploying to production to ensure compatibility with your specific environment and requirements.
- Start with default settings and customize incrementally. Changing too many configuration options at once makes debugging harder.
- Keep your installation updated to the latest stable version. Security patches and bug fixes are released regularly.
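For the scan-quality pitfall, the resolution of an image can be estimated from its pixel size before ingestion. The sketch below assumes the image covers a full A4 page (210 x 297 mm). The function names and the rounding to the nearest whole DPI are illustrative choices, not something Paperless-ngx does itself.

```python
A4_MM = (210, 297)
MM_PER_INCH = 25.4


def effective_dpi(width_px, height_px, page_mm=A4_MM):
    """Estimate scan resolution, assuming the image spans the whole page."""
    # Orientation-agnostic: match short side to short side, long to long.
    short_px, long_px = sorted((width_px, height_px))
    short_mm, long_mm = sorted(page_mm)
    dpi = min(short_px * MM_PER_INCH / short_mm, long_px * MM_PER_INCH / long_mm)
    # Scanners round pixel counts, so round to the nearest whole DPI.
    return round(dpi)


def meets_ocr_guideline(width_px, height_px, min_dpi=300):
    """Return True when the estimated resolution reaches the 300 DPI guideline."""
    return effective_dpi(width_px, height_px) >= min_dpi
```

A 2480 x 3508 pixel A4 scan passes this check, while a 1240 x 1754 pixel scan (about 150 DPI) does not.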
Frequently Asked Questions
What OCR engine does Paperless-ngx use?
Paperless-ngx uses Tesseract OCR by default, which supports over 100 languages. It also supports alternative OCR backends. The OCR runs automatically on ingested documents.
Can Paperless-ngx run on a Raspberry Pi?
Yes. Paperless-ngx has ARM Docker images that run on Raspberry Pi 4 and later. Performance will be slower than on a full server, especially for OCR processing, but it works for personal document management.
Is there a mobile app?
The web UI is responsive and works on mobile browsers. There are also community-maintained mobile apps for Android and iOS that connect to your Paperless-ngx instance.
How does auto-tagging work?
Paperless-ngx uses a machine learning classifier that learns from your tagging patterns. After you manually tag enough documents, it suggests tags for new documents. The more you use it, the more accurate the suggestions become.
Can I import existing digital documents?
Yes. Drop PDF, PNG, JPG, TIFF, and other document formats into the consumption directory. Paperless-ngx processes them automatically. You can also use the web UI or API to upload documents directly.
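Uploading through the API, as the last answer mentions, might look like the sketch below. It posts a multipart form to the `post_document` endpoint with the file in the `document` field. The helper name `upload_document` is hypothetical, and `session` is assumed to be a `requests.Session` or any object with a compatible `post` method.

```python
from pathlib import Path


def upload_document(session, base_url, token, path, title=None):
    """POST one file to the upload endpoint using a requests-style session."""
    path = Path(path)
    data = {"title": title} if title else {}
    with path.open("rb") as handle:
        return session.post(
            f"{base_url.rstrip('/')}/api/documents/post_document/",
            headers={"Authorization": f"Token {token}"},
            files={"document": (path.name, handle)},
            data=data,
        )
```

In practice this would be called as `upload_document(requests.Session(), 'http://localhost:8000', 'your-api-token', 'scan.pdf')`. The upload is queued for consumption, so the document appears in search only after processing finishes.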