Paperless-ngx — Self-Hosted Document Management with OCR
Paperless-ngx is an open-source document management system that scans, OCRs, indexes, and archives all your physical and digital documents for full-text search.
Safe staging for this asset
This asset is staged first. The copied prompt asks the agent to inspect the staged files before activating scripts, MCP config, or global config.
npx -y tokrepo@latest install de0041a5-34b7-11f1-9bc6-00163e2b0d79 --target codex
Stages the files first; activation requires reviewing the README and the staged plan.
What it is
Paperless-ngx is an open-source document management system that digitizes, OCRs, indexes, and archives both physical and digital documents. Drop a scanned document or PDF into the system, and it automatically extracts text via OCR, applies tags, assigns correspondents, and makes everything searchable. It runs self-hosted via Docker.
This tool is for individuals and organizations who want to go paperless without relying on cloud document services. It is also useful for developers building document processing pipelines.
How it saves time or tokens
Paperless-ngx automates the tedious process of organizing documents. The OCR engine extracts text from scanned images, making them searchable without manual data entry. Machine learning-based auto-tagging learns your categorization patterns and applies them to new documents. For AI workflows, the full-text search API provides document retrieval that can feed into RAG systems.
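As a rough illustration of the RAG hand-off described above, the sketch below turns a search response into a prompt-ready context string. It assumes the response follows the `results` list shape used in this page's API example and that each result carries `title` and `content` fields; `build_context` and the character budget are hypothetical names chosen for this sketch, not part of Paperless-ngx.

```python
def build_context(results, max_chars=2000):
    """Join document snippets into one context string within a character budget."""
    parts = []
    used = 0
    for doc in results:
        remaining = max_chars - used
        if remaining <= 0:
            break
        title = doc.get("title", "untitled")
        # Collapse OCR line breaks and repeated spaces into single spaces
        text = " ".join((doc.get("content") or "").split())
        entry = f"[{title}] {text}"[:remaining]
        parts.append(entry)
        used += len(entry) + 1  # +1 accounts for the newline separator
    return "\n".join(parts)


# Sample data standing in for response.json()["results"] (assumed field names)
sample = [
    {"title": "Invoice 42", "content": "Total due:\n  120 EUR"},
    {"title": "Lease", "content": "Rent is payable monthly."},
]
print(build_context(sample))
```

Capping the context by characters is a simplification; a real pipeline would more likely budget by tokens and rank snippets by relevance.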
How to use
- Deploy Paperless-ngx via Docker Compose.
- Configure consumption directories for document ingestion.
- Drop documents into the consumption folder.
- Search and manage documents through the web UI.
# Clone the repository and enter the Compose directory
git clone https://github.com/paperless-ngx/paperless-ngx.git
cd paperless-ngx/docker/compose
# Pick one of the bundled docker-compose.*.yml variants (SQLite shown here)
cp docker-compose.sqlite.yml docker-compose.yml
# Review the settings in docker-compose.env before the first start
# Start the stack
docker compose up -d
# Create admin user
docker compose exec webserver python3 manage.py createsuperuser
# Access at http://localhost:8000
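Once the stack is running, ingestion through the consumption folder amounts to a file copy. A minimal Python sketch of that step, assuming the consumption directory is mounted on the host at a path you control; the `consume` path and the `drop_into_consume` helper are hypothetical and not part of Paperless-ngx.

```python
import shutil
import tempfile
from pathlib import Path


def drop_into_consume(source, consume_dir):
    """Copy a document into the consumption folder without overwriting files."""
    source = Path(source)
    consume_dir = Path(consume_dir)
    consume_dir.mkdir(parents=True, exist_ok=True)

    target = consume_dir / source.name
    counter = 1
    while target.exists():
        # A file with this name is still waiting; add a numeric suffix
        target = consume_dir / f"{source.stem}_{counter}{source.suffix}"
        counter += 1

    shutil.copy2(source, target)
    return target


# Demonstration with temporary directories instead of a real mount
with tempfile.TemporaryDirectory() as tmp:
    scan = Path(tmp) / "scan.pdf"
    scan.write_bytes(b"%PDF-1.4 placeholder")
    consume = Path(tmp) / "consume"
    first = drop_into_consume(scan, consume)
    second = drop_into_consume(scan, consume)
    print(first.name, second.name)
```

The suffixing only avoids clobbering a file that has not been picked up yet; it does not detect duplicates by content.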
Example
API usage for document search:
import requests

api_url = 'http://localhost:8000/api'
headers = {'Authorization': 'Token your-api-token'}

# Search documents with a full-text query
response = requests.get(
    f'{api_url}/documents/',
    headers=headers,
    params={'query': 'invoice 2026'},
    timeout=10,
)
response.raise_for_status()  # fail early on authentication or server errors

for doc in response.json()['results']:
    print(f"{doc['title']} - {doc['created_date']}")
    print(f"Tags: {doc['tags']}")
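The loop above reads only the first page of results. Assuming the API paginates in the usual Django REST Framework style, with a `next` URL alongside `results` (a field name to verify against your instance), the sketch below follows those links. The `fetch` callable is a hypothetical injection point so the logic can be exercised without a running server.

```python
def iter_documents(fetch, first_url):
    """Yield every document across pages by following 'next' links."""
    url = first_url
    while url:
        page = fetch(url)                 # expected to return the decoded JSON dict
        yield from page.get("results", [])
        url = page.get("next")            # None on the last page ends the loop


# Fake two-page API used only for demonstration
start = "http://localhost:8000/api/documents/?query=invoice"
page_two = "http://localhost:8000/api/documents/?page=2&query=invoice"
fake_pages = {
    start: {
        "results": [{"title": "Invoice A"}, {"title": "Invoice B"}],
        "next": page_two,
    },
    page_two: {"results": [{"title": "Invoice C"}], "next": None},
}

titles = [doc["title"] for doc in iter_documents(fake_pages.__getitem__, start)]
print(titles)
```

With `requests`, `fetch` could be something like `lambda url: requests.get(url, headers=headers, timeout=10).json()`.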
Related on TokRepo
- Self-hosted solutions — More self-hosted tools
- Document tools — Document processing and management
Common pitfalls
- OCR quality depends on scan quality. Ensure documents are scanned at 300+ DPI for reliable text extraction.
- The initial setup requires Docker and basic Docker Compose knowledge. The configuration file has many options.
- Storage grows with your document collection. Plan disk space for both originals and generated thumbnails.
- Auto-tagging requires training data. The ML classifier needs at least 10 documents per tag before it starts making useful suggestions.
- Paperless-ngx does not handle encrypted PDFs. Decrypt them before ingestion.
- Review the official documentation before deploying to production to ensure compatibility with your specific environment and requirements.
- Start with default settings and customize incrementally. Changing too many configuration options at once makes debugging harder.
- Keep your installation updated to the latest stable version. Security patches and bug fixes are released regularly.
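The 300 DPI guideline in the first pitfall can be checked before ingestion with simple arithmetic: effective DPI is the pixel width divided by the physical page width in inches. A small sketch, assuming A4 paper (210 mm wide) unless told otherwise; `effective_dpi` and `meets_ocr_minimum` are hypothetical helper names, and the 1% tolerance is an arbitrary choice.

```python
MM_PER_INCH = 25.4


def effective_dpi(pixel_width, page_width_mm=210.0):
    """Return the horizontal DPI of a scan from its pixel width and paper width."""
    return pixel_width / (page_width_mm / MM_PER_INCH)


def meets_ocr_minimum(pixel_width, page_width_mm=210.0, minimum_dpi=300):
    """True if the scan resolution reaches the recommended minimum."""
    # Scanners round pixel counts, so allow 1% slack below the target
    return effective_dpi(pixel_width, page_width_mm) >= minimum_dpi * 0.99


# An A4 page at 300 DPI is roughly 2480 pixels wide
print(round(effective_dpi(2480)))
# Half that width corresponds to roughly 150 DPI
print(meets_ocr_minimum(1240))
```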
Frequently asked questions
Which OCR engine does Paperless-ngx use?
Paperless-ngx uses Tesseract OCR by default, which supports over 100 languages. It also supports alternative OCR backends. The OCR runs automatically on ingested documents.
Can Paperless-ngx run on a Raspberry Pi?
Yes. Paperless-ngx has ARM Docker images that run on Raspberry Pi 4 and later. Performance will be slower than on a full server, especially for OCR processing, but it works for personal document management.
Is there a mobile app?
The web UI is responsive and works on mobile browsers. There are also community-maintained mobile apps for Android and iOS that connect to your Paperless-ngx instance.
How does auto-tagging work?
Paperless-ngx uses a machine learning classifier that learns from your tagging patterns. After you manually tag enough documents, it suggests tags for new documents. The more you use it, the more accurate its suggestions become.
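To see which tags still lack enough manually tagged examples, you can count documents per tag from API results. A sketch, assuming each document's `tags` field is a list of tag IDs and using the 10-document rule of thumb from the pitfalls list as a tunable threshold; `undertrained_tags` is a hypothetical helper, not a Paperless-ngx function.

```python
from collections import Counter


def undertrained_tags(documents, minimum=10):
    """Return {tag_id: count} for tags seen on fewer than `minimum` documents."""
    counts = Counter(tag for doc in documents for tag in doc.get("tags", []))
    return {tag: n for tag, n in counts.items() if n < minimum}


# Sample data standing in for documents fetched from the API
docs = [{"tags": [1, 2]}] * 10 + [{"tags": [2, 3]}] * 4
print(undertrained_tags(docs))
```

Tags that appear on no document at all never show up in the counts, so compare the result against your full tag list if you need those as well.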
Can I import existing digital documents?
Yes. Drop PDF, PNG, JPG, TIFF, and other document formats into the consumption directory. Paperless-ngx processes them automatically. You can also use the web UI or API to upload documents directly.