# doccano — Open-Source Text Annotation Tool for Machine Learning

> A web-based annotation platform for creating labeled datasets for NLP tasks including text classification, sequence labeling, and sequence-to-sequence problems.

## Quick Use
```bash
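# Install doccano and initialize its database
pip install doccano
doccano init
# Create an admin account (pick a strong password instead of "changeme")
doccano createuser --username admin --password changeme
# Start the web server
doccano webserver --port 8000
# In a second terminal, start the task queue that handles file import and export
doccano task
# Open http://localhost:8000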
```

## Introduction
doccano is a self-hosted annotation tool for building labeled datasets for natural language processing. It supports text classification (sentiment, topic), sequence labeling (NER, POS tagging), and sequence-to-sequence tasks (translation, summarization). Teams can collaborate on annotation projects with built-in user management and inter-annotator agreement metrics.

## What doccano Does
- Annotates text documents for classification, named entity recognition, and seq-to-seq tasks
- Supports multi-label and multi-class annotation with customizable label sets
- Provides keyboard shortcuts for fast annotation workflows
- Imports data from JSON, JSONL, CSV, TSV, and CoNLL formats
- Exports labeled datasets in formats compatible with spaCy, Hugging Face, and other ML frameworks
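
To make the import formats more concrete, here is a minimal sketch that writes sequence-labeling examples as JSONL. It assumes the common layout of one JSON object per line with a `text` field and a `label` field holding `[start, end, label]` character spans; the field names and the helper `to_doccano_jsonl` are illustrative, so check them against the import dialog of your doccano version.

```python
import json


def to_doccano_jsonl(examples):
    """Serialize (text, spans) pairs as one JSON object per line.

    Assumed layout: {"text": ..., "label": [[start, end, label], ...]}
    with character offsets. Verify the field names for your project.
    """
    lines = []
    for text, spans in examples:
        for start, end, _ in spans:
            # Reject spans that fall outside the text or are empty.
            if not 0 <= start < end <= len(text):
                raise ValueError(f"span ({start}, {end}) is outside the text")
        record = {"text": text, "label": [list(span) for span in spans]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines) + "\n"


examples = [
    ("Alice moved to Paris.", [(0, 5, "PER"), (15, 20, "LOC")]),
    ("No entities here.", []),
]
jsonl = to_doccano_jsonl(examples)
print(jsonl, end="")
```

Validating offsets before upload is a design choice of this sketch: a span that does not line up with the text is much easier to catch here than after annotators have started working on the project.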

## Architecture Overview
doccano is a Python application built with Django on the backend and Vue.js on the frontend. It stores projects, documents, and annotations in PostgreSQL, or in SQLite for small deployments. The web server runs alongside a separate Celery worker that handles background tasks such as data import and export. A REST API exposes the annotation operations for programmatic access.
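
As a rough illustration of programmatic access, the sketch below prepares a login request with the Python standard library without sending it. The `/v1/auth/login/` path and the JSON body shape are assumptions, not something this document confirms; check them against the API of your running instance before relying on them. `build_login_request` is a hypothetical helper.

```python
import json
import urllib.request

# Assumption: the API is served under /v1 and accepts a JSON login at this path.
LOGIN_PATH = "/v1/auth/login/"


def build_login_request(base_url, username, password):
    """Prepare (but do not send) a JSON login request."""
    body = json.dumps({"username": username, "password": password}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + LOGIN_PATH,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_login_request("http://localhost:8000", "admin", "changeme")
print(request.get_method(), request.full_url)
```

Passing the prepared request to `urllib.request.urlopen` against a running server would perform the actual login. For real scripts, the separately published `doccano-client` package is likely a better fit than hand-built requests.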

## Self-Hosting & Configuration
- Install via pip or run with Docker: `docker compose up`
- SQLite works for evaluation; use PostgreSQL for production multi-user setups
- Configure authentication backends including LDAP and social login providers
- Set up role-based access control with project admin, annotator, and annotation approver roles
- Back up the database and media directory for data persistence
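
For the SQLite case, a backup can be sketched with Python's built-in `sqlite3` backup API, which copies a database consistently even while it is in use. The paths below are throwaway demo files and `backup_sqlite` is a hypothetical helper; point it at wherever your deployment keeps its database file. PostgreSQL deployments would use `pg_dump` instead.

```python
import sqlite3
import tempfile
from pathlib import Path


def backup_sqlite(source: Path, destination: Path) -> None:
    """Copy a SQLite database using the online backup API."""
    src = sqlite3.connect(source)
    dst = sqlite3.connect(destination)
    try:
        src.backup(dst)
    finally:
        dst.close()
        src.close()


# Demo with a throwaway database; replace the paths with your real ones.
with tempfile.TemporaryDirectory() as tmp:
    live = Path(tmp) / "db.sqlite3"
    copy = Path(tmp) / "db.backup.sqlite3"

    conn = sqlite3.connect(live)
    conn.execute("CREATE TABLE notes (body TEXT)")
    conn.execute("INSERT INTO notes VALUES ('hello')")
    conn.commit()
    conn.close()

    backup_sqlite(live, copy)

    # Read the copy back to confirm the data survived.
    check = sqlite3.connect(copy)
    restored = check.execute("SELECT body FROM notes").fetchall()
    check.close()

print(restored)
```

The media directory holds plain files, so an ordinary file copy or archive taken at the same time as the database backup covers it.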

## Key Features
- Three annotation modes: text classification, sequence labeling, and seq-to-seq
- Auto-labeling integration for pre-annotating documents with ML models
- Inter-annotator agreement metrics to measure label consistency across team members
- REST API for programmatic project creation, data upload, and annotation retrieval
- Collaborative features with user assignment, annotation review, and commenting
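
To show what an agreement metric measures, here is a minimal sketch of Cohen's kappa for two annotators who labeled the same documents. It is a generic textbook formula applied to made-up labels, not a description of how doccano computes its own statistics.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set")
    n = len(labels_a)
    # Fraction of documents where the two annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1 - expected)


annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))  # 0.5
```

Here the annotators agree on 3 of 4 documents (0.75), chance agreement is 0.5, so kappa is (0.75 - 0.5) / (1 - 0.5) = 0.5.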

## Comparison with Similar Tools
- **Label Studio** — supports more data types (images, audio, video); doccano focuses exclusively on text
- **Prodigy** — commercial tool by the spaCy team; doccano is free and open source
- **CVAT** — specializes in computer vision annotation; doccano handles text-only tasks
- **Argilla** — newer tool with tighter Hugging Face integration; doccano has a simpler setup

## FAQ
**Q: Can doccano handle multiple annotators on the same dataset?**
A: Yes. You can assign documents to specific annotators and measure inter-annotator agreement to identify labeling inconsistencies.

**Q: Does doccano support pre-annotation?**
A: Yes. The auto-labeling feature lets you connect ML models to generate initial annotations that human annotators can then correct.

**Q: What export formats are available?**
A: doccano exports in JSONL, CSV, and CoNLL formats. JSONL output can be loaded with the Hugging Face `datasets` JSON loader and converted to spaCy training data with a short script.
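
As an illustration of consuming a JSONL export, the sketch below converts sequence-labeling records into `(text, {"entities": [...]})` tuples, the annotation shape spaCy accepts. The `text` and `label` field names are assumptions that have varied across doccano versions, so they are parameters of the hypothetical `jsonl_to_spacy` helper, and the sample record is invented.

```python
import json


def jsonl_to_spacy(jsonl_text, text_key="text", label_key="label"):
    """Convert exported JSONL lines into (text, {"entities": [...]}) tuples."""
    examples = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        entities = [
            (start, end, label)
            for start, end, label in record.get(label_key, [])
        ]
        examples.append((record[text_key], {"entities": entities}))
    return examples


# Assumed export record; field names may differ in your version.
exported = (
    '{"id": 1, "text": "Alice moved to Paris.", '
    '"label": [[0, 5, "PER"], [15, 20, "LOC"]]}\n'
)
training_data = jsonl_to_spacy(exported)
print(training_data)
```

In spaCy v3, each tuple can be turned into a training example with `Example.from_dict(nlp.make_doc(text), annotations)`.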

**Q: How does doccano compare to commercial annotation platforms?**
A: doccano covers the core annotation workflow well for small to medium teams. Commercial platforms may offer more advanced features like active learning and workforce management.

## Sources
- https://github.com/doccano/doccano
- https://doccano.github.io/doccano

---
Source: https://tokrepo.com/en/workflows/asset-51dcb118
Author: Script Depot