Scripts · May 23, 2026 · 1 min read

doccano — Open-Source Text Annotation Tool for Machine Learning

A web-based annotation platform for creating labeled datasets for NLP tasks including text classification, sequence labeling, and sequence-to-sequence problems.

Universal CLI install command
npx tokrepo install 51dcb118-563e-11f1-9bc6-00163e2b0d79

Introduction

doccano is a self-hosted annotation tool for building labeled datasets for natural language processing. It supports text classification (sentiment, topic), sequence labeling (NER, POS tagging), and sequence-to-sequence tasks (translation, summarization). Teams can collaborate on annotation projects with built-in user management and inter-annotator agreement metrics.

What doccano Does

  • Annotates text documents for classification, named entity recognition, and seq-to-seq tasks
  • Supports multi-label and multi-class annotation with customizable label sets
  • Provides keyboard shortcuts for fast annotation workflows
  • Imports data from JSON, JSONL, CSV, TSV, and CoNLL formats
  • Exports labeled datasets in formats compatible with spaCy, Hugging Face, and other ML frameworks
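As a rough illustration of the JSONL import path, the sketch below builds one record per document and serializes them one JSON object per line. The field names `text` and `label`, and the `[start, end, label]` layout for spans, are assumptions about the import schema rather than something stated here, so check them against the import dialog of your doccano version. The helper names are hypothetical.

```python
import json


def classification_record(text, labels):
    """One document with zero or more category labels (assumed schema)."""
    return {"text": text, "label": list(labels)}


def span_record(text, spans):
    """One document with (start, end, label) character spans (assumed schema)."""
    for start, end, _ in spans:
        # Reject spans that do not fall inside the text.
        if not 0 <= start < end <= len(text):
            raise ValueError(f"span ({start}, {end}) is outside the text")
    return {"text": text, "label": [[s, e, name] for s, e, name in spans]}


def to_jsonl(records):
    """Serialize records as JSONL: one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


sentiment = [classification_record("The battery lasts all day.", ["positive"])]
entities = [
    span_record("Ada Lovelace was born in London.", [(0, 12, "PER"), (25, 31, "LOC")])
]
```

Writing `to_jsonl(sentiment)` to a UTF-8 file would give one upload file for a classification project, and `to_jsonl(entities)` one for a sequence labeling project. Keeping the two kinds of record in separate files avoids mixing label shapes within a single project.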

Architecture Overview

doccano is a Python application built with Django on the backend and Vue.js on the frontend. It uses PostgreSQL (or SQLite for small deployments) to store projects, documents, and annotations. The web server runs alongside a Celery worker that handles background tasks such as data import and export. A REST API gives programmatic access to annotation operations.

Self-Hosting & Configuration

  • Install with pip (pip install doccano) or run it with Docker Compose (docker compose up)
  • SQLite works for evaluation; use PostgreSQL for production multi-user setups
  • Configure authentication backends including LDAP and social login providers
  • Set up role-based access control with project admin, annotator, and annotation approver roles
  • Back up the database and media directory for data persistence
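For a small SQLite-based deployment, the backup step in the list above could look roughly like the sketch below. It uses Python's standard `sqlite3` backup API and `shutil.make_archive`. The function name and all paths are hypothetical, the location of the database file and media directory depends on how doccano was installed, and a PostgreSQL deployment would need a database dump tool instead.

```python
import shutil
import sqlite3
from datetime import datetime, timezone
from pathlib import Path


def backup_sqlite_and_media(db_path, media_dir, dest_dir):
    """Copy a SQLite database and archive a media directory into dest_dir."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # The online backup API produces a consistent copy even if the
    # database is written to while the backup is running.
    db_copy = dest / f"db-{stamp}.sqlite3"
    source = sqlite3.connect(str(db_path))
    target = sqlite3.connect(str(db_copy))
    try:
        source.backup(target)
    finally:
        target.close()
        source.close()

    # Archive uploaded files next to the database copy.
    archive = shutil.make_archive(
        str(dest / f"media-{stamp}"), "gztar", root_dir=str(media_dir)
    )
    return db_copy, Path(archive)
```

Keeping the database copy and the media archive under the same timestamp makes it easier to restore a matching pair. Note that `sqlite3.connect` creates an empty file if `db_path` does not exist, so the path should be verified before scheduling this.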

Key Features

  • Three annotation modes: text classification, sequence labeling, and seq-to-seq
  • Auto-labeling integration for pre-annotating documents with ML models
  • Inter-annotator agreement metrics to measure label consistency across team members
  • REST API for programmatic project creation, data upload, and annotation retrieval
  • Collaborative features with user assignment, annotation review, and commenting
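To make the idea of an agreement metric concrete, here is a minimal sketch of Cohen's kappa for two annotators who labeled the same documents. Which metric doccano itself reports is not specified here. Kappa is used only as a common illustrative choice, and the function name is hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items in order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: share of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement by chance, from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


annotator_1 = ["pos", "pos", "neg", "neg", "pos", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg", "pos", "pos"]
# 4 of 6 items match and chance agreement is 0.5, so kappa is about 0.33.
kappa = cohen_kappa(annotator_1, annotator_2)
```

A value near 1 indicates strong agreement, while a value near 0 means the annotators agree about as often as chance would predict. A low value usually points to unclear labeling guidelines.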

Comparison with Similar Tools

  • Label Studio — supports more data types (images, audio, video); doccano focuses exclusively on text
  • Prodigy — commercial tool by the spaCy team; doccano is free and open source
  • CVAT — specializes in computer vision annotation; doccano handles text-only tasks
  • Argilla — newer tool with tighter Hugging Face integration; doccano has a simpler setup

FAQ

Q: Can doccano handle multiple annotators on the same dataset?
A: Yes. You can assign documents to specific annotators and measure inter-annotator agreement to identify labeling inconsistencies.

Q: Does doccano support pre-annotation?
A: Yes. The auto-labeling feature lets you connect ML models to generate initial annotations that human annotators can then correct.

Q: What export formats are available?
A: doccano exports in JSONL, CSV, and CoNLL formats. JSONL files can be loaded by Hugging Face datasets as JSON data and need only light conversion for spaCy training.
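As a sketch of what post-processing an export might involve, the function below parses sequence labeling JSONL into `(text, {"entities": [...]})` pairs, a layout often used as input when building spaCy training examples. It assumes each exported line has a `text` field and a `label` field holding `[start, end, label]` triples. Export layouts can differ between doccano versions and project settings, so treat the field names as assumptions. The function name is hypothetical.

```python
import json


def jsonl_to_span_examples(jsonl_text):
    """Parse JSONL lines into (text, {"entities": [(start, end, label)]}) pairs."""
    examples = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines in the export
        record = json.loads(line)
        text = record["text"]
        entities = []
        for start, end, label in record.get("label", []):
            # Fail early on offsets that do not fit the text.
            if not 0 <= start < end <= len(text):
                raise ValueError(
                    f"line {line_no}: span ({start}, {end}) is out of range"
                )
            entities.append((start, end, label))
        examples.append((text, {"entities": sorted(entities)}))
    return examples


sample_export = (
    '{"text": "Ada Lovelace was born in London.", '
    '"label": [[25, 31, "LOC"], [0, 12, "PER"]]}\n'
)
examples = jsonl_to_span_examples(sample_export)
```

Validating offsets at load time catches misaligned spans before they reach a training pipeline, where they tend to surface as harder-to-trace errors.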

Q: How does doccano compare to commercial annotation platforms?
A: doccano covers the core annotation workflow well for small to medium teams. Commercial platforms may offer more advanced features like active learning and workforce management.
