Introduction
Hugging Face Tokenizers is a high-performance tokenization library written in Rust with bindings for Python, Node.js, and Ruby. It provides implementations of the most common subword tokenization algorithms and can train a new tokenizer on a large corpus in seconds, making it the foundation for tokenization in the Hugging Face ecosystem.
What Hugging Face Tokenizers Does
- Implements BPE, WordPiece, Unigram, and WordLevel tokenization models (BPE and Unigram cover the algorithms popularized by SentencePiece)
- Encodes text extremely quickly via the Rust core: tokenizing a gigabyte of text takes less than 20 seconds on a server CPU
- Trains new tokenizers from raw text corpora in seconds
- Handles pre-tokenization, normalization, post-processing, and decoding as a configurable pipeline
- Provides full alignment tracking, mapping every token back to its character offsets in the original text (see the sketch after this list)
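For example, loading a tokenizer from the Hub and inspecting its offsets takes a few lines. This is a minimal sketch; the bert-base-uncased checkpoint and the sample sentence are illustrative:

```python
from tokenizers import Tokenizer

# Load a pre-trained tokenizer from the Hugging Face Hub.
tok = Tokenizer.from_pretrained("bert-base-uncased")

enc = tok.encode("Tokenizers are fast!")
print(enc.tokens)   # subword tokens, e.g. ['[CLS]', 'token', '##izer', ...]
print(enc.ids)      # vocabulary ids for each token
print(enc.offsets)  # (start, end) character spans into the original text
```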
Architecture Overview
The library is structured as a pipeline with five components: normalizer (Unicode normalization, lowercasing), pre-tokenizer (whitespace splitting, byte-level), model (BPE, WordPiece, or Unigram), post-processor (special tokens, template formatting), and decoder (converts tokens back to text). Each component is independently configurable. The Rust core handles the performance-critical encoding loop, with a thin Python wrapper via PyO3.
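A minimal sketch of assembling that pipeline by hand; the special-token ids passed to TemplateProcessing are illustrative and would normally come from the trained vocabulary:

```python
from tokenizers import Tokenizer, decoders
from tokenizers.models import BPE
from tokenizers.normalizers import NFC, Lowercase, Sequence
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.processors import TemplateProcessing

# Model: the core subword algorithm.
tok = Tokenizer(BPE(unk_token="[UNK]"))
# Normalizer: Unicode normalization, then lowercasing.
tok.normalizer = Sequence([NFC(), Lowercase()])
# Pre-tokenizer: byte-level splitting before the model runs.
tok.pre_tokenizer = ByteLevel()
# Post-processor: wrap each sequence in special tokens.
tok.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
)
# Decoder: map byte-level tokens back to readable text.
tok.decoder = decoders.ByteLevel()
```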
Self-Hosting & Configuration
- Install with pip install tokenizers (pre-built wheels for Linux, macOS, and Windows)
- Load pre-trained tokenizers from the Hugging Face Hub with Tokenizer.from_pretrained()
- Train a custom tokenizer with BpeTrainer, WordPieceTrainer, or UnigramTrainer (see the sketch after this list)
- Configure the full pipeline (normalizer, pre-tokenizer, model, post-processor) via the Python API
- Serialize trained tokenizers to JSON for portable deployment
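Putting the steps above together, training and serializing a small BPE tokenizer might look like this; the file paths, vocabulary size, and special tokens are illustrative:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tok = Tokenizer(BPE(unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()

# Configure vocabulary size and special tokens, then train from raw text.
trainer = BpeTrainer(vocab_size=30000, special_tokens=["[UNK]", "[CLS]", "[SEP]"])
tok.train(files=["corpus-part1.txt", "corpus-part2.txt"], trainer=trainer)

# Serialize the full pipeline to a single portable JSON file.
tok.save("my-tokenizer.json")
```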
Key Features
- Rust-powered encoding that runs orders of magnitude faster than pure Python tokenizers
- Full offset mapping preserves character-level alignment for NER and span extraction
- Batch encoding with parallelism for high-throughput data preprocessing
- Modular pipeline design allows mixing and matching components (see the sketch after this list)
- Seamless integration with Hugging Face Transformers AutoTokenizer
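As an example of the mix-and-match design, one component of a loaded tokenizer can be replaced in isolation. The checkpoint name and normalizer choice here are illustrative:

```python
from tokenizers import Tokenizer
from tokenizers.normalizers import NFKC, Lowercase, Sequence

tok = Tokenizer.from_pretrained("bert-base-uncased")
# Swap in a different normalizer; the model, pre-tokenizer,
# post-processor, and decoder are left untouched.
tok.normalizer = Sequence([NFKC(), Lowercase()])
```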
Comparison with Similar Tools
- SentencePiece — C++ library by Google; similar algorithms but less modular and no offset tracking
- tiktoken — OpenAI BPE tokenizer; fast but limited to BPE and no training API
- spaCy tokenizer — rule-based word tokenizer; not designed for subword tokenization
- NLTK tokenizers — classic NLP tokenizers; word-level only, much slower
- Transformers AutoTokenizer — wraps this library and adds model-specific config loading
FAQ
Q: How fast is encoding compared to Python tokenizers?
A: Very fast: the Rust core can tokenize a gigabyte of text in under 20 seconds on a server CPU, typically orders of magnitude faster than equivalent pure Python implementations.
Q: Can I train a tokenizer on my own corpus?
A: Yes. Instantiate a trainer (e.g., BpeTrainer), configure vocab size and special tokens, and call tokenizer.train() with your text files.
Q: Does it support batch encoding with parallelism?
A: Yes. The encode_batch() method processes multiple inputs in parallel using Rust threads.
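A minimal sketch of batch encoding; the checkpoint and input strings are illustrative:

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("bert-base-uncased")

# encode_batch() parallelizes across inputs in the Rust core.
encodings = tok.encode_batch(["first document", "second document", "third document"])
print([e.tokens for e in encodings])
```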
Q: How do I use a trained tokenizer with Transformers?
A: Save with tokenizer.save(), wrap the resulting JSON file in Transformers' PreTrainedTokenizerFast, and call save_pretrained() so that AutoTokenizer.from_pretrained(path) can load it (see the sketch below).
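A sketch of the round trip, assuming a tokenizer already saved to my-tokenizer.json; the file path and directory name are illustrative:

```python
from transformers import AutoTokenizer, PreTrainedTokenizerFast

# Wrap the serialized tokenizer for use with Transformers.
fast = PreTrainedTokenizerFast(tokenizer_file="my-tokenizer.json")
fast.save_pretrained("my-model-dir")

# Anywhere else, load it back through the standard auto class.
tok = AutoTokenizer.from_pretrained("my-model-dir")
```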