# Hugging Face Tokenizers — Fast Text Tokenization for ML Pipelines

> Hugging Face Tokenizers is a Rust-powered tokenization library with Python bindings that implements BPE, WordPiece, and Unigram (SentencePiece-compatible) tokenizers with training and encoding speeds of gigabytes per second. It serves as the backbone for model tokenization in Hugging Face Transformers.

## Install

```bash
pip install tokenizers
```

## Quick Use

```bash
python -c "from tokenizers import Tokenizer; t = Tokenizer.from_pretrained('bert-base-uncased'); print(t.encode('Hello world').tokens)"
```

## Introduction

Hugging Face Tokenizers is a high-performance tokenization library written in Rust with bindings for Python, Node.js, and Ruby. It provides implementations of the most common subword tokenization algorithms and can train a new tokenizer on a large corpus in seconds, making it the foundation for tokenization across the Hugging Face ecosystem.

## What Hugging Face Tokenizers Does

- Implements the BPE, WordPiece, and Unigram subword tokenization algorithms, covering SentencePiece-style models
- Encodes text at gigabytes per second via the Rust core
- Trains new tokenizers from raw text corpora in seconds
- Handles normalization, pre-tokenization, post-processing, and decoding as a configurable pipeline
- Provides full alignment tracking between original text and token offsets

## Architecture Overview

The library is structured as a pipeline with five components: a normalizer (Unicode normalization, lowercasing), a pre-tokenizer (whitespace or byte-level splitting), a model (BPE, WordPiece, or Unigram), a post-processor (special tokens, template formatting), and a decoder (converting tokens back to text). Each component is independently configurable. The Rust core handles the performance-critical encoding loop, exposed to Python through a thin PyO3 wrapper.
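The pipeline described above can be assembled end to end with the Python API. Here is a minimal sketch that wires a normalizer, a pre-tokenizer, and a BPE model together, then trains on a tiny in-memory corpus; the corpus, vocabulary size, and special-token choices are illustrative, not prescriptive:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.normalizers import NFC, Lowercase, Sequence
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Model component: BPE with an explicit unknown token
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# Normalizer component: Unicode NFC normalization followed by lowercasing
tokenizer.normalizer = Sequence([NFC(), Lowercase()])
# Pre-tokenizer component: split on whitespace and punctuation
tokenizer.pre_tokenizer = Whitespace()

# Train the BPE merges on an illustrative toy corpus
trainer = BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
corpus = ["Hello world", "Hello tokenizers", "Tokenizers are fast"]
tokenizer.train_from_iterator(corpus, trainer)

encoding = tokenizer.encode("Hello World")
print(encoding.tokens)  # lowercased subword tokens produced by the trained model
```

With a vocabulary budget this large relative to the toy corpus, BPE exhausts all merges and the frequent words come back as whole tokens; in real use the corpus would be files passed to `tokenizer.train()`.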
## Self-Hosting & Configuration

- Install with `pip install tokenizers` (pre-built wheels for Linux, macOS, and Windows)
- Load pre-trained tokenizers from the Hugging Face Hub with `Tokenizer.from_pretrained()`
- Train a custom tokenizer with `BpeTrainer`, `WordPieceTrainer`, or `UnigramTrainer`
- Configure the full pipeline (normalizer, pre-tokenizer, model, post-processor) via the Python API
- Serialize trained tokenizers to JSON for portable deployment

## Key Features

- Rust-powered encoding at GB/s speeds, orders of magnitude faster than pure Python
- Full offset mapping preserves character-level alignment for NER and span extraction
- Batch encoding with parallelism for high-throughput data preprocessing
- Modular pipeline design allows mixing and matching components
- Seamless integration with the Hugging Face Transformers `AutoTokenizer`

## Comparison with Similar Tools

- **SentencePiece** — C++ library by Google; similar algorithms but less modular and no offset tracking
- **tiktoken** — OpenAI's BPE tokenizer; fast but limited to BPE, with no training API
- **spaCy tokenizer** — rule-based word tokenizer; not designed for subword tokenization
- **NLTK tokenizers** — classic NLP tokenizers; word-level only and much slower
- **Transformers AutoTokenizer** — wraps this library and adds model-specific config loading

## FAQ

**Q: How fast is encoding compared to Python tokenizers?**
A: The Rust core processes text at 1-2 GB/s, typically 10-100x faster than equivalent pure Python implementations.

**Q: Can I train a tokenizer on my own corpus?**
A: Yes. Instantiate a trainer (e.g., `BpeTrainer`), configure the vocabulary size and special tokens, and call `tokenizer.train()` with your text files.

**Q: Does it support batch encoding with parallelism?**
A: Yes. The `encode_batch()` method processes multiple inputs in parallel using Rust threads.
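As a concrete illustration of `encode_batch()` and offset tracking, the sketch below builds a trivial word-level tokenizer from a throwaway in-memory corpus so the example needs no network access; the corpus is purely illustrative, and a real pipeline would use a trained subword tokenizer instead:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer

# A minimal word-level tokenizer trained on a toy corpus
tok = Tokenizer(WordLevel(unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()
tok.train_from_iterator(["hello world", "fast rust core"],
                        WordLevelTrainer(special_tokens=["[UNK]"]))

# encode_batch processes the inputs in parallel in the Rust core
encodings = tok.encode_batch(["hello world", "fast rust core"])
for enc in encodings:
    # offsets are (start, end) character spans into the original string,
    # which is what makes NER and span extraction alignment possible
    print(enc.tokens, enc.offsets)
```

Each `Encoding` in the returned list keeps its own tokens, IDs, and character offsets, so batch preprocessing preserves the same alignment guarantees as single-input encoding.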
**Q: How do I use a trained tokenizer with Transformers?**
A: Save it with `tokenizer.save("tokenizer.json")`, then load it in Transformers with `PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")`, or place the saved files in a model directory and use `AutoTokenizer.from_pretrained(path)`.

## Sources

- https://github.com/huggingface/tokenizers
- https://huggingface.co/docs/tokenizers/

---

Source: https://tokrepo.com/en/workflows/2b346f1b-42ba-11f1-9bc6-00163e2b0d79
Author: AI Open Source