Introduction
SentencePiece treats input as a raw sequence of Unicode characters, whitespace included, rather than as pre-tokenized words, which makes it language-independent. It trains subword vocabularies with the BPE or unigram language model algorithms directly from raw sentences, and it is the tokenizer behind T5, LLaMA 1 and 2, Gemma, and many other LLMs.
What SentencePiece Does
- Trains subword tokenization models from raw text without pre-tokenization or language rules
- Implements both Byte-Pair Encoding (BPE) and unigram language model algorithms
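- Encodes and decodes text reversibly: whitespace is kept as an ordinary symbol, so nothing is lost beyond the normalization step
- Operates on Unicode characters, with optional byte-level fallback, supporting all languages including those written without word boundaries, such as Chinese and Japanese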
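- Ships as a C++ library with official Python bindings; TensorFlow Text provides a SentencePiece tokenizer, and community-maintained wrappers exist for other languages such as Java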
Architecture Overview
SentencePiece models are trained offline and serialized as protocol buffer files. At runtime, the encoder normalizes the input, escapes whitespace as the meta symbol ▁ (U+2581), and then applies the learned merge rules (BPE) or finds the most probable segmentation (unigram) to split the text into subword pieces. The model maps each piece to an integer ID. Because whitespace is carried inside the pieces themselves, the decoder can concatenate the pieces and restore the spaces without any language-specific rules.
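The unigram path described above can be illustrated with a small pure-Python sketch. It does not use the sentencepiece library: the vocabulary, its log-probabilities, and the helper names (`escape`, `viterbi_segment`, `encode`, `decode`) are all hypothetical, chosen only to show how whitespace escaping, most-probable segmentation, and ID mapping fit together.

```python
import math

WS = "\u2581"  # the whitespace meta symbol "▁" used by SentencePiece

# Hypothetical toy vocabulary: piece -> log-probability (values are made up).
PIECES = {
    WS + "hello": -2.0,
    WS + "world": -2.5,
    WS: -3.0,
    "hell": -4.0,
    "o": -4.5,
    "wor": -4.0,
    "ld": -4.2,
    "h": -6.0, "e": -6.0, "l": -6.0, "w": -6.0, "r": -6.0, "d": -6.0,
}
PIECE_TO_ID = {piece: i for i, piece in enumerate(PIECES)}
ID_TO_PIECE = {i: piece for piece, i in PIECE_TO_ID.items()}


def escape(text: str) -> str:
    """Prefix a dummy space and replace every space with the meta symbol."""
    return WS + text.replace(" ", WS)


def viterbi_segment(escaped: str) -> list[str]:
    """Return the highest-scoring split of `escaped` into known pieces."""
    n = len(escaped)
    best = [-math.inf] * (n + 1)  # best[i]: best score for escaped[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = escaped[start:end]
            if piece in PIECES and best[start] + PIECES[piece] > best[end]:
                best[end] = best[start] + PIECES[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text contains characters outside the toy vocabulary")
    pieces, end = [], n
    while end > 0:  # walk the back-pointers from the end of the string
        pieces.append(escaped[back[end]:end])
        end = back[end]
    return pieces[::-1]


def encode(text: str) -> list[int]:
    return [PIECE_TO_ID[p] for p in viterbi_segment(escape(text))]


def decode(ids: list[int]) -> str:
    joined = "".join(ID_TO_PIECE[i] for i in ids)
    return joined.replace(WS, " ")[1:]  # drop the dummy leading space
```

With this toy vocabulary, `viterbi_segment(escape("hello world"))` picks the two whole-word pieces over any combination of shorter ones, and `decode(encode("hello world"))` returns the original string. A real model learns its pieces and scores from data and also normalizes the input first, which this sketch omits.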
Self-Hosting & Configuration
- Install Python bindings: pip install sentencepiece
- Build from source for C++ embedding: cmake and make in the repository
- Train models with configurable vocab_size, model_type (bpe/unigram), and character_coverage
- Control special tokens: bos_id, eos_id, pad_id, unk_id
- Use byte_fallback=true to handle any UTF-8 character even outside the vocabulary
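As a rough illustration of how the options above combine, the following sketch collects them into keyword arguments for the Python trainer. The helper names `build_training_args` and `train_and_load`, the file names, and the specific ID assignments are assumptions made for this example. The sentencepiece import is done lazily inside the function so the configuration helper can be inspected without the package installed.

```python
def build_training_args(corpus: str, prefix: str, vocab_size: int = 8000) -> dict:
    """Collect keyword arguments for SentencePieceTrainer.train()."""
    return {
        "input": corpus,              # plain-text file, one sentence per line
        "model_prefix": prefix,       # training writes <prefix>.model and <prefix>.vocab
        "vocab_size": vocab_size,
        "model_type": "unigram",      # or "bpe"
        "character_coverage": 0.9995,
        "unk_id": 0,
        "bos_id": 1,
        "eos_id": 2,
        "pad_id": 3,                  # padding is disabled (-1) unless set explicitly
        "byte_fallback": True,
    }


def train_and_load(corpus: str, prefix: str, vocab_size: int = 8000):
    """Train a model and return a processor loaded from the resulting file."""
    import sentencepiece as spm  # requires `pip install sentencepiece`

    spm.SentencePieceTrainer.train(**build_training_args(corpus, prefix, vocab_size))
    return spm.SentencePieceProcessor(model_file=f"{prefix}.model")
```

Assuming a corpus file exists and is large enough for the requested vocabulary size, `sp = train_and_load("corpus.txt", "tok")` followed by `sp.encode("Hello world", out_type=str)` returns the pieces, `sp.encode("Hello world")` returns the IDs, and `sp.decode(ids)` maps IDs back to text.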
Key Features
- Language-agnostic: no language-specific word segmentation is required; text normalization is built in (NFKC-based by default) and configurable
- Reversible: decode(encode(text)) reproduces the normalized text with whitespace preserved
- Byte-fallback mode handles out-of-vocabulary characters without UNK tokens
- Fast C++ core with sub-millisecond encoding for typical sentences
- Framework-agnostic output: plain integer IDs that can be fed to PyTorch, TensorFlow, or JAX
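The byte-fallback idea listed above can be sketched in a few lines of pure Python. This is a simplified illustration, not the library's implementation: it works on single-character pieces only, and the function names are hypothetical. The `<0xNN>` piece format matches the byte pieces SentencePiece adds to the vocabulary when byte fallback is enabled.

```python
def encode_with_byte_fallback(text: str, vocab: set[str]) -> list[str]:
    """Split text into single-character pieces, falling back to byte pieces."""
    pieces = []
    for ch in text:
        if ch in vocab:
            pieces.append(ch)
        else:
            # Emit one piece per UTF-8 byte, formatted like <0xE2>.
            pieces.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return pieces


def decode_with_byte_fallback(pieces: list[str]) -> str:
    """Rebuild the text, turning byte pieces back into raw bytes."""
    out = bytearray()
    for piece in pieces:
        if len(piece) == 6 and piece.startswith("<0x") and piece.endswith(">"):
            out.append(int(piece[3:5], 16))
        else:
            out.extend(piece.encode("utf-8"))
    return out.decode("utf-8")
```

For example, with a vocabulary containing only `a` and `b`, the string `a€b` encodes to `["a", "<0xE2>", "<0x82>", "<0xAC>", "b"]`, because `€` is three bytes in UTF-8, and decoding restores the original string without any unknown token.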
Comparison with Similar Tools
- Hugging Face Tokenizers: Rust-based, with more tokenizer types and faster parallel training; SentencePiece is the original implementation behind many published LLM tokenizers
- tiktoken: OpenAI's byte-level BPE tokenizer, optimized for fast encoding with GPT vocabularies; SentencePiece also trains new models and supports the unigram algorithm
- WordPiece: BERT's algorithm, which expects pre-tokenized words; SentencePiece works on raw text
- BPE (original): the algorithm itself; SentencePiece wraps it with training, serialization, and multi-language support
FAQ
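Q: What is the difference between BPE and unigram mode? A: BPE greedily merges the most frequent adjacent pairs bottom-up. Unigram starts with a large candidate vocabulary and prunes the pieces that contribute least to the corpus likelihood. Because unigram assigns a probability to every piece, it can sample alternative segmentations of the same text (subword regularization), which can make training more robust.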
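To make the bottom-up merging concrete, here is a minimal sketch of BPE merge learning in pure Python. It is a textbook-style simplification rather than SentencePiece's own trainer: it assumes a pre-counted word-frequency table (a hypothetical input format chosen for brevity), whereas SentencePiece works on raw sentences with whitespace escaped.

```python
from collections import Counter


def learn_bpe_merges(words: dict[str, int], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` merge rules from a word -> frequency table."""
    # Represent each word as a tuple of symbols, starting from characters.
    corpus = {tuple(word): freq for word, freq in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # ties: first pair seen wins
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        merged_corpus = {}
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            merged_corpus[key] = merged_corpus.get(key, 0) + freq
        corpus = merged_corpus
    return merges
```

Calling `learn_bpe_merges({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3)` first fuses `e` and `s`, then `es` and `t`, then `l` and `o`, which shows how frequent suffixes such as `est` emerge from repeated pair merges.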
Q: Why do LLMs use SentencePiece instead of word-level tokenization? A: Subword tokenization handles rare words, morphology, and multilingual text with a fixed vocabulary size, avoiding out-of-vocabulary issues.
Q: Can I add custom tokens to an existing model? A: Yes. Use the user_defined_symbols parameter during training, or modify the model proto to add tokens post-hoc.
Q: How large should my vocabulary be? A: 32,000 is common for LLMs. Smaller vocabularies (8k-16k) work for single-language applications; larger (64k-128k) help multilingual models.