Configs · May 24, 2026 · 3 min read

SentencePiece — Language-Independent Subword Tokenizer

An unsupervised text tokenizer and detokenizer by Google that implements BPE and unigram language model algorithms, used as the tokenization backbone for many large language models.


Introduction

SentencePiece treats input as a raw sequence of Unicode characters rather than pre-tokenized words, and handles whitespace as an ordinary symbol, which makes it language-independent. It trains subword vocabularies with the BPE or unigram language model algorithm directly from raw sentences. That design led to its adoption as the tokenizer for T5, LLaMA 1 and 2, Gemma, and many other LLMs.

What SentencePiece Does

  • Trains subword tokenization models from raw text without pre-tokenization or language rules
  • Implements both Byte-Pair Encoding (BPE) and unigram language model algorithms
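  • Encodes and decodes reversibly: decoding the pieces reproduces the normalized input text exactly
  • Operates on Unicode characters, with optional fallback to raw UTF-8 bytes, so it supports all languages, including those written without spaces between words, such as Chinese and Japanese
  • Ships as a C++ library with an official Python wrapper; TensorFlow Text and third-party projects provide bindings for other environments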

Architecture Overview

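SentencePiece models are trained offline and serialized as protocol buffer files. At runtime, the encoder first applies the model's normalization rules and replaces each space with the meta symbol ▁ (U+2581). It then splits the text into subword units, either by applying the learned merges (BPE) or by searching for the most probable segmentation under the learned piece probabilities (unigram). The model maps each subword to an integer ID. Because whitespace is stored inside the pieces themselves, the decoder can reverse the mapping by concatenating the pieces and turning ▁ back into spaces.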
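The encode and decode path described above can be sketched in pure Python. This is a toy illustration of unigram-style segmentation, not the library's implementation: the vocabulary, its scores, and the helper names (`to_meta`, `segment`, `encode`, `decode`) are all hypothetical, and normalization is skipped.

```python
import math

META = "\u2581"  # "▁", the symbol SentencePiece uses to mark whitespace

# Hypothetical toy vocabulary: piece -> (id, log-probability).
# A real model stores thousands of pieces in its .model protobuf.
TOY_VOCAB = {
    META + "hello": (0, -2.0),
    META + "world": (1, -2.5),
    META: (2, -3.0),
    "hel": (3, -4.0),
    "lo": (4, -4.0),
    "wor": (5, -4.5),
    "ld": (6, -4.5),
    "s": (7, -3.5),
}
ID_TO_PIECE = {idx: piece for piece, (idx, _) in TOY_VOCAB.items()}


def to_meta(text):
    """Make whitespace an ordinary symbol: 'a b' -> '▁a▁b'."""
    return META + text.replace(" ", META)


def segment(text):
    """Viterbi search for the highest-scoring split into known pieces."""
    s = to_meta(text)
    n = len(s)
    best = [-math.inf] * (n + 1)  # best[i]: best score for s[:i]
    back = [0] * (n + 1)          # back[i]: start of the last piece in s[:i]
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = s[start:end]
            if piece in TOY_VOCAB and best[start] > -math.inf:
                score = best[start] + TOY_VOCAB[piece][1]
                if score > best[end]:
                    best[end], back[end] = score, start
    if best[n] == -math.inf:
        raise ValueError("text cannot be covered by the toy vocabulary")
    pieces, end = [], n
    while end > 0:
        pieces.append(s[back[end]:end])
        end = back[end]
    return pieces[::-1]


def encode(text):
    """Map each piece of the best segmentation to its integer ID."""
    return [TOY_VOCAB[p][0] for p in segment(text)]


def decode(ids):
    """Concatenate pieces, restore spaces, and drop the leading dummy space."""
    joined = "".join(ID_TO_PIECE[i] for i in ids)
    return joined.replace(META, " ")[1:]
```

With this toy vocabulary, `decode(encode("hello worlds"))` returns the original string, because the space survives inside the piece "▁world" rather than being discarded during splitting.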

Self-Hosting & Configuration

  • Install Python bindings: pip install sentencepiece
  • Build from source to embed the C++ library: run cmake and make in a checkout of the repository
  • Train models with configurable vocab_size, model_type (bpe/unigram), and character_coverage
  • Control special tokens: bos_id, eos_id, pad_id, unk_id
  • Use byte_fallback=true to handle any UTF-8 character even outside the vocabulary
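As a rough sketch, the options listed above can be gathered into one set of keyword arguments for the Python trainer. The helper names (`build_train_kwargs`, `train_and_load`), the file paths, and the specific ID assignments are assumptions made for illustration; only the option names come from the list above.

```python
def build_train_kwargs(corpus_path, model_prefix, vocab_size=32000,
                       model_type="unigram", byte_fallback=True):
    """Collect the trainer options named above into one dict (hypothetical helper)."""
    if model_type not in ("bpe", "unigram"):
        raise ValueError("model_type must be 'bpe' or 'unigram'")
    return {
        "input": corpus_path,          # raw text, one sentence per line
        "model_prefix": model_prefix,  # trainer writes <prefix>.model and <prefix>.vocab
        "vocab_size": vocab_size,
        "model_type": model_type,
        "character_coverage": 0.9995,  # share of characters the vocabulary must cover
        "byte_fallback": byte_fallback,
        # Special-token IDs; this particular assignment is an illustrative choice.
        "unk_id": 0,
        "bos_id": 1,
        "eos_id": 2,
        "pad_id": 3,
    }


def train_and_load(**kwargs):
    """Train a model and load it. Requires `pip install sentencepiece`."""
    import sentencepiece as spm  # lazy import: build_train_kwargs works without it
    spm.SentencePieceTrainer.train(**kwargs)
    return spm.SentencePieceProcessor(model_file=kwargs["model_prefix"] + ".model")
```

A typical call would look like `sp = train_and_load(**build_train_kwargs("corpus.txt", "tok", vocab_size=8000))`, followed by `sp.encode("Hello world", out_type=str)` to inspect the pieces. The corpus has to be large enough to support the requested vocabulary size, otherwise training fails.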

Key Features

  • Language-agnostic: no language-specific word segmentation rules required; Unicode normalization (NFKC-based by default) is built in and configurable
  • Reversible: decode(encode(text)) returns the normalized text, including its whitespace
  • Byte-fallback mode represents out-of-vocabulary characters as UTF-8 byte pieces instead of UNK tokens
  • Fast C++ core; the project README cites a segmentation speed of around 50k sentences per second
  • Produces plain integer IDs, so it works with any ML framework, including PyTorch, TensorFlow, and JAX
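The byte-fallback feature listed above can be illustrated with a small sketch. It works at character level for simplicity, whereas a real model falls back only after subword matching fails; the function names and the `known_chars` set are hypothetical. The `<0xNN>` spelling follows the way SentencePiece names its byte pieces.

```python
def byte_fallback_pieces(text, known_chars):
    """Emit characters in the vocabulary as-is; others as UTF-8 byte pieces."""
    pieces = []
    for ch in text:
        if ch in known_chars:
            pieces.append(ch)
        else:
            # One piece per byte of the character's UTF-8 encoding.
            pieces.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return pieces


def pieces_to_text(pieces):
    """Reassemble text, turning byte pieces back into the bytes they name."""
    buf = bytearray()
    for p in pieces:
        if len(p) == 6 and p.startswith("<0x") and p.endswith(">"):
            buf.append(int(p[3:5], 16))
        else:
            buf.extend(p.encode("utf-8"))
    return buf.decode("utf-8")
```

For a vocabulary that knows only "c", "a", and "f", the word "café" becomes the three known characters followed by two byte pieces for "é", and `pieces_to_text` restores the original string. No UNK token is needed because every character has some UTF-8 byte sequence.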

Comparison with Similar Tools

  • Hugging Face Tokenizers: Rust-based, with more tokenizer types and parallel training; SentencePiece is the original implementation behind the tokenizers of many published LLMs
  • tiktoken: OpenAI's byte-level BPE tokenizer for GPT models, focused on fast encoding with existing vocabularies; SentencePiece also trains new models and supports the unigram algorithm
  • WordPiece: BERT's algorithm, which expects input already split into words; SentencePiece works on raw text
  • BPE (original): the algorithm itself; SentencePiece wraps it with training, serialization, and multi-language support

FAQ

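Q: What is the difference between BPE and unigram mode? A: BPE builds the vocabulary bottom-up by greedily merging the most frequent adjacent pair of symbols. Unigram works top-down: it starts from a large candidate vocabulary and prunes the pieces whose removal least reduces the corpus likelihood. Because unigram assigns a probability to every piece, it can also sample alternative segmentations of the same text (subword regularization), which can make downstream models more robust.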
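The BPE side of this answer can be shown with a minimal sketch of greedy merge learning. It is a textbook-style toy rather than SentencePiece's own trainer: it works on a pre-split word list for brevity, and the function name and tie-breaking rule are assumptions.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Greedy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Each word starts as a tuple of single characters, with its frequency.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Highest count wins; ties go to the lexicographically smaller pair.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        merged = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        corpus = merged
    return merges
```

On the word list `["low", "low", "lower", "lowest"]`, three merges yield `("l", "o")`, then `("lo", "w")`, then `("low", "e")`: each step only looks at current pair counts, which is what "greedy" means here. Unigram training has no such merge list and instead keeps a score per piece.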

Q: Why do LLMs use SentencePiece instead of word-level tokenization? A: Subword tokenization handles rare words, morphology, and multilingual text with a fixed vocabulary size, avoiding out-of-vocabulary issues.

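Q: Can I add custom tokens to an existing model? A: Yes, in two ways. If you can retrain, pass the tokens through the user_defined_symbols parameter so they are always kept as single pieces. For a model that is already trained, edit the serialized model proto to append the new pieces.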

Q: How large should my vocabulary be? A: 32,000 is common for LLMs. Smaller vocabularies (8k-16k) work for single-language applications; larger (64k-128k) help multilingual models.
