Configs · May 24, 2026 · 3 min read

SentencePiece — Language-Independent Subword Tokenizer

An unsupervised text tokenizer and detokenizer by Google that implements BPE and unigram language model algorithms, used as the tokenization backbone for many large language models.

Universal CLI command
npx tokrepo install b92244fc-578c-11f1-9bc6-00163e2b0d79

Introduction

SentencePiece trains directly on raw sentences rather than pre-tokenized words. It treats the input as a sequence of Unicode characters and handles whitespace as an ordinary symbol (escaped as ▁, U+2581), which is what makes it language-independent. It learns subword vocabularies with the BPE or unigram language model algorithm, and it is the tokenizer behind T5, the original LLaMA models, Gemma, and many other LLMs.

What SentencePiece Does

  • Trains subword tokenization models from raw text without pre-tokenization or language rules
  • Implements both Byte-Pair Encoding (BPE) and unigram language model algorithms
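  • Encodes and decodes text reversibly: decoding restores the normalized input exactly
  • Operates on Unicode characters, with optional byte fallback, so languages without explicit word boundaries (such as Chinese and Japanese) need no special handling
  • Ships as a C++ library with an official Python wrapper; TensorFlow users typically load models through TensorFlow Text, and bindings for other languages such as Java are community-maintained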

Architecture Overview

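SentencePiece models are trained offline and serialized as protocol buffer files. At runtime, the encoder normalizes the input, escapes whitespace as the meta symbol ▁ (U+2581), and then splits the text into subword units, either by applying the learned merge rules (BPE) or by finding the most probable segmentation under the unigram language model. The model maps each subword piece to an integer ID. Because whitespace is kept inside the pieces themselves, the decoder can restore the normalized text by mapping IDs back to pieces, concatenating them, and turning ▁ back into spaces.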
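
To make the unigram path concrete, here is a minimal pure-Python sketch of most-probable segmentation (Viterbi search) over a toy vocabulary, followed by ID lookup and decoding. The pieces, their probabilities, and the helper names are all hypothetical; the real library stores its vocabulary in the model file and also applies normalization, which this sketch skips.

```python
import math

# Hypothetical toy vocabulary: piece -> log-probability (values are made up).
LOG_PROBS = {
    "▁new": math.log(0.10),
    "▁token": math.log(0.10),
    "▁to": math.log(0.10),
    "ken": math.log(0.05),
    "s": math.log(0.10),
}
# Single-character pieces so that every input can be segmented somehow.
for ch in "▁newtoks":
    LOG_PROBS.setdefault(ch, math.log(0.01))

PIECE_TO_ID = {piece: idx for idx, piece in enumerate(LOG_PROBS)}
ID_TO_PIECE = {idx: piece for piece, idx in PIECE_TO_ID.items()}


def viterbi_segment(text):
    """Return the piece sequence with the highest total log-probability."""
    n = len(text)
    max_len = max(len(p) for p in LOG_PROBS)
    # best[i] = (best score for text[:i], start index of the last piece)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in LOG_PROBS and best[start][0] > -math.inf:
                score = best[start][0] + LOG_PROBS[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    pieces, end = [], n
    while end > 0:  # walk the back-pointers from the end of the text
        start = best[end][1]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]


def encode(text):
    escaped = "▁" + text.replace(" ", "▁")  # whitespace becomes a symbol
    return [PIECE_TO_ID[p] for p in viterbi_segment(escaped)]


def decode(ids):
    text = "".join(ID_TO_PIECE[i] for i in ids).replace("▁", " ")
    return text[1:] if text.startswith(" ") else text  # drop the dummy prefix


ids = encode("new tokens")
pieces = [ID_TO_PIECE[i] for i in ids]
print(pieces, ids, decode(ids))
```

Decoding needs no extra bookkeeping because the ▁ symbol travels inside the pieces, so concatenation alone recovers the spacing.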

Self-Hosting & Configuration

  • Install Python bindings: pip install sentencepiece
  • Build from source to embed the C++ library: run cmake and make in the repository
  • Train models with configurable vocab_size, model_type (bpe/unigram), and character_coverage
  • Control special tokens: bos_id, eos_id, pad_id, unk_id
  • Set byte_fallback=true to split characters outside the vocabulary into UTF-8 byte pieces rather than mapping them to the unknown token

Key Features

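  • Language-agnostic: no language-specific pre-tokenization or word segmentation required; Unicode normalization is built in and configurable
  • Reversible: decode(encode(text)) returns the normalized form of text (NFKC-based by default), with whitespace recovered from the ▁ symbol
  • Byte-fallback mode represents out-of-vocabulary characters as UTF-8 byte pieces instead of UNK tokens
  • Fast C++ core: the project README cites roughly 50k sentences per second and a memory footprint of about 6 MB
  • Framework-agnostic: the output is plain integer IDs, so it works with PyTorch, TensorFlow, JAX, and other ML frameworks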
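
The byte-fallback idea can be sketched in a few lines of plain Python. This is an illustration of the mechanism rather than the library's implementation: the character-level vocabulary and both function names are hypothetical, and only the `<0xNN>` piece naming follows the convention SentencePiece uses for byte pieces.

```python
# Hypothetical toy vocabulary of known symbols; real vocabularies hold subwords.
KNOWN = set("▁abcdefghijklmnopqrstuvwxyz")


def encode_with_byte_fallback(text):
    """Emit known characters as-is and unknown ones as UTF-8 byte pieces."""
    pieces = []
    for ch in "▁" + text.replace(" ", "▁"):
        if ch in KNOWN:
            pieces.append(ch)
        else:
            pieces.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return pieces


def decode_with_byte_fallback(pieces):
    """Rebuild the text by turning byte pieces back into raw bytes."""
    out = bytearray()
    for piece in pieces:
        if len(piece) == 6 and piece.startswith("<0x") and piece.endswith(">"):
            out.append(int(piece[3:5], 16))
        else:
            out.extend(piece.encode("utf-8"))
    text = out.decode("utf-8").replace("▁", " ")
    return text[1:] if text.startswith(" ") else text


pieces = encode_with_byte_fallback("café")
print(pieces)
print(decode_with_byte_fallback(pieces))
```

Because every character has a UTF-8 encoding, this scheme never needs an unknown token, at the cost of several pieces per unseen character.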

Comparison with Similar Tools

  • Hugging Face Tokenizers: Rust-based, with more tokenizer types and faster parallel training; SentencePiece is the reference implementation that many LLM papers use and distribute tokenizer models in
  • tiktoken: OpenAI's byte-level BPE tokenizer, optimized for fast encoding with GPT vocabularies; SentencePiece adds the unigram model and a full training pipeline
  • WordPiece: BERT's algorithm, which expects pre-tokenized input; SentencePiece works on raw text
  • BPE (original): the algorithm itself; SentencePiece wraps it with training, serialization, and multi-language support

FAQ

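Q: What is the difference between BPE and unigram mode? A: BPE builds the vocabulary bottom-up by greedily merging the most frequent adjacent pair of symbols. Unigram starts from a large candidate vocabulary and prunes the pieces whose removal costs the least likelihood. Because the unigram model is probabilistic, it can sample alternative segmentations of the same text (subword regularization), which is often used to make training more robust.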
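
The greedy, bottom-up nature of BPE training can be shown with a small pure-Python sketch. It is a simplified illustration, not the library's trainer: the word counts are made up, the function name is hypothetical, and ties are broken by taking the lexicographically larger pair, which is an arbitrary choice.

```python
from collections import Counter


def learn_bpe_merges(word_counts, num_merges):
    """Learn merge rules by repeatedly fusing the most frequent adjacent pair."""
    # Start with every word split into single characters.
    words = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, count in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # nothing left to merge
        # Highest count wins; ties go to the lexicographically larger pair.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        merged_words = {}
        for symbols, count in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            merged_words[key] = merged_words.get(key, 0) + count
        words = merged_words
    return merges


# Made-up word frequencies, with ▁ marking the start of each word.
toy_counts = {"▁low": 5, "▁lower": 2, "▁newest": 6, "▁widest": 3}
merges = learn_bpe_merges(toy_counts, 2)
print(merges)
```

Each merge is final once made, which is the sense in which BPE is greedy; unigram training instead scores whole candidate pieces and removes the least useful ones.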

Q: Why do LLMs use SentencePiece instead of word-level tokenization? A: Subword tokenization handles rare words, morphology, and multilingual text with a fixed vocabulary size, avoiding out-of-vocabulary issues.

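Q: Can I add custom tokens to an existing model? A: Yes, in two ways. The user_defined_symbols parameter applies at training time, so it covers the case where you train or retrain the model. For a model that is already trained, modify the serialized model proto to add the tokens post-hoc.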

Q: How large should my vocabulary be? A: 32,000 is common for LLMs. Smaller vocabularies (8k-16k) work for single-language applications; larger (64k-128k) help multilingual models.
