
fastText — Efficient Text Classification and Embeddings by Meta

A library for efficient learning of word representations and text classification, capable of training on billions of words in minutes.

Introduction

fastText is a library from Meta AI Research for efficient text classification and word representation learning. It extends the Word2Vec approach with subword information, enabling it to generate embeddings for out-of-vocabulary words and train classifiers on large datasets in seconds rather than hours.
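
As a minimal sketch of that classification workflow, assuming the official fasttext Python package and a hypothetical labeled file train.txt in fastText's format:

```python
import fasttext

# Train a supervised classifier; train.txt is a placeholder file with
# one "__label__<name> <text>" example per line.
model = fasttext.train_supervised(input="train.txt")

# Predict the top label (and its probability) for a new sentence.
labels, probs = model.predict("this library trains remarkably fast")
print(labels[0], probs[0])
```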

What fastText Does

  • Learns word vectors using subword (character n-gram) information for robust embeddings
  • Trains supervised text classifiers that scale to billions of examples
  • Provides pre-trained word vectors for 157 languages
  • Supports both CBOW and Skip-gram training objectives (see the sketch after this list)
  • Offers quantization to compress models by 10x with minimal accuracy loss
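
For the two unsupervised objectives, a minimal sketch using the Python API, where data.txt is a hypothetical plain-text corpus with one sentence per line:

```python
import fasttext

# Learn word vectors with the Skip-gram objective; pass model="cbow"
# to use the CBOW objective instead.
model = fasttext.train_unsupervised("data.txt", model="skipgram")

# Retrieve the learned vector for any token; subword n-grams mean
# this works even for words absent from the corpus.
vec = model.get_word_vector("example")
print(vec.shape)
```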

Architecture Overview

fastText represents each word as a bag of character n-grams plus the word itself. During training, it learns embeddings for these subword units and composes word vectors by summing them. For classification, it uses a shallow neural network with a linear classifier on top of averaged word embeddings, achieving accuracy competitive with deep models at a fraction of the compute cost. The hierarchical softmax option further speeds up training on datasets with many labels.
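
To make the bag-of-n-grams idea concrete, here is a small sketch that inspects the subword units behind a word vector, assuming a model trained as in the example above:

```python
# Each word is represented by itself plus its character n-grams
# (shown with the < and > word-boundary markers).
subwords, indices = model.get_subwords("unbelievable")
print(subwords[:6])

# A token never seen in training still gets a vector, composed from
# the n-grams it shares with words that were seen.
oov_vec = model.get_word_vector("unbelievablee")  # deliberate typo
print(oov_vec[:5])
```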

Self-Hosting & Configuration

  • Install via pip, conda, or compile from source for C++ CLI tools
  • Pre-trained vectors available for download from the fastText website
  • Training parameters (learning rate, epochs, n-grams) are set via CLI flags (see the sketch after this list)
  • Use quantize to reduce model size for deployment on resource-constrained systems
  • The Python API wraps the C++ core for easy integration into data pipelines
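
A sketch of the Python equivalents of those CLI flags, plus quantization; the file names are placeholders:

```python
import fasttext

# Hyperparameters mirror the CLI flags: -lr, -epoch, -wordNgrams, ...
model = fasttext.train_supervised(
    input="train.txt",
    lr=0.5,          # learning rate
    epoch=25,        # passes over the training data
    wordNgrams=2,    # include word-bigram features
)
model.save_model("model.bin")

# Product quantization shrinks the model for constrained deployments;
# retrain=True fine-tunes after cutting the vocabulary to `cutoff` rows.
model.quantize(input="train.txt", qnorm=True, retrain=True, cutoff=100000)
model.save_model("model.ftz")
```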

Key Features

  • Subword embeddings handle misspellings, morphology, and rare words gracefully
  • Training speed: classifies millions of examples per second on a single CPU core
  • Pre-trained vectors for 157 languages trained on Common Crawl and Wikipedia
  • Automatic hyperparameter tuning via the autotune feature (see the sketch after this list)
  • Model compression through product quantization for mobile and edge deployment
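
Autotune searches hyperparameters against a held-out set; a minimal sketch, where valid.txt is a hypothetical validation file in the same format as train.txt:

```python
import fasttext

# Search learning rate, epochs, n-grams, etc. for up to 10 minutes,
# optimizing performance on the validation file.
model = fasttext.train_supervised(
    input="train.txt",
    autotuneValidationFile="valid.txt",
    autotuneDuration=600,  # seconds
)
print(model.test("valid.txt"))  # (N, precision@1, recall@1)
```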

Comparison with Similar Tools

  • Word2Vec — pioneered word embeddings but lacks subword information; fastText handles OOV words naturally
  • GloVe — global co-occurrence matrix approach; fastText is faster to train and supports subword units
  • spaCy — full NLP pipeline with built-in vectors; fastText focuses purely on embeddings and classification
  • Sentence Transformers — produces contextual sentence embeddings via Transformers; fastText is simpler and faster
  • scikit-learn text classifiers — flexible but slower on large datasets; fastText is optimized for scale

FAQ

Q: Can fastText handle languages with rich morphology? A: Yes. Subword n-grams capture morphological patterns, making it effective for agglutinative languages like Finnish, Turkish, and Korean.

Q: How does fastText compare to Transformer-based embeddings? A: Transformer models produce contextual embeddings and generally achieve higher accuracy on benchmarks, but fastText is orders of magnitude faster and works well when compute or latency budgets are tight.

Q: What format does the training data need? A: For supervised classification, each line should contain one or more labels, each prefixed with __label__, followed by the text. For unsupervised training, plain text with one sentence per line (see the sketch below).
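
For illustration, a sketch that writes a couple of lines in each format; the contents are made up:

```python
# Supervised: one example per line, each label prefixed with __label__
# (a line may carry multiple labels).
with open("train.txt", "w") as f:
    f.write("__label__positive Great pacing and a satisfying ending.\n")
    f.write("__label__negative The plot never comes together.\n")

# Unsupervised: plain text, one sentence per line, no labels.
with open("data.txt", "w") as f:
    f.write("fastText learns vectors from raw sentences like this one.\n")
```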

Q: Is fastText suitable for production use? A: Yes. The C++ core is fast and memory-efficient. Quantized models can run on mobile devices, and the library has been deployed at scale inside Meta.
