Introduction
fastText is a library from Meta AI Research (formerly Facebook AI Research) for efficient text classification and word representation learning. It extends the Word2Vec approach with subword information, enabling it to generate embeddings for out-of-vocabulary words and to train classifiers on large datasets in minutes rather than days.
What fastText Does
- Learns word vectors using subword (character n-gram) information for robust embeddings
- Trains supervised text classifiers that scale to billions of examples
- Provides pre-trained word vectors for 157 languages
- Supports both CBOW and Skip-gram training objectives
- Offers quantization to compress models by 10x with minimal accuracy loss
Architecture Overview
fastText represents each word as a bag of character n-grams plus the word itself. During training, it learns embeddings for these subword units and composes word vectors by summing them. For classification, it uses a shallow neural network with a linear classifier on top of averaged word embeddings, achieving accuracy competitive with deep models at a fraction of the compute cost. The hierarchical softmax option further speeds up training on datasets with many labels.
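The subword composition above can be sketched in a few lines of Python. This is an illustration only, with trigrams and random placeholder embeddings; the real library uses n-grams of lengths 3 to 6 and trained vectors.

```python
import random

def char_ngrams(word, n=3):
    """Character n-grams with boundary markers, plus the full word token."""
    marked = f"<{word}>"
    grams = [marked[i:i + n] for i in range(len(marked) - n + 1)]
    return grams + [marked]

# "where" decomposes into <wh, whe, her, ere, re> plus the token <where>
assert char_ngrams("where") == ["<wh", "whe", "her", "ere", "re>", "<where>"]

# A word vector is the sum of its subword vectors. The embedding table
# here holds random placeholders, not trained values.
table = {}

def word_vector(word, dim=4):
    total = [0.0] * dim
    for gram in char_ngrams(word):
        vec = table.setdefault(gram, [random.uniform(-1, 1) for _ in range(dim)])
        total = [t + v for t, v in zip(total, vec)]
    return total

print(word_vector("where"))  # a 4-dimensional vector
```

Because any string decomposes into n-grams, `word_vector` returns a vector even for words never seen in training, which is the mechanism behind fastText's out-of-vocabulary handling.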
Self-Hosting & Configuration
- Install via pip, conda, or compile from source for C++ CLI tools
- Pre-trained vectors available for download from the fastText website
- Training parameters (learning rate, epochs, n-grams) are set via CLI flags
- Use quantize to reduce model size for deployment on resource-constrained systems
- The Python API wraps the C++ core for easy integration into data pipelines
Key Features
- Subword embeddings handle misspellings, morphology, and rare words gracefully
- Training speed: learns word vectors from a billion-word corpus in under ten minutes on a standard multicore CPU, and classifies half a million sentences in under a minute
- Pre-trained vectors for 157 languages trained on Common Crawl and Wikipedia
- Automatic hyperparameter tuning via the autotune feature
- Model compression through product quantization for mobile and edge deployment
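Part of fastText's bounded memory footprint comes from hashing n-grams into a fixed-size table rather than storing a vocabulary of them. A minimal sketch of that trick, assuming the FNV-1a-style hash fastText's dictionary uses and its default bucket count of 2,000,000:

```python
def fnv1a(s: str) -> int:
    """32-bit FNV-1a hash of a string."""
    h = 2166136261
    for byte in s.encode("utf-8"):
        h = ((h ^ byte) * 16777619) & 0xFFFFFFFF
    return h

def ngram_bucket(gram: str, buckets: int = 2_000_000) -> int:
    """Map an n-gram to a row of the fixed-size subword embedding table."""
    return fnv1a(gram) % buckets

# Any n-gram, even from an unseen word, lands in a valid bucket.
idx = ngram_bucket("<wh")
assert 0 <= idx < 2_000_000
```

Collisions are possible but rare enough in practice not to hurt accuracy, and the payoff is that the subword table's size is fixed regardless of corpus vocabulary.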
Comparison with Similar Tools
- Word2Vec — pioneered word embeddings but lacks subword information; fastText handles OOV words naturally
- GloVe — global co-occurrence matrix approach; fastText is faster to train and supports subword units
- spaCy — full NLP pipeline with built-in vectors; fastText focuses purely on embeddings and classification
- Sentence Transformers — produces contextual sentence embeddings via Transformers; fastText is simpler and faster
- scikit-learn text classifiers — flexible but slower on large datasets; fastText is optimized for scale
FAQ
Q: Can fastText handle languages with rich morphology? A: Yes. Subword n-grams capture morphological patterns, making it effective for agglutinative languages like Finnish, Turkish, and Korean.
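To see why, compare the n-grams shared by two related word forms. A small illustration using Turkish (ev "house": evler "houses", evlerde "in the houses"), with n-gram lengths 3 to 6 as in the default configuration:

```python
def char_ngrams(word, nmin=3, nmax=6):
    """All character n-grams of lengths nmin..nmax, with boundary markers."""
    marked = f"<{word}>"
    return {marked[i:i + n]
            for n in range(nmin, nmax + 1)
            for i in range(len(marked) - n + 1)}

# The inflected form shares its stem n-grams with the base form, so their
# vectors share summands even if one form is rare or unseen in training.
shared = char_ngrams("evler") & char_ngrams("evlerde")
assert {"<ev", "evl", "ler"} <= shared
```

Because the vectors of morphological variants are built from overlapping subword sets, they end up close in embedding space without the model ever seeing every inflection.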
Q: How does fastText compare to Transformer-based embeddings? A: Transformer models produce contextual embeddings and generally achieve higher accuracy on benchmarks, but fastText is orders of magnitude faster and works well when compute or latency budgets are tight.
Q: What format does the training data need? A: For supervised classification, each line should start with one or more labels prefixed with __label__ (e.g. __label__positive), followed by the example text. For unsupervised training, plain text with one sentence per line.
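A sketch of preparing such a training file (the __label__ prefix is the default and can be changed with the -label flag; the labels and example texts below are made up):

```python
# Build a supervised training file: one example per line, each label
# prefixed with __label__ (multiple labels per line are allowed).
samples = [
    ("positive", "great movie , loved every minute"),
    ("negative", "dull plot and wooden acting"),
]
lines = [f"__label__{label} {text}" for label, text in samples]

with open("train.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines) + "\n")

print(lines[0])  # __label__positive great movie , loved every minute
```

Lowercasing and separating punctuation from words (as in the examples above) is a common preprocessing step, since fastText tokenizes on whitespace.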
Q: Is fastText suitable for production use? A: Yes. The C++ core is fast and memory-efficient. Quantized models can run on mobile devices, and the library has been deployed at scale inside Meta.