# fastText — Efficient Text Classification and Embeddings by Meta

> Library for efficient learning of word representations and text classification that trains on billions of words in minutes.

## Quick Use

```bash
# Install the Python bindings
pip install fasttext

# Train a text classifier with the C++ CLI (built from source)
./fasttext supervised -input train.txt -output model

# Predict labels for new examples
./fasttext predict model.bin test.txt
```

## Introduction

fastText is a library from Meta AI Research for efficient text classification and word representation learning. It extends the Word2Vec approach with subword information, enabling it to generate embeddings for out-of-vocabulary words and to train classifiers on large datasets in seconds rather than hours.

## What fastText Does

- Learns word vectors using subword (character n-gram) information for robust embeddings
- Trains supervised text classifiers that scale to billions of examples
- Provides pre-trained word vectors for 157 languages
- Supports both CBOW and skip-gram training objectives
- Offers quantization that compresses models roughly 10x with minimal accuracy loss

## Architecture Overview

fastText represents each word as a bag of character n-grams plus the word itself. During training, it learns embeddings for these subword units and composes word vectors by summing them. For classification, it uses a shallow neural network: a linear classifier on top of averaged word embeddings, achieving accuracy competitive with deep models at a fraction of the compute cost. The hierarchical softmax option further speeds up training on datasets with many labels. Minimal training sketches appear in the Python examples below.

## Self-Hosting & Configuration

- Install via pip, conda, or compile from source for the C++ CLI tools
- Pre-trained vectors are available for download from the fastText website
- Training parameters (learning rate, epochs, n-grams) are set via CLI flags or Python keyword arguments
- Use quantize to reduce model size for deployment on resource-constrained systems (see the quantization sketch in the examples below)
- The Python API wraps the C++ core for easy integration into data pipelines

## Key Features

- Subword embeddings handle misspellings, morphology, and rare words gracefully
- Speed: trains on more than a billion words in under ten minutes and classifies half a million sentences among hundreds of thousands of labels in under a minute on a standard multicore CPU
- Pre-trained vectors for 157 languages trained on Common Crawl and Wikipedia
- Automatic hyperparameter tuning via the autotune feature (sketched in the examples below)
- Model compression through product quantization for mobile and edge deployment
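## Python Examples

The sketches below use the Python bindings (`pip install fasttext`); file names such as `train.txt`, `valid.txt`, and `test.txt` are placeholders for your own data, and the hyperparameter values are illustrative rather than recommended defaults. First, supervised training and prediction, mirroring the Quick Use CLI commands:

```python
import fasttext

# Each line of train.txt carries a label prefixed with __label__
# followed by the example text, e.g.:
#   __label__positive I loved this movie
#   __label__negative Dull plot and flat acting

# Train a classifier; lr, epoch, and wordNgrams mirror the CLI flags
model = fasttext.train_supervised(
    input="train.txt",   # hypothetical training file
    lr=0.5,
    epoch=25,
    wordNgrams=2,        # use word bigrams as well as unigrams
    loss="hs",           # hierarchical softmax, faster with many labels
)

# Evaluate: test() returns (number of examples, precision@1, recall@1)
n, precision, recall = model.test("test.txt")
print(f"P@1 {precision:.3f}  R@1 {recall:.3f} on {n} examples")

# Predict the top 2 labels for a new sentence
labels, probs = model.predict("A gripping, well paced thriller", k=2)
print(labels, probs)

model.save_model("model.bin")
```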
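Word-vector training follows the same pattern. The sketch below shows how subword n-grams produce a vector even for a misspelled, out-of-vocabulary word; `data.txt` is a placeholder corpus with one sentence per line:

```python
import fasttext

# Train skip-gram vectors; model="cbow" selects the other objective
model = fasttext.train_unsupervised(
    "data.txt",        # hypothetical corpus, one sentence per line
    model="skipgram",
    dim=100,
    minn=3,            # shortest character n-gram used for subwords
    maxn=6,            # longest character n-gram
)

# A vector exists even for out-of-vocabulary or misspelled words,
# because it is composed from the word's character n-grams
vec = model.get_word_vector("accomodation")  # deliberate misspelling
print(vec.shape)  # (100,)

# Nearest neighbours by similarity in the learned space
print(model.get_nearest_neighbors("king", k=5))
```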
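The quantization step referenced in the configuration notes is a short operation on an already-trained model; the `cutoff` and `retrain` settings here are illustrative, not tuned values:

```python
import fasttext

model = fasttext.train_supervised(input="train.txt")  # hypothetical file

# Product quantization: shrinks the model (often around 10x smaller)
# at a small accuracy cost; retrain=True fine-tunes after the cutoff
model.quantize(input="train.txt", qnorm=True, retrain=True, cutoff=100000)
model.save_model("model.ftz")  # quantized models conventionally use .ftz
```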
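Finally, a sketch of the autotune feature from the list above: given a held-out validation file, fastText searches hyperparameters for a fixed time budget (the 300-second budget below is arbitrary):

```python
import fasttext

# Search lr, epoch, wordNgrams, etc. against a validation set
model = fasttext.train_supervised(
    input="train.txt",                    # hypothetical training file
    autotuneValidationFile="valid.txt",   # hypothetical validation file
    autotuneDuration=300,                 # search budget in seconds
)
print(model.test("valid.txt"))  # (N, precision@1, recall@1)
```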
## Comparison with Similar Tools

- **Word2Vec** — pioneered word embeddings but lacks subword information; fastText handles OOV words naturally
- **GloVe** — global co-occurrence matrix approach; fastText is faster to train and supports subword units
- **spaCy** — full NLP pipeline with built-in vectors; fastText focuses purely on embeddings and classification
- **Sentence Transformers** — produces contextual sentence embeddings via Transformers; fastText is simpler and faster
- **scikit-learn text classifiers** — flexible but slower on large datasets; fastText is optimized for scale

## FAQ

**Q: Can fastText handle languages with rich morphology?**
A: Yes. Subword n-grams capture morphological patterns, making it effective for agglutinative languages like Finnish, Turkish, and Korean.

**Q: How does fastText compare to Transformer-based embeddings?**
A: Transformer models produce contextual embeddings and generally achieve higher accuracy on benchmarks, but fastText is orders of magnitude faster and works well when compute or latency budgets are tight.

**Q: What format does the training data need?**
A: For supervised classification, each line should contain one or more labels prefixed with `__label__`, followed by the text. For unsupervised training, use plain text with one sentence per line.

**Q: Is fastText suitable for production use?**
A: Yes. The C++ core is fast and memory-efficient, quantized models can run on mobile devices, and the library has been deployed at scale inside Meta.

## Sources

- https://github.com/facebookresearch/fastText
- https://fasttext.cc