# fastText — Efficient Text Classification and Embeddings by Meta

> Library for efficient learning of word representations and text classification that trains on billions of words in minutes.

## Quick Use

```bash
# Install the Python bindings
pip install fasttext

# Train a text classifier with the C++ CLI (built from source)
./fasttext supervised -input train.txt -output model

# Predict labels for new examples
./fasttext predict model.bin test.txt
```

## Introduction

fastText is a library from Meta AI Research for efficient text classification and word representation learning. It extends the Word2Vec approach with subword information, enabling it to generate embeddings for out-of-vocabulary words and to train classifiers on large datasets in seconds rather than hours.

## What fastText Does

- Learns word vectors using subword (character n-gram) information for robust embeddings
- Trains supervised text classifiers that scale to billions of examples
- Provides pre-trained word vectors for 157 languages
- Supports both CBOW and skip-gram training objectives
- Offers quantization that compresses models roughly 10x with minimal accuracy loss

## Architecture Overview

fastText represents each word as a bag of character n-grams plus the word itself. During training, it learns embeddings for these subword units and composes word vectors by summing them. For classification, it uses a shallow neural network: a linear classifier on top of averaged word embeddings, achieving accuracy competitive with deep models at a fraction of the compute cost. The hierarchical softmax option further speeds up training on datasets with many labels. Minimal training sketches appear in the Python examples below.

## Self-Hosting & Configuration

- Install via pip, conda, or compile from source for the C++ CLI tools
- Pre-trained vectors are available for download from the fastText website
- Training parameters (learning rate, epochs, n-grams) are set via CLI flags or Python keyword arguments
- Use quantize to reduce model size for deployment on resource-constrained systems (see the quantization sketch in the examples below)
- The Python API wraps the C++ core for easy integration into data pipelines

## Key Features

- Subword embeddings handle misspellings, morphology, and rare words gracefully
- Speed: trains on more than a billion words in under ten minutes and classifies half a million sentences among hundreds of thousands of labels in under a minute on a standard multicore CPU
- Pre-trained vectors for 157 languages trained on Common Crawl and Wikipedia
- Automatic hyperparameter tuning via the autotune feature (sketched in the examples below)
- Model compression through product quantization for mobile and edge deployment
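## Python Examples

The sketches below use the Python bindings (`pip install fasttext`); file names such as `train.txt`, `valid.txt`, and `test.txt` are placeholders for your own data, and the hyperparameter values are illustrative rather than recommended defaults. First, supervised training and prediction, mirroring the Quick Use CLI commands:

```python
import fasttext

# Each line of train.txt carries a label prefixed with __label__
# followed by the example text, e.g.:
#   __label__positive I loved this movie
#   __label__negative Dull plot and flat acting

# Train a classifier; lr, epoch, and wordNgrams mirror the CLI flags
model = fasttext.train_supervised(
    input="train.txt",   # hypothetical training file
    lr=0.5,
    epoch=25,
    wordNgrams=2,        # use word bigrams as well as unigrams
    loss="hs",           # hierarchical softmax, faster with many labels
)

# Evaluate: test() returns (number of examples, precision@1, recall@1)
n, precision, recall = model.test("test.txt")
print(f"P@1 {precision:.3f}  R@1 {recall:.3f} on {n} examples")

# Predict the top 2 labels for a new sentence
labels, probs = model.predict("A gripping, well paced thriller", k=2)
print(labels, probs)

model.save_model("model.bin")
```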
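Word-vector training follows the same pattern. The sketch below shows how subword n-grams produce a vector even for a misspelled, out-of-vocabulary word; `data.txt` is a placeholder corpus with one sentence per line:

```python
import fasttext

# Train skip-gram vectors; model="cbow" selects the other objective
model = fasttext.train_unsupervised(
    "data.txt",        # hypothetical corpus, one sentence per line
    model="skipgram",
    dim=100,
    minn=3,            # shortest character n-gram used for subwords
    maxn=6,            # longest character n-gram
)

# A vector exists even for out-of-vocabulary or misspelled words,
# because it is composed from the word's character n-grams
vec = model.get_word_vector("accomodation")  # deliberate misspelling
print(vec.shape)  # (100,)

# Nearest neighbours by similarity in the learned space
print(model.get_nearest_neighbors("king", k=5))
```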
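The quantization step referenced in the configuration notes is a short operation on an already-trained model; the `cutoff` and `retrain` settings here are illustrative, not tuned values:

```python
import fasttext

model = fasttext.train_supervised(input="train.txt")  # hypothetical file

# Product quantization: shrinks the model (often around 10x smaller)
# at a small accuracy cost; retrain=True fine-tunes after the cutoff
model.quantize(input="train.txt", qnorm=True, retrain=True, cutoff=100000)
model.save_model("model.ftz")  # quantized models conventionally use .ftz
```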
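Finally, a sketch of the autotune feature from the list above: given a held-out validation file, fastText searches hyperparameters for a fixed time budget (the 300-second budget below is arbitrary):

```python
import fasttext

# Search lr, epoch, wordNgrams, etc. against a validation set
model = fasttext.train_supervised(
    input="train.txt",                    # hypothetical training file
    autotuneValidationFile="valid.txt",   # hypothetical validation file
    autotuneDuration=300,                 # search budget in seconds
)
print(model.test("valid.txt"))  # (N, precision@1, recall@1)
```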
## Comparison with Similar Tools

- **Word2Vec** — pioneered word embeddings but lacks subword information; fastText handles OOV words naturally
- **GloVe** — global co-occurrence matrix approach; fastText is faster to train and supports subword units
- **spaCy** — full NLP pipeline with built-in vectors; fastText focuses purely on embeddings and classification
- **Sentence Transformers** — produces contextual sentence embeddings via Transformers; fastText is simpler and faster
- **scikit-learn text classifiers** — flexible but slower on large datasets; fastText is optimized for scale

## FAQ

**Q: Can fastText handle languages with rich morphology?**
A: Yes. Subword n-grams capture morphological patterns, making it effective for agglutinative languages like Finnish, Turkish, and Korean.

**Q: How does fastText compare to Transformer-based embeddings?**
A: Transformer models produce contextual embeddings and generally achieve higher accuracy on benchmarks, but fastText is orders of magnitude faster and works well when compute or latency budgets are tight.

**Q: What format does the training data need?**
A: For supervised classification, each line should contain one or more labels prefixed with `__label__`, followed by the text. For unsupervised training, use plain text with one sentence per line.

**Q: Is fastText suitable for production use?**
A: Yes. The C++ core is fast and memory-efficient, quantized models can run on mobile devices, and the library has been deployed at scale inside Meta.

## Sources

- https://github.com/facebookresearch/fastText
- https://fasttext.cc