Introduction
Spark NLP is a natural language processing library built natively on Apache Spark and Spark ML. It runs NLP pipelines (tokenization, named entity recognition (NER), sentiment analysis, text classification, and transformer inference) distributed across a cluster, so it can handle datasets that are too large for single-machine NLP libraries to process efficiently.
What Spark NLP Does
- Provides 50+ NLP annotators including tokenizer, stemmer, lemmatizer, NER, and POS tagger
- Runs BERT, RoBERTa, DeBERTa, and other transformer models inside Spark pipelines
- Scales NLP processing across Spark clusters for terabyte-scale text corpora
- Supports ONNX model import for running custom-trained models at scale
- Offers pre-trained pipelines and models for 200+ languages
Architecture Overview
Spark NLP annotators extend Spark ML's Estimator and Transformer interfaces, making them composable in standard Spark ML pipelines. Each annotator reads one or more annotation columns and writes a new one, so the output column of one stage becomes the input column of the next. Annotators that wrap transformer language models load ONNX or TensorFlow SavedModel weights and run inference inside the JVM, with batches distributed across Spark executors. Because annotation runs in the JVM, the library does not rely on Python UDFs, which avoids moving rows between the JVM and Python workers.
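As a rough illustration of the column contract described above, the following Python sketch chains three annotators so that each stage consumes the column produced by the previous one. STAGE_COLUMNS and missing_inputs are hypothetical helpers written for this example and are not part of Spark NLP. build_pipeline and run_example assume that pyspark and spark-nlp are installed and that a session can be started locally.

```python
from typing import List, Tuple

# (output column, input columns) per stage, in pipeline order.
# Hypothetical bookkeeping for this sketch; Spark NLP does not expose this list.
STAGE_COLUMNS: List[Tuple[str, List[str]]] = [
    ("document", ["text"]),      # DocumentAssembler
    ("sentence", ["document"]),  # SentenceDetector
    ("token", ["sentence"]),     # Tokenizer
]


def missing_inputs(stage_columns, source_columns):
    """Return (output, input) pairs whose input column is not yet available."""
    available = set(source_columns)
    missing = []
    for output_col, input_cols in stage_columns:
        missing.extend((output_col, c) for c in input_cols if c not in available)
        available.add(output_col)
    return missing


def build_pipeline():
    """Build the unfitted Spark ML pipeline wired as in STAGE_COLUMNS.

    Assumes pyspark and spark-nlp are installed and a session is already active.
    """
    # Imported here so the wiring check above works without Spark installed.
    from pyspark.ml import Pipeline
    from sparknlp.annotator import SentenceDetector, Tokenizer
    from sparknlp.base import DocumentAssembler

    document = DocumentAssembler().setInputCol("text").setOutputCol("document")
    sentence = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
    token = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
    return Pipeline(stages=[document, sentence, token])


def run_example():
    """Fit and apply the pipeline to a one-row DataFrame in local mode."""
    import sparknlp

    spark = sparknlp.start()
    df = spark.createDataFrame([("Spark NLP runs on Spark. It scales out.",)], ["text"])
    model = build_pipeline().fit(df)
    model.transform(df).select("token.result").show(truncate=False)
```

Calling run_example() should print a single row whose token.result column holds the token strings. The same Pipeline object accepts further annotators as additional stages, provided their input columns are produced earlier in the list.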
Self-Hosting & Configuration
- Install the spark-nlp package with pip and start a Spark session with sparknlp.start()
- Alternatively, add the Maven package to an existing Spark cluster configuration
- Download pre-trained models from the John Snow Labs model hub
- Enable GPU inference by pointing spark.jars.packages at the GPU build of the package, or by calling sparknlp.start(gpu=True)
- Tune batch sizes and partition counts for optimal throughput on your cluster
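The Maven route listed above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: spark_nlp_package and start_session are hypothetical helper names, the Scala suffix 2.12 and the Kryo settings are assumed values that may need adjusting for your cluster, and the version string is left as a parameter because it should match the installed spark-nlp Python package.

```python
def spark_nlp_package(version: str, gpu: bool = False, scala: str = "2.12") -> str:
    """Maven coordinate for spark.jars.packages (hypothetical helper).

    The GPU build is published under the spark-nlp-gpu artifact.
    """
    artifact = "spark-nlp-gpu" if gpu else "spark-nlp"
    return f"com.johnsnowlabs.nlp:{artifact}_{scala}:{version}"


def start_session(version: str, gpu: bool = False):
    """Build a local session by hand instead of calling sparknlp.start().

    The serializer and buffer values are assumptions for this sketch,
    not tuned recommendations.
    """
    from pyspark.sql import SparkSession  # imported lazily; requires pyspark

    return (
        SparkSession.builder.appName("spark-nlp-sketch")
        .master("local[*]")
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .config("spark.kryoserializer.buffer.max", "2000M")
        .config("spark.jars.packages", spark_nlp_package(version, gpu))
        .getOrCreate()
    )
```

One way to keep the jar and the Python package in step is to pass sparknlp.version() as the version argument. For throughput tuning, transformer annotators expose setBatchSize, and DataFrame.repartition controls how many partitions are spread across executors; suitable values depend on the cluster and model.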
Key Features
- Native Spark ML integration means annotators run in the JVM, so annotation does not serialize data between Python and the JVM
- Pre-trained models cover 200+ languages, with additional models for clinical and legal text
- ONNX Runtime support allows importing models trained in PyTorch or TensorFlow once they are exported to ONNX
- Healthcare and legal NLP editions provide domain-specific entity recognition
- Runs on Databricks, EMR, Dataproc, and any standard Spark environment with the same pipeline code
Comparison with Similar Tools
- spaCy — single-machine NLP with fast inference; Spark NLP distributes across clusters
- Hugging Face Transformers — Python-native transformer library; Spark NLP runs transformers inside Spark pipelines
- Stanza — Stanford's neural NLP library; Spark NLP scales to distributed datasets
- Flair — PyTorch NLP framework; Spark NLP provides native Spark integration without Python UDFs
- NLTK — educational NLP toolkit; Spark NLP is production-focused with distributed computing support
FAQ
Q: Does Spark NLP require a Spark cluster? A: No. It works in local mode with sparknlp.start() for development, and scales to clusters for production workloads.
Q: Can I use GPU acceleration? A: Yes. Use the GPU build of the package (for example, sparknlp.start(gpu=True)) and make GPU resources available to Spark for transformer inference.
Q: How many languages are supported? A: Over 200 languages with pre-trained models, including specialized models for healthcare and legal text.
Q: Is it compatible with Databricks? A: Yes. Spark NLP runs on Databricks, AWS EMR, Google Dataproc, and any standard Spark environment.