# Spark NLP: Scalable Natural Language Processing for Apache Spark

> A production-grade NLP library built on Apache Spark that provides tokenization, NER, classification, and transformer-based inference at cluster scale.

## Quick Use

```bash
# PySpark is installed alongside, since Spark NLP needs it at runtime
pip install spark-nlp pyspark
```

```python
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

# Start a Spark session with the Spark NLP package attached
spark = sparknlp.start()

# Download the pre-trained English NER pipeline on first use
pipeline = PretrainedPipeline("recognize_entities_dl", lang="en")

# annotate() runs the pipeline on a single string and returns a dict of lists
result = pipeline.annotate("Google was founded by Larry Page in California.")
print(result["entities"])  # ['Google', 'Larry Page', 'California']
```

## Introduction

Spark NLP is a natural language processing library built natively on Apache Spark and Spark ML. It lets NLP pipelines (tokenization, NER, sentiment analysis, text classification, and transformer inference) run distributed across a cluster, handling datasets that single-machine NLP libraries cannot process efficiently.

## What Spark NLP Does

- Provides 50+ NLP annotators including tokenizer, stemmer, lemmatizer, NER, and POS tagger
- Runs BERT, RoBERTa, DeBERTa, and other transformer models inside Spark pipelines
- Scales NLP processing across Spark clusters for terabyte-scale text corpora
- Supports ONNX model import for running custom-trained models at scale
- Offers pre-trained pipelines and models for 200+ languages

## Architecture Overview

Spark NLP annotators extend Spark ML's Estimator and Transformer interfaces, making them composable in standard Spark ML pipelines. Each annotator reads one or more annotation columns and produces a new one. Transformer-based annotators load ONNX or TensorFlow SavedModel weights and run inference using a JVM-native runtime, distributing batches across Spark executors. The library avoids Python UDFs to maintain native Spark performance.

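The column-in, column-out design described above can be illustrated with a toy model in plain Python. This is a minimal sketch of the idea only, not Spark NLP's API: `Annotation`, `Stage`, `assemble_document`, `tokenize`, and `run_pipeline` are hypothetical names invented for this example, and the regex tokenizer is a stand-in for a real annotator.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Annotation:
    """Toy annotation: a typed text span (hypothetical, not the library class)."""
    annotator_type: str
    begin: int   # start offset in the original text
    end: int     # inclusive end offset (an assumption of this sketch)
    result: str


@dataclass
class Stage:
    """Toy pipeline stage: reads named input columns, writes one output column."""
    input_cols: List[str]
    output_col: str
    fn: Callable

    def transform(self, row: Dict) -> Dict:
        inputs = [row[col] for col in self.input_cols]
        # Existing columns are preserved; the stage only adds its own output.
        return {**row, self.output_col: self.fn(*inputs)}


def assemble_document(text: str) -> List[Annotation]:
    """Wrap raw text in a single 'document' annotation."""
    return [Annotation("document", 0, len(text) - 1, text)]


def tokenize(documents: List[Annotation]) -> List[Annotation]:
    """Split each document into word and punctuation tokens with offsets."""
    tokens = []
    for doc in documents:
        for m in re.finditer(r"\w+|[^\w\s]", doc.result):
            tokens.append(
                Annotation("token", doc.begin + m.start(),
                           doc.begin + m.end() - 1, m.group())
            )
    return tokens


def run_pipeline(stages: List[Stage], row: Dict) -> Dict:
    """Apply stages in order; each one consumes columns produced earlier."""
    for stage in stages:
        row = stage.transform(row)
    return row


stages = [
    Stage(["text"], "document", assemble_document),
    Stage(["document"], "token", tokenize),
]
out = run_pipeline(stages, {"text": "Google was founded by Larry Page."})
tokens = [a.result for a in out["token"]]
print(tokens)
```

Because every stage declares which columns it reads and which one it writes, stages can be reordered or swapped as long as their inputs exist. That is the property that makes annotators composable in a pipeline.
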
## Self-Hosting & Configuration

- Install via pip and start a Spark session with `sparknlp.start()`
- Alternatively, add the Maven package to an existing Spark cluster configuration
- Download pre-trained models from the John Snow Labs model hub
- Configure GPU inference by setting `spark.jars.packages` to the GPU variant of the package
- Tune batch sizes and partition counts for optimal throughput on your cluster

## Key Features

- Native Spark ML integration means no data serialization between Python and the JVM
- Pre-trained models cover 200+ languages, including clinical and legal domains
- ONNX runtime support enables importing models trained in PyTorch or TensorFlow
- Healthcare and legal NLP editions provide domain-specific entity recognition
- Runs on Databricks, EMR, Dataproc, and any Spark environment without modification

## Comparison with Similar Tools

- **spaCy**: single-machine NLP with fast inference; Spark NLP distributes across clusters
- **Hugging Face Transformers**: Python-native transformer library; Spark NLP runs transformers inside Spark pipelines
- **Stanza**: Stanford's neural NLP library; Spark NLP scales to distributed datasets
- **Flair**: PyTorch NLP framework; Spark NLP provides native Spark integration without Python UDFs
- **NLTK**: educational NLP toolkit; Spark NLP is production-focused with distributed computing support

## FAQ

**Q: Does Spark NLP require a Spark cluster?**
A: No. It works in local mode with `sparknlp.start()` for development, and scales to clusters for production workloads.

**Q: Can I use GPU acceleration?**
A: Yes. Install the GPU variant and configure Spark to use GPU resources for transformer inference.

**Q: How many languages are supported?**
A: Over 200 languages with pre-trained models, including specialized models for healthcare and legal text.

**Q: Is it compatible with Databricks?**
A: Yes. Spark NLP runs on Databricks, AWS EMR, Google Dataproc, and any standard Spark environment.

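On the tuning advice under Self-Hosting & Configuration: a starting partition count can be estimated with simple arithmetic before measuring on a real cluster. The helper below is a hypothetical sketch, not part of Spark NLP. `suggest_partitions` and its parameters are names invented here. The default of 3 tasks per core follows the Spark tuning guide's general recommendation of 2 to 3 tasks per CPU core, and the 50,000-row cap per partition is an arbitrary assumption to adjust for document length and model size.

```python
import math


def suggest_partitions(
    total_rows: int,
    executors: int,
    cores_per_executor: int,
    tasks_per_core: int = 3,              # assumption: upper end of the 2-3 rule of thumb
    max_rows_per_partition: int = 50_000,  # assumption: arbitrary cap, tune per workload
) -> int:
    """Return a starting partition count (hypothetical heuristic)."""
    values = (total_rows, executors, cores_per_executor,
              tasks_per_core, max_rows_per_partition)
    if min(values) <= 0:
        raise ValueError("all arguments must be positive")

    # Enough partitions to keep every core busy with a few tasks each.
    by_parallelism = executors * cores_per_executor * tasks_per_core
    # Enough partitions that none exceeds the row cap.
    by_size = math.ceil(total_rows / max_rows_per_partition)
    return max(by_parallelism, by_size)


large = suggest_partitions(10_000_000, executors=8, cores_per_executor=4)
small = suggest_partitions(1_000_000, executors=8, cores_per_executor=4)
print(large, small)  # 200 96
```

The result would typically be passed to `DataFrame.repartition(n)` before the pipeline's `transform` call. Treat it as a first guess and revise it against observed task durations.
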
## Sources

- https://github.com/JohnSnowLabs/spark-nlp
- https://sparknlp.org/

---

Source: https://tokrepo.com/en/workflows/asset-f5ccd2c7
Author: AI Open Source