# Spark NLP — Scalable Natural Language Processing for Apache Spark

> A production-grade NLP library built on Apache Spark that provides tokenization, NER, classification, and transformer-based inference at cluster scale.

## Install

```bash
pip install spark-nlp
```

## Quick Use
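```python
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

# Start (or reuse) a local Spark session with Spark NLP available
spark = sparknlp.start()

# Downloads the pre-trained English NER pipeline on first use
pipeline = PretrainedPipeline("recognize_entities_dl", lang="en")

# annotate() runs the pipeline on a single string and returns a dict of lists
result = pipeline.annotate("Google was founded by Larry Page in California.")
print(result["entities"])
# Expected output along the lines of: ['Google', 'Larry Page', 'California']
```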

## Introduction
Spark NLP is a natural language processing library built natively on Apache Spark and Spark ML. It runs NLP pipelines (tokenization, NER, sentiment analysis, text classification, and transformer inference) in parallel across a cluster, so it can handle datasets too large to process efficiently on a single machine.

## What Spark NLP Does
- Provides 50+ NLP annotators including tokenizer, stemmer, lemmatizer, NER, and POS tagger
- Runs BERT, RoBERTa, DeBERTa, and other transformer models inside Spark pipelines
- Scales NLP processing across Spark clusters for terabyte-scale text corpora
- Supports ONNX model import for running custom-trained models at scale
- Offers pre-trained pipelines and models for 200+ languages

## Architecture Overview
Spark NLP annotators extend Spark ML's Estimator and Transformer interfaces, so they compose in standard Spark ML pipelines. Each annotator reads one or more annotation columns and writes a new one. Transformer-based annotators load ONNX or TensorFlow SavedModel weights and run inference using a JVM-native runtime, distributing batches across Spark executors. Because inference runs in the JVM rather than in Python UDFs, data stays on Spark's native execution path.
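
The column wiring described above can be sketched as follows. This is a minimal illustration, not a recommended production pipeline: the stage choice (`DocumentAssembler`, `SentenceDetector`, `Tokenizer`) and the column names are assumptions made for the example, and `WIRING`, `missing_inputs`, `build_pipeline`, and `run_example` are hypothetical names that are not part of Spark NLP.

```python
# Column wiring for the example pipeline: (stage, input columns, output column).
# The stage selection and column names are illustrative assumptions.
WIRING = [
    ("DocumentAssembler", ["text"], "document"),
    ("SentenceDetector", ["document"], "sentence"),
    ("Tokenizer", ["sentence"], "token"),
]


def missing_inputs(wiring, source_columns):
    """Hypothetical helper: list (stage, column) pairs whose input is not yet produced."""
    available = set(source_columns)
    problems = []
    for stage, inputs, output in wiring:
        problems += [(stage, col) for col in inputs if col not in available]
        available.add(output)  # later stages may read this column
    return problems


def build_pipeline():
    """Build the pipeline described by WIRING (needs pyspark, spark-nlp, and an active session)."""
    from pyspark.ml import Pipeline
    from sparknlp.annotator import SentenceDetector, Tokenizer
    from sparknlp.base import DocumentAssembler

    document = DocumentAssembler().setInputCol("text").setOutputCol("document")
    sentence = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
    token = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
    return Pipeline(stages=[document, sentence, token])


def run_example():
    """Start a local session, fit the pipeline, and show the tokens for one sentence."""
    import sparknlp

    spark = sparknlp.start()
    df = spark.createDataFrame(
        [("Google was founded by Larry Page in California.",)], ["text"]
    )
    model = build_pipeline().fit(df)
    model.transform(df).selectExpr("token.result AS tokens").show(truncate=False)
```

Calling `run_example()` on a machine with `pyspark` and `spark-nlp` installed should start a local session and print one row of tokens. `missing_inputs(WIRING, ["text"])` returns an empty list when every stage reads a column that an earlier stage (or the source DataFrame) provides.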

## Self-Hosting & Configuration
- Install via pip and start a Spark session with `sparknlp.start()`
- Alternatively, add the Maven package to an existing Spark cluster configuration
- Download pre-trained models from the John Snow Labs model hub
- Enable GPU inference with `sparknlp.start(gpu=True)`, or by pointing `spark.jars.packages` at the GPU variant of the package
- Tune batch sizes and partition counts for optimal throughput on your cluster
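
For an existing Spark setup, the Maven-based configuration above might look roughly like the sketch below. The helper names `spark_nlp_conf` and `start_session` are hypothetical, the Scala suffix `2.12` is an assumption that must match your Spark build, and the Kryo and result-size settings mirror values commonly shown in the Spark NLP installation notes; check them against the documentation for the version you install.

```python
def spark_nlp_conf(version, gpu=False, scala="2.12"):
    """Hypothetical helper: Spark settings that pull Spark NLP from Maven Central."""
    # The GPU build is published as a separate artifact next to the CPU one.
    artifact = "spark-nlp-gpu" if gpu else "spark-nlp"
    return {
        "spark.jars.packages": f"com.johnsnowlabs.nlp:{artifact}_{scala}:{version}",
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.kryoserializer.buffer.max": "2000M",
        "spark.driver.maxResultSize": "0",
    }


def start_session(gpu=False, master="local[*]"):
    """Create a SparkSession with Spark NLP on the classpath (needs pyspark and spark-nlp)."""
    import sparknlp
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("spark-nlp-example").master(master)
    # Use the installed Python package version so the JAR and Python sides match.
    for key, value in spark_nlp_conf(sparknlp.version(), gpu=gpu).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

`start_session(gpu=True)` is intended as a more explicit equivalent of `sparknlp.start(gpu=True)`. For throughput tuning, two common starting points are `df.repartition(n)` before `transform` and the `setBatchSize` setter on transformer annotators; suitable values depend on the cluster and the model.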

## Key Features
- Native Spark ML integration keeps annotation inside the JVM, so rows are not serialized to Python workers
- Pre-trained models cover 200+ languages, with additional domain-specific models for clinical and legal text
- ONNX runtime support enables importing models trained in PyTorch or TensorFlow
- Healthcare and legal NLP editions provide domain-specific entity recognition
- Runs on Databricks, EMR, Dataproc, and any Spark environment without modification

## Comparison with Similar Tools
- **spaCy** — single-machine NLP with fast inference; Spark NLP distributes across clusters
- **Hugging Face Transformers** — Python-native transformer library; Spark NLP runs transformers inside Spark pipelines
- **Stanza** — Stanford's neural NLP library; Spark NLP scales to distributed datasets
- **Flair** — PyTorch NLP framework; Spark NLP provides native Spark integration without Python UDFs
- **NLTK** — educational NLP toolkit; Spark NLP is production-focused with distributed computing support

## FAQ
**Q: Does Spark NLP require a Spark cluster?**
A: No. It works in local mode with `sparknlp.start()` for development and scales to clusters for production workloads.

**Q: Can I use GPU acceleration?**
A: Yes. Use the GPU variant of the package (for example via `sparknlp.start(gpu=True)`) and configure Spark to use GPU resources for transformer inference.

**Q: How many languages are supported?**
A: Pre-trained models are available for over 200 languages, alongside specialized models for healthcare and legal text.

**Q: Is it compatible with Databricks?**
A: Yes. Spark NLP runs on Databricks, AWS EMR, Google Dataproc, and any standard Spark environment.

## Sources
- https://github.com/JohnSnowLabs/spark-nlp
- https://sparknlp.org/

---
Source: https://tokrepo.com/en/workflows/asset-f5ccd2c7
Author: AI Open Source