Configs · May 31, 2026 · 3 min read

Spark NLP — Scalable Natural Language Processing for Apache Spark

A production-grade NLP library built on Apache Spark that provides tokenization, NER, classification, and transformer-based inference at cluster scale.

Agent ready

Agent-ready installation

This asset can be installed after you choose the runtime, review the plan, and run the corresponding command.

Native · 98/100 · Policy: allow
Agent surface: Any MCP/CLI agent
Type: Skill
Installation: Single
Trust: Established
Entry: Spark NLP

Direct install command:
npx -y tokrepo@latest install f5ccd2c7-5cea-11f1-9bc6-00163e2b0d79 --target codex

Run it after confirming the plan with a dry run.

Introduction

Spark NLP is a natural language processing library built natively on Apache Spark and Spark ML. It enables NLP pipelines — tokenization, NER, sentiment analysis, text classification, and transformer inference — to run distributed across a cluster, handling datasets that single-machine NLP libraries cannot process efficiently.

What Spark NLP Does

  • Provides 50+ NLP annotators including tokenizer, stemmer, lemmatizer, NER, and POS tagger
  • Runs BERT, RoBERTa, DeBERTa, and other transformer models inside Spark pipelines
  • Scales NLP processing across Spark clusters for terabyte-scale text corpora
  • Supports ONNX model import for running custom-trained models at scale
  • Offers pre-trained pipelines and models for 200+ languages
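
As a hedged illustration of the pre-trained pipelines listed above, the sketch below downloads one and annotates a single string. The pipeline name explain_document_dl and the helper names pick_fields and annotate_text are assumptions chosen for this example; check the model hub for the pipeline that fits your task. The Spark NLP imports are deferred into the function so the small helper can be loaded without Spark installed.

```python
def pick_fields(annotations, fields):
    """Keep only the requested output columns from an annotate() result.

    Missing columns map to an empty list so callers can iterate safely.
    """
    return {name: list(annotations.get(name, [])) for name in fields}


def annotate_text(text, pipeline_name="explain_document_dl", lang="en"):
    """Annotate one string with a pre-trained pipeline.

    Requires spark-nlp and pyspark; the first call downloads the pipeline.
    The default pipeline name is an assumption for illustration.
    """
    import sparknlp
    from sparknlp.pretrained import PretrainedPipeline

    sparknlp.start()  # local Spark session with Spark NLP on the classpath
    pipeline = PretrainedPipeline(pipeline_name, lang=lang)
    # annotate() returns a dict mapping output column names to lists of strings
    return pipeline.annotate(text)
```

On a machine with Spark NLP installed, pick_fields(annotate_text("Spark NLP runs on Apache Spark."), ["token", "pos"]) would keep just those two columns. Which column names are available depends on the pipeline you load.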

Architecture Overview

Spark NLP annotators extend Spark ML's Estimator and Transformer interfaces, so they compose in standard Spark ML pipelines. Each annotator reads one or more annotation columns and writes a new one. Transformer-based annotators load ONNX or TensorFlow SavedModel weights and run inference inside the JVM, distributing batches across Spark executors. Because inference does not go through Python UDFs, rows are not shipped to Python workers, which keeps throughput close to native Spark.
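
The column-in, column-out design can be sketched as follows. This is a minimal two-stage example under stated assumptions: WIRING, check_wiring, build_pipeline, and run_demo are hypothetical names introduced here, while DocumentAssembler, Tokenizer, and Pipeline are the library classes. The Spark imports sit inside the functions so the wiring check runs without Spark installed.

```python
# Column wiring for a two-stage pipeline: (stage, input columns, output column).
WIRING = [
    ("DocumentAssembler", ["text"], "document"),
    ("Tokenizer", ["document"], "token"),
]


def check_wiring(wiring, source_columns):
    """Verify that each stage only reads columns produced before it runs."""
    available = set(source_columns)
    for stage, inputs, output in wiring:
        missing = [col for col in inputs if col not in available]
        if missing:
            raise ValueError(f"{stage} is missing input columns: {missing}")
        available.add(output)
    return sorted(available)


def build_pipeline():
    """Build the same wiring as a Spark ML Pipeline (needs spark-nlp, pyspark)."""
    from pyspark.ml import Pipeline
    from sparknlp.annotator import Tokenizer
    from sparknlp.base import DocumentAssembler

    document = DocumentAssembler().setInputCol("text").setOutputCol("document")
    tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
    return Pipeline(stages=[document, tokenizer])


def run_demo():
    """Fit and apply the pipeline to a one-row DataFrame in local mode."""
    import sparknlp

    spark = sparknlp.start()
    data = spark.createDataFrame([["Spark NLP runs on Apache Spark."]]).toDF("text")
    result = build_pipeline().fit(data).transform(data)
    result.select("token.result").show(truncate=False)
```

Calling run_demo() on a machine with Spark NLP installed fits the pipeline and prints the token column. Adding a stage means adding another annotator whose input columns name the outputs of earlier stages.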

Self-Hosting & Configuration

  • Install with pip (pip install spark-nlp pyspark) and start a Spark session with sparknlp.start()
  • Alternatively add the Maven package to an existing Spark cluster configuration
  • Download pre-trained models from the John Snow Labs model hub
  • Configure GPU inference by pointing spark.jars.packages at the GPU artifact (spark-nlp-gpu), or by calling sparknlp.start(gpu=True)
  • Tune batch sizes and partition counts for optimal throughput on your cluster
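
For the Maven route, a session might be configured roughly as below. This is a sketch, not the only supported setup: maven_coordinate and build_session are hypothetical helper names, the Scala suffix 2.12 is an assumed default, and the version is left to the caller because it should match the spark-nlp Python package you have installed.

```python
def maven_coordinate(version, gpu=False, scala="2.12"):
    """Return the Maven coordinate for the CPU or GPU Spark NLP artifact."""
    artifact = "spark-nlp-gpu" if gpu else "spark-nlp"
    return f"com.johnsnowlabs.nlp:{artifact}_{scala}:{version}"


def build_session(version, gpu=False, master="local[*]"):
    """Create a SparkSession that pulls Spark NLP from Maven (needs pyspark)."""
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName("spark-nlp-example")
        .master(master)  # assumed local default; pass your cluster's master URL
        .config("spark.jars.packages", maven_coordinate(version, gpu=gpu))
        .getOrCreate()
    )
```

For local development, sparknlp.start() and sparknlp.start(gpu=True) set up an equivalent session with less code. The explicit builder is mainly useful when the cluster configuration is managed separately.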

Key Features

  • Native Spark ML integration means annotators run in the JVM, so rows are not serialized to Python workers
  • Pre-trained models cover 200+ languages, with specialized models for clinical and legal text
  • ONNX Runtime support enables importing models trained in PyTorch or TensorFlow and exported to ONNX
  • Healthcare and legal NLP editions provide domain-specific entity recognition
  • Runs on Databricks, EMR, Dataproc, and any Spark environment without modification
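
Importing a custom model might look like the sketch below, which uses BertEmbeddings as one example annotator. The helper names require_export_dir and import_bert_embeddings are hypothetical, both paths are placeholders supplied by the caller, and the export folder is assumed to already hold a model in a layout Spark NLP accepts.

```python
from pathlib import Path


def require_export_dir(path):
    """Return the export folder as a Path, or fail early if it is not a directory."""
    folder = Path(path)
    if not folder.is_dir():
        raise FileNotFoundError(f"Model export folder not found: {folder}")
    return folder


def import_bert_embeddings(export_dir, save_dir):
    """Load an exported BERT model and save it as a Spark NLP model.

    Requires spark-nlp and pyspark. Both paths are placeholders chosen by
    the caller, not locations defined by the library.
    """
    import sparknlp
    from sparknlp.annotator import BertEmbeddings

    folder = require_export_dir(export_dir)
    spark = sparknlp.start()
    embeddings = (
        BertEmbeddings.loadSavedModel(str(folder), spark)
        .setInputCols(["document", "token"])
        .setOutputCol("embeddings")
    )
    # Persist in Spark NLP's own format so later jobs can reload it
    embeddings.write().overwrite().save(str(save_dir))
    return embeddings
```

Saving once and reloading with BertEmbeddings.load(save_dir) in later jobs avoids repeating the import step on every run.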

Comparison with Similar Tools

  • spaCy — single-machine NLP with fast inference; Spark NLP distributes across clusters
  • Hugging Face Transformers — Python-native transformer library; Spark NLP runs transformers inside Spark pipelines
  • Stanza — Stanford's neural NLP library; Spark NLP scales to distributed datasets
  • Flair — PyTorch NLP framework; Spark NLP provides native Spark integration without Python UDFs
  • NLTK — educational NLP toolkit; Spark NLP is production-focused with distributed computing support

FAQ

Q: Does Spark NLP require a Spark cluster? A: No. It works in local mode with sparknlp.start() for development, and scales to clusters for production workloads.

Q: Can I use GPU acceleration? A: Yes. Install the GPU variant and configure Spark to use GPU resources for transformer inference.

Q: How many languages are supported? A: Over 200 languages with pre-trained models, including specialized models for healthcare and legal text.

Q: Is it compatible with Databricks? A: Yes. Spark NLP runs on Databricks, AWS EMR, Google Dataproc, and any standard Spark environment.
