# Stanza — Stanford NLP Library for 70+ Human Languages

> A Python NLP library from Stanford providing tokenization, POS tagging, NER, dependency parsing, and lemmatization for over 70 languages.

## Quick Use
```bash
pip install stanza
```
```python
import stanza
stanza.download("en")
nlp = stanza.Pipeline("en")
doc = nlp("Barack Obama was born in Hawaii.")
for ent in doc.ents:
    print(ent.text, ent.type)
# Barack Obama PERSON
# Hawaii GPE
```

## Introduction
Stanza is the official Python NLP library from the Stanford NLP Group. It provides neural network models for tokenization, multi-word token expansion, lemmatization, part-of-speech tagging, morphological feature tagging, dependency parsing, and named entity recognition across more than 70 languages.

## What Stanza Does
- Tokenizes and segments text into sentences for over 70 languages
- Performs part-of-speech tagging and morphological feature analysis
- Parses syntactic dependency trees following Universal Dependencies standards
- Recognizes named entities (persons, locations, organizations) in multiple languages
- Provides a Python interface to Stanford CoreNLP's Java-based tools
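
The annotations listed above are attached to the `Document` object the pipeline returns. The following is a minimal sketch of reading them, assuming the default English models; `word_rows` and `demo` are illustrative helper names, not part of Stanza.

```python
from typing import List, Tuple

# One row per syntactic word: (text, lemma, upos, head, deprel)
Row = Tuple[str, str, str, int, str]


def word_rows(doc) -> List[Row]:
    """Flatten a Stanza Document into one row per syntactic word."""
    rows = []
    for sentence in doc.sentences:
        for word in sentence.words:
            # head is the 1-based index of the governing word; 0 marks the root
            rows.append((word.text, word.lemma, word.upos, word.head, word.deprel))
    return rows


def demo() -> None:
    # Imported here so word_rows stays usable without Stanza installed.
    import stanza

    stanza.download("en")        # fetches the English models on first call
    nlp = stanza.Pipeline("en")  # default English processors
    doc = nlp("Barack Obama was born in Hawaii.")
    for row in word_rows(doc):
        print(*row, sep="\t")
```

Calling `demo()` prints one tab-separated line per word. The first run needs network access to download the models.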

## Architecture Overview
Stanza's pipeline processes text through sequential neural network modules. The tokenizer treats segmentation as a character-level tagging task, using a bi-LSTM to predict token and sentence boundaries. Downstream components (POS tagger, lemmatizer, dependency parser, NER) each apply a task-specific architecture, mostly bi-LSTM based, with transformer-backed variants available for some models. The tokenization, tagging, lemmatization, and parsing models are trained on Universal Dependencies treebanks, which keeps annotations consistent across languages; NER models are trained on separate entity-annotated corpora. An optional CoreNLP client wraps the full Java Stanford NLP toolkit.
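
Because the modules run in sequence, each processor depends on the output of earlier ones. The sketch below shows one way to expand a list of requested processors into the comma-separated string the `Pipeline` constructor accepts. The `REQUIRES` table is a conservative assumption based on the processor requirements listed in the Stanza documentation, and `resolve_processors` is a hypothetical helper, so check both against your installed version.

```python
from typing import Dict, Iterable, List

# Assumed prerequisites per processor (verify against your Stanza version).
REQUIRES: Dict[str, List[str]] = {
    "tokenize": [],
    "mwt": ["tokenize"],
    "pos": ["tokenize", "mwt"],
    "lemma": ["tokenize", "mwt", "pos"],
    "depparse": ["tokenize", "mwt", "pos", "lemma"],
    "ner": ["tokenize", "mwt"],
}

# Order in which the stages run.
ORDER = ["tokenize", "mwt", "pos", "lemma", "depparse", "ner"]


def resolve_processors(requested: Iterable[str]) -> str:
    """Return requested processors plus prerequisites, in pipeline order."""
    needed = set()
    stack = list(requested)
    while stack:
        name = stack.pop()
        if name not in REQUIRES:
            raise ValueError(f"unknown processor: {name}")
        if name not in needed:
            needed.add(name)
            stack.extend(REQUIRES[name])
    return ",".join(p for p in ORDER if p in needed)
```

For example, `resolve_processors(["ner"])` returns `"tokenize,mwt,ner"`. Multi-word token models exist only for languages that have such tokens, so the `mwt` entry may need to be dropped for other languages.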

## Self-Hosting & Configuration
- Install via pip and download language models with stanza.download()
- Configure the pipeline by selecting which processors to include
- GPU acceleration is used automatically when a CUDA device is available; pass use_gpu=False to the Pipeline constructor to force CPU
- Download models once and reuse from a local cache directory
- Wrap the Java Stanford CoreNLP server for additional annotators via the CoreNLPClient
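
The configuration steps above can be sketched as follows. The `stanza_models` directory name and the helper functions `pipeline_options` and `build_pipeline` are assumptions made for illustration; `model_dir`, `dir`, `processors`, and `use_gpu` are the keyword arguments as I understand them from the Stanza documentation.

```python
from pathlib import Path
from typing import Any, Dict


def pipeline_options(model_dir: Path, processors: str, use_gpu: bool) -> Dict[str, Any]:
    """Collect keyword arguments for stanza.Pipeline in one place."""
    return {
        "lang": "en",
        "dir": str(model_dir),     # directory holding the downloaded models
        "processors": processors,  # comma-separated processor names
        "use_gpu": use_gpu,        # False forces CPU even if CUDA is present
    }


def build_pipeline(model_dir: Path = Path("stanza_models")):
    # Deferred import so pipeline_options works without Stanza installed.
    import stanza

    # Download into the chosen directory, then load from the same place.
    stanza.download("en", model_dir=str(model_dir))
    options = pipeline_options(model_dir, "tokenize,ner", use_gpu=False)
    return stanza.Pipeline(**options)
```

Keeping the options in a plain dictionary makes it easy to log or reuse the exact configuration a pipeline was built with.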

## Key Features
- Covers 70+ languages with pre-trained models from Universal Dependencies treebanks
- Reports accuracy competitive with the state of the art on many languages for POS tagging, NER, and dependency parsing
- Modular pipeline lets you enable only the processors you need
- Integrates with Stanford CoreNLP (which requires a Java installation) for sentiment, coreference, and relation extraction
- Models are compact and run efficiently on both CPU and GPU

## Comparison with Similar Tools
- **spaCy** — production-focused NLP library with fast inference; Stanza prioritizes cross-lingual coverage and accuracy
- **NLTK** — educational NLP toolkit with rule-based methods; Stanza uses modern neural models throughout
- **Flair** — NLP framework built on PyTorch embeddings; Stanza offers broader language coverage via UD models
- **Hugging Face Transformers** — general-purpose transformer models; Stanza provides ready-made linguistic annotation pipelines
- **CoreNLP** — Java-based NLP suite; Stanza is its Python successor with native neural models

## FAQ
**Q: How many languages does Stanza support?**
A: Over 70 languages with pre-trained models, covering major world languages and many under-resourced ones.

**Q: Can I train custom models?**
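A: Yes. Stanza includes training scripts for its processors. The tokenizer, tagger, lemmatizer, and parser train on CoNLL-U formatted data; NER models train on entity-annotated data in a separate format.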
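
To make the CoNLL-U format concrete, here is a small standalone reader for its ten tab-separated columns. It is a sketch for inspecting training data, not Stanza's own loader; `parse_conllu` and `SAMPLE` are names invented for this example.

```python
from typing import Dict, List

# The ten CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Parse CoNLL-U text into sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():        # a blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):  # sentence-level comment or metadata
            continue
        else:
            cols = line.split("\t")
            if len(cols) != len(FIELDS):
                raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
            current.append(dict(zip(FIELDS, cols)))
    if current:                     # file may not end with a blank line
        sentences.append(current)
    return sentences


SAMPLE = (
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\tVBP\tMood=Ind|Tense=Pres|VerbForm=Fin\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_\n"
)
```

`parse_conllu(SAMPLE)` yields one sentence with three tokens; the `head` column holds the index of each word's governor, with `0` marking the root.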

**Q: Does it require a GPU?**
A: No. All models run on CPU, though GPU acceleration significantly speeds up processing for large datasets.

**Q: How does it relate to Stanford CoreNLP?**
A: Stanza is the modern Python replacement. It includes its own neural models and optionally wraps CoreNLP's Java server for additional annotators.
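
A hedged sketch of the CoreNLP wrapper, assuming CoreNLP has been unpacked locally, Java is on the PATH, and the `CORENLP_HOME` environment variable points at the CoreNLP directory. The annotator list, timeout, and memory values follow the style of the Stanza client documentation; `annotate_with_corenlp` is a hypothetical helper.

```python
import os


def annotate_with_corenlp(text: str):
    """Send text to a locally launched CoreNLP server and return its annotation."""
    if not os.environ.get("CORENLP_HOME"):
        raise RuntimeError("Set CORENLP_HOME to the unpacked CoreNLP directory first")

    # Deferred import: only needed once the environment check has passed.
    from stanza.server import CoreNLPClient

    # The context manager starts the Java server and shuts it down on exit.
    with CoreNLPClient(
        annotators=["tokenize", "ssplit", "pos", "lemma", "ner",
                    "parse", "depparse", "coref"],
        timeout=30000,  # milliseconds
        memory="6G",
    ) as client:
        return client.annotate(text)
```

Starting the server is slow, so for batch work it is usually better to keep one client open and call `annotate` repeatedly instead of launching it per document.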

## Sources
- https://github.com/stanfordnlp/stanza
- https://stanfordnlp.github.io/stanza/

---
Source: https://tokrepo.com/en/workflows/asset-94ab44a4
Author: AI Open Source