# Stanza: Stanford NLP Library for 70+ Human Languages

> A Python NLP library from Stanford providing tokenization, POS tagging, NER, dependency parsing, and lemmatization for over 70 languages.

## Quick Use

```bash
pip install stanza
```

```python
import stanza

# Download the default English models (cached locally after the first run).
stanza.download("en")

# Build a pipeline with the default English processors.
nlp = stanza.Pipeline("en")
doc = nlp("Barack Obama was born in Hawaii.")

# Print each named entity with its type.
for ent in doc.ents:
    print(ent.text, ent.type)
# Barack Obama PERSON
# Hawaii GPE
```

## Introduction

Stanza is the official Python NLP library from the Stanford NLP Group. It provides neural network models for tokenization, multi-word token expansion, lemmatization, part-of-speech tagging, morphological feature tagging, dependency parsing, and named entity recognition across more than 70 languages.

## What Stanza Does

- Tokenizes and segments text into sentences for over 70 languages
- Performs part-of-speech tagging and morphological feature analysis
- Parses syntactic dependency trees following Universal Dependencies standards
- Recognizes named entities (persons, locations, organizations) in multiple languages
- Provides a Python interface to Stanford CoreNLP's Java-based tools

## Architecture Overview

Stanza's pipeline processes text through sequential neural network modules, each consuming the annotations produced by the ones before it. The tokenizer uses a bi-LSTM over characters to segment text into tokens and sentences. Downstream components (POS tagger, lemmatizer, dependency parser, NER) each apply task-specific bi-LSTM or transformer architectures. The tokenizer, tagger, lemmatizer, and parser are trained on Universal Dependencies treebanks, which keeps annotations consistent across languages; NER models are trained on separate named-entity corpora. An optional CoreNLP client gives access to the Java Stanford CoreNLP toolkit through its server interface.
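The sequential design described above can be sketched as follows. This is a minimal example, assuming the English models and the processor list `tokenize,pos,lemma,depparse`; the helper names `build_pipeline_kwargs`, `word_rows`, and `run_demo` are hypothetical and not part of Stanza. Depending on the Stanza version, an extra `mwt` processor may be added or expected for some languages.

```python
def build_pipeline_kwargs(
    lang="en",
    processors=("tokenize", "pos", "lemma", "depparse"),
    use_gpu=False,
):
    """Hypothetical helper: assemble keyword arguments for stanza.Pipeline."""
    return {
        "lang": lang,
        # Stanza accepts the processor list as a comma-separated string.
        "processors": ",".join(processors),
        "use_gpu": use_gpu,
    }


def word_rows(doc):
    """Hypothetical helper: flatten a parsed document into one tuple per word."""
    return [
        (word.text, word.upos, word.lemma, word.head, word.deprel)
        for sentence in doc.sentences
        for word in sentence.words
    ]


def run_demo(text="Barack Obama was born in Hawaii."):
    # Imported lazily so the helpers above work without Stanza installed.
    import stanza

    stanza.download("en")  # fetches models on first call, then reuses the cache
    nlp = stanza.Pipeline(**build_pipeline_kwargs())
    for row in word_rows(nlp(text)):
        print(*row, sep="\t")
```

Calling `run_demo()` prints one tab-separated row per word: surface form, universal POS tag, lemma, head index, and dependency relation. The head index is 1-based within the sentence, with 0 reserved for the root.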

## Self-Hosting & Configuration

- Install via pip and download language models with `stanza.download()`
- Configure the pipeline by selecting which processors to include
- Stanza uses a GPU automatically when one is available; pass `use_gpu=False` to the `Pipeline` constructor to force CPU
- Download models once and reuse them from a local cache directory
- Connect to the Java Stanford CoreNLP server for additional annotators via the `CoreNLPClient`

## Key Features

- Covers 70+ languages with pre-trained models from Universal Dependencies treebanks
- Reports competitive or state-of-the-art accuracy on many languages for POS tagging, NER, and parsing
- Modular pipeline lets you enable only the processors you need
- Integrates with Stanford CoreNLP for annotators such as coreference and relation extraction
- Models run on both CPU and GPU

## Comparison with Similar Tools

- **spaCy**: production-focused NLP library with fast inference; Stanza prioritizes cross-lingual coverage and accuracy
- **NLTK**: educational NLP toolkit built largely on classical and rule-based methods; Stanza uses neural models throughout
- **Flair**: NLP framework built on PyTorch embeddings; Stanza offers broader language coverage via UD models
- **Hugging Face Transformers**: general-purpose transformer models; Stanza provides ready-made linguistic annotation pipelines
- **CoreNLP**: Java-based NLP suite; Stanza is its Python counterpart with native neural models

## FAQ

**Q: How many languages does Stanza support?**
A: Over 70 languages with pre-trained models, covering major world languages and many under-resourced ones.

**Q: Can I train custom models?**
A: Yes. Stanza supports training on your own data, using CoNLL-U formatted treebanks for the UD-based components.

**Q: Does it require a GPU?**
A: No. All models run on CPU, though a GPU significantly speeds up processing for large datasets.

**Q: How does it relate to Stanford CoreNLP?**
A: Stanza is the modern Python counterpart.
It includes its own neural models and optionally wraps CoreNLP's Java server for additional annotators.

## Sources

- https://github.com/stanfordnlp/stanza
- https://stanfordnlp.github.io/stanza/

---

Source: https://tokrepo.com/en/workflows/asset-94ab44a4
Author: AI Open Source