Introduction
NVIDIA NeMo is a framework for researchers and developers who need to build, train, and deploy conversational AI and generative AI models at scale. It provides pre-built collections for LLMs, automatic speech recognition (ASR), text-to-speech (TTS), and NLP tasks, all optimized for NVIDIA GPU clusters with integrated distributed training.
What NeMo Does
- Trains and fine-tunes large language models using tensor, pipeline, and expert parallelism
- Provides end-to-end ASR pipelines with pre-trained Conformer and FastConformer models
- Supports TTS model training including FastPitch, HiFi-GAN, and RADTTS
- Implements RLHF, DPO, and SFT alignment methods for instruction-tuning LLMs
- Exports models to NVIDIA TensorRT-LLM and Triton for optimized production serving
Architecture Overview
NeMo is built on PyTorch and uses NVIDIA Megatron-LM for distributed LLM training with 3D parallelism (tensor, pipeline, data). Models are defined as collections of Neural Modules that connect via typed ports. A YAML-based configuration system (via Hydra/OmegaConf) controls every training parameter. NeMo Curator handles data preprocessing at scale, while NeMo Guardrails adds safety controls for deployed models.
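The 3D parallelism mentioned above splits the global pool of GPU ranks along tensor, pipeline, and data axes. A minimal sketch of that counting argument (the function name and rank layout here are illustrative, not Megatron-LM's actual group-initialization code):

```python
# Illustrative sketch: map a flat GPU rank to (tensor, pipeline, data)
# coordinates under 3D parallelism. Megatron-LM's real group layout
# differs in details; this only shows how the axes partition the ranks.

def decompose_rank(rank: int, world_size: int, tp: int, pp: int):
    """Split `world_size` ranks into tensor/pipeline/data groups.

    Assumes ranks are laid out tensor-fastest, then pipeline, then data.
    """
    assert world_size % (tp * pp) == 0, "world size must be divisible by tp*pp"
    dp = world_size // (tp * pp)          # number of data-parallel replicas
    tp_rank = rank % tp                   # position within a tensor group
    pp_rank = (rank // tp) % pp           # pipeline stage
    dp_rank = rank // (tp * pp)           # which data-parallel replica
    return tp_rank, pp_rank, dp_rank, dp

# Example: 16 GPUs with tensor parallelism 2 and pipeline parallelism 4
# leaves 16 / (2 * 4) = 2 data-parallel replicas.
print(decompose_rank(11, 16, tp=2, pp=4))  # → (1, 1, 1, 2)
```

The key constraint this illustrates: tensor and pipeline degrees are fixed by the config, and whatever GPUs remain become data-parallel replicas.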
Self-Hosting & Configuration
- Install via pip: `pip install nemo_toolkit[all]` (requires PyTorch and CUDA)
- Use NVIDIA NGC containers for pre-configured environments: `nvcr.io/nvidia/nemo`
- Training configs are YAML files specifying model architecture, data, optimizer, and parallelism
- Multi-GPU training uses `torchrun` or NeMo's built-in launcher with Slurm integration
- Fine-tune with LoRA or P-tuning via config overrides: `model.peft.peft_scheme=lora`
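An override like `model.peft.peft_scheme=lora` is a dotted-path assignment into the nested YAML config. A rough sketch of how such an override lands in a config tree (plain dicts here for self-containment; NeMo actually resolves these through Hydra/OmegaConf):

```python
# Sketch: apply a Hydra-style "a.b.c=value" override to a nested config.
# NeMo uses Hydra/OmegaConf for this; plain dicts keep the sketch standalone.

def apply_override(config: dict, override: str) -> dict:
    path, _, value = override.partition("=")
    keys = path.split(".")
    node = config
    for key in keys[:-1]:
        node = node.setdefault(key, {})   # walk (or create) intermediate nodes
    node[keys[-1]] = value                # assign the leaf value
    return config

cfg = {"model": {"peft": {"peft_scheme": None}}}
apply_override(cfg, "model.peft.peft_scheme=lora")
print(cfg["model"]["peft"]["peft_scheme"])  # → lora
```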
Key Features
- Scales from single GPU to thousands of GPUs with automatic parallelism strategies
- Pre-trained model zoo on NVIDIA NGC with models for ASR, TTS, NLP, and LLMs
- NeMo Curator for large-scale data deduplication, filtering, and quality scoring
- NeMo Guardrails for adding programmable safety rails to deployed LLM applications
- Seamless export to TensorRT-LLM for up to 8x inference speedup on NVIDIA hardware
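The deduplication step NeMo Curator performs can be pictured, in its simplest form, as hashing normalized documents and keeping the first copy of each hash. Curator's real pipeline is GPU-accelerated and also includes fuzzy (MinHash-based) dedup; this exact-hash version is only a sketch:

```python
import hashlib

# Sketch: exact deduplication by content hash, the simplest of the
# strategies a curation pipeline like NeMo Curator applies.

def dedup_exact(docs):
    seen = set()
    kept = []
    for doc in docs:
        # Normalize whitespace and case so trivially reformatted copies collide.
        key = hashlib.sha256(" ".join(doc.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

corpus = ["Hello world", "hello   WORLD", "Goodbye"]
print(dedup_exact(corpus))  # → ['Hello world', 'Goodbye']
```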
Comparison with Similar Tools
- Hugging Face Transformers — broader model coverage but NeMo provides better multi-node training at scale
- DeepSpeed — focuses on distributed training optimization; NeMo provides full training recipes and model collections
- Axolotl — simpler fine-tuning setup but NeMo handles pre-training and larger-scale training
- Megatron-LM — NeMo builds on Megatron and adds ASR, TTS, data curation, and configuration management
- vLLM — inference-only; NeMo covers the full lifecycle from data prep through training to deployment
FAQ
Q: Do I need NVIDIA GPUs to use NeMo? A: Yes, NeMo is optimized for NVIDIA GPUs. Training requires CUDA-capable GPUs, and many features leverage NVIDIA-specific libraries like cuDNN and NCCL.
Q: Can NeMo fine-tune open-weight models like LLaMA? A: Yes, NeMo supports SFT, LoRA, and RLHF/DPO fine-tuning for LLaMA, Mistral, Gemma, and other architectures with pre-built recipes.
Q: How does NeMo handle data preprocessing? A: NeMo Curator provides GPU-accelerated data pipelines for deduplication, quality filtering, PII removal, and domain classification at petabyte scale.
Q: Is NeMo suitable for speech applications? A: Yes, NeMo has extensive ASR and TTS collections with pre-trained models supporting 100+ languages and streaming inference.