Introduction
Megatron-LM is NVIDIA's open-source framework for training large transformer models across hundreds or thousands of GPUs. It pioneered the tensor (intra-layer) model parallelism techniques that are now standard in large-scale LLM training and introduced widely used pipeline-parallel schedules, and its Megatron-Core library underpins many LLM training pipelines.
What Megatron-LM Does
- Implements tensor parallelism to split individual transformer layers (their attention and MLP weight matrices) across GPUs; a conceptual sketch follows this list
- Provides pipeline parallelism to distribute model stages across GPU groups
- Supports sequence parallelism, which shards layernorm and dropout activations along the sequence dimension to reduce activation memory when combined with tensor parallelism
- Includes context parallelism for training with very long sequences (100K+ tokens)
- Offers Megatron-Core as a composable library for building custom training loops
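The column-parallel split behind tensor parallelism can be illustrated in plain PyTorch. The sketch below is conceptual only and is not Megatron's implementation (the real one lives in megatron.core.tensor_parallel and uses NCCL collectives): each "GPU" holds a slice of the weight's output columns, and the full output is recovered by concatenating the partial results (an all-gather in the real distributed case).

```python
# Conceptual sketch of column-parallel tensor parallelism (not Megatron's actual code).
# A linear layer Y = X @ W is split column-wise across "GPUs": each shard holds a slice
# of W's output columns, computes its partial output locally, and the full result is
# recovered by concatenating (all-gathering) the per-shard outputs.
import torch

def column_parallel_matmul(x: torch.Tensor, w: torch.Tensor, num_shards: int) -> torch.Tensor:
    # Split the weight along its output (column) dimension, one slice per shard.
    w_shards = torch.chunk(w, num_shards, dim=1)
    # Each shard computes its local partial result; in real tensor parallelism each
    # product runs on a different GPU and the concat is an all-gather over NCCL.
    partial_outputs = [x @ w_shard for w_shard in w_shards]
    return torch.cat(partial_outputs, dim=-1)

x = torch.randn(4, 1024)     # (batch, hidden)
w = torch.randn(1024, 4096)  # (hidden, ffn_hidden)
y_parallel = column_parallel_matmul(x, w, num_shards=2)
assert torch.allclose(y_parallel, x @ w, atol=1e-5)  # matches the unsharded matmul
```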
Architecture Overview
Megatron-LM splits model computation using a 3D parallelism strategy: tensor parallelism within a node (where NVLink bandwidth is highest), pipeline parallelism across nodes, and data parallelism across the resulting full model replicas. Megatron-Core provides modular components (attention, MLP, embeddings) that handle the distributed communication internally.
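As a minimal sketch of how that grid is set up when driving Megatron-Core directly (assuming a torchrun-style launch and a recent megatron-core release, whose exact parallel_state arguments can vary):

```python
# Minimal sketch: wiring up 3D parallelism with Megatron-Core's parallel_state.
# Assumes a distributed launch (e.g. torchrun) that sets RANK, WORLD_SIZE and LOCAL_RANK;
# exact argument names may differ between megatron-core releases.
import os
import torch
from megatron.core import parallel_state

torch.distributed.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Example with 64 GPUs: 8-way tensor parallel inside a node, 2-way pipeline parallel
# across nodes; the leftover factor 64 / (8 * 2) = 4 becomes the data-parallel size.
parallel_state.initialize_model_parallel(
    tensor_model_parallel_size=8,
    pipeline_model_parallel_size=2,
)

# Each rank can then ask where it sits in the grid.
tp_rank = parallel_state.get_tensor_model_parallel_rank()
pp_rank = parallel_state.get_pipeline_model_parallel_rank()
dp_rank = parallel_state.get_data_parallel_rank()
```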
Self-Hosting & Configuration
- Requires NVIDIA GPUs with NCCL for inter-GPU communication
- Install from the GitHub repository on top of a working PyTorch + CUDA environment; Megatron-Core is also published on PyPI (pip install megatron-core)
- Configure parallelism dimensions via command-line arguments such as --tensor-model-parallel-size and --pipeline-model-parallel-size; the same knobs appear as config fields when using Megatron-Core as a library (see the sketch after this list)
- Integrates with NVIDIA's NeMo framework for higher-level training workflows
- Supports mixed precision training (FP16/BF16) and FP8 via NVIDIA Transformer Engine
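To illustrate the library-level equivalent of those command-line flags, here is a sketch of a Megatron-Core configuration object; it assumes a recent megatron-core release, and the exact field names (inherited from ModelParallelConfig) may differ between versions.

```python
# Sketch: the CLI flags (--tensor-model-parallel-size, --sequence-parallel, --bf16, ...)
# map onto fields of Megatron-Core's config dataclasses when it is driven as a library.
# Field names follow recent megatron-core releases and may differ between versions.
import torch
from megatron.core.transformer.transformer_config import TransformerConfig

config = TransformerConfig(
    # Model shape.
    num_layers=24,
    hidden_size=2048,
    num_attention_heads=16,
    # Parallelism dimensions (inherited from ModelParallelConfig).
    tensor_model_parallel_size=4,
    pipeline_model_parallel_size=2,
    sequence_parallel=True,
    # Mixed precision: BF16 compute and BF16 pipeline (p2p) communication.
    bf16=True,
    pipeline_dtype=torch.bfloat16,
)
```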
Key Features
- Industry-proven training framework used to train models such as Megatron-Turing NLG 530B and BLOOM (via Megatron-DeepSpeed), and the basis for many other foundation-model training stacks
- Achieves near-linear scaling efficiency across thousands of GPUs
- FP8 training support via Transformer Engine on Hopper and newer GPU architectures
- Built-in data pipeline with efficient dataset sharding and tokenization
- Distributed checkpointing with automatic resharding across parallelism configurations (see the sketch below)
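A sketch of how that checkpointing API is typically driven is shown below; it assumes a recent megatron.core.dist_checkpointing release and a Megatron-Core module exposing sharded_state_dict(), and the exact signatures may differ between versions.

```python
# Sketch: round-tripping a model through Megatron-Core's distributed checkpointing
# (megatron.core.dist_checkpointing). The checkpoint stores sharded tensors plus global
# metadata, so it can be reloaded under a different tensor/pipeline-parallel layout.
# Exact APIs may differ between releases; `model` is a placeholder Megatron-Core module.
from megatron.core import dist_checkpointing

def save_and_reload(model, ckpt_dir: str) -> None:
    # Save: every rank writes only the tensor shards it owns.
    dist_checkpointing.save(model.sharded_state_dict(), ckpt_dir)

    # Load: describe the *current* parallel layout with a fresh sharded_state_dict and
    # let the library reshard the stored tensors to match it; the layout used when
    # saving does not have to match the one used when loading.
    loaded = dist_checkpointing.load(model.sharded_state_dict(), ckpt_dir)
    model.load_state_dict(loaded)
```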
Comparison with Similar Tools
- DeepSpeed — Microsoft's distributed training library built around ZeRO optimizer-state sharding; Megatron-LM focuses on a parallelism-first design for very large models, and the two have been combined as Megatron-DeepSpeed
- ColossalAI — community alternative with similar parallelism support; Megatron-LM is more battle-tested at extreme scale
- FSDP (PyTorch) — built-in sharded training; Megatron-LM offers finer-grained parallelism control
- LlamaFactory — high-level fine-tuning tool; Megatron-LM targets large-scale pretraining from scratch
- Axolotl — fine-tuning focused; Megatron-LM is designed for multi-node pretraining workloads
FAQ
Q: Can I use Megatron-LM for fine-tuning? A: Yes, but it is primarily designed for pretraining. For fine-tuning, tools like NeMo or LlamaFactory may be more convenient.
Q: How many GPUs do I need? A: Megatron-LM scales from a single GPU to thousands, but its parallelism features shine at 8+ GPUs.
Q: Does it support non-NVIDIA hardware? A: No. It requires NVIDIA GPUs and CUDA. AMD and other accelerators are not supported.
Q: What is Megatron-Core? A: A modular library extracted from Megatron-LM that provides reusable distributed transformer building blocks for custom training pipelines.
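As an illustration of those building blocks, the sketch below assembles a small GPT model from Megatron-Core components; module paths and constructor arguments follow recent megatron-core releases and are not guaranteed to match every version.

```python
# Sketch: building a small GPT model from Megatron-Core components. Paths and arguments
# follow recent megatron-core releases and may differ between versions; parallel_state
# must already be initialized (see the Architecture Overview sketch).
from megatron.core.models.gpt.gpt_model import GPTModel
from megatron.core.models.gpt.gpt_layer_specs import get_gpt_layer_local_spec
from megatron.core.transformer.transformer_config import TransformerConfig

config = TransformerConfig(num_layers=12, hidden_size=768, num_attention_heads=12)

model = GPTModel(
    config=config,
    transformer_layer_spec=get_gpt_layer_local_spec(),  # which attention/MLP impls to plug in
    vocab_size=50304,
    max_sequence_length=2048,
)
```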