# Megatron-LM — Train Transformer Models at Scale by NVIDIA

> NVIDIA's research framework for efficient large-scale training of transformer models with tensor, pipeline, and sequence parallelism.

## Quick Use

```bash
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM && pip install -e .

# Launch distributed training (example with 8 GPUs)
torchrun --nproc_per_node=8 pretrain_gpt.py --config-file examples/gpt3/config.yaml
```

## Introduction

Megatron-LM is NVIDIA's open-source framework for training large transformer models across hundreds or thousands of GPUs. It pioneered the tensor-parallel and pipeline-parallel techniques that are now standard in large-scale LLM training, and its Megatron-Core library underpins many LLM training pipelines.

## What Megatron-LM Does

- Implements tensor parallelism to split individual transformer layers across GPUs
- Provides pipeline parallelism to distribute model stages across groups of GPUs
- Supports sequence parallelism for long-context training efficiency
- Includes context parallelism for training with very long sequences (100K+ tokens)
- Offers Megatron-Core as a composable library for building custom training loops

## Architecture Overview

Megatron-LM splits model computation using a 3D parallelism strategy: tensor parallelism within a node, pipeline parallelism across nodes, and data parallelism across pipeline replicas. Megatron-Core provides modular components (attention, MLP, embeddings) that handle the distributed communication internally.
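To make that decomposition concrete, here is a minimal sketch that initializes Megatron-Core's parallel groups for a hypothetical 8-GPU job split as tensor parallel 2 x pipeline parallel 2, which leaves data parallel 2. `parallel_state.initialize_model_parallel` and the rank getters are part of Megatron-Core's API; the script name and the sizes are illustrative only, not a tested recipe.

```python
# Sketch: carve 8 GPUs into TP=2 x PP=2 x DP=2 process groups.
# Assumed launch: torchrun --nproc_per_node=8 parallel_groups_sketch.py
import os

import torch
from megatron.core import parallel_state


def main():
    # One process per GPU; torchrun provides RANK / WORLD_SIZE / LOCAL_RANK.
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    torch.distributed.init_process_group(backend="nccl")

    # Tensor parallel x pipeline parallel; the remaining factor of the world
    # size (here 8 / (2 * 2) = 2) becomes the data-parallel dimension.
    parallel_state.initialize_model_parallel(
        tensor_model_parallel_size=2,
        pipeline_model_parallel_size=2,
    )

    print(
        f"rank {torch.distributed.get_rank()}: "
        f"TP rank {parallel_state.get_tensor_model_parallel_rank()}, "
        f"PP rank {parallel_state.get_pipeline_model_parallel_rank()}, "
        f"DP rank {parallel_state.get_data_parallel_rank()}"
    )

    parallel_state.destroy_model_parallel()
    torch.distributed.destroy_process_group()


if __name__ == "__main__":
    main()
```

Each rank prints its coordinates in the TP/PP/DP grid, which is the same bookkeeping `pretrain_gpt.py` performs when the corresponding parallelism flags are passed on the command line.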
## Self-Hosting & Configuration

- Requires NVIDIA GPUs with NCCL for inter-GPU communication
- Installed via pip from the repository, with PyTorch and CUDA as dependencies
- Parallelism dimensions are configured through command-line arguments such as `--tensor-model-parallel-size` and `--pipeline-model-parallel-size`
- Integrates with NVIDIA's NeMo framework for higher-level training workflows
- Supports mixed-precision training (BF16, FP8) through Transformer Engine

## Key Features

- Battle-tested training framework used to pretrain GPT-3-class and Llama-class foundation models
- Achieves near-linear scaling efficiency across thousands of GPUs
- FP8 training support via Transformer Engine on Hopper-generation GPUs
- Built-in data pipeline with efficient dataset sharding and tokenization
- Distributed checkpointing with automatic resharding across parallelism configurations

## Comparison with Similar Tools

- **DeepSpeed** — Microsoft's distributed training library; Megatron-LM focuses on a parallelism-first design for very large models
- **Colossal-AI** — community alternative with similar parallelism support; Megatron-LM is more battle-tested at extreme scale
- **FSDP (PyTorch)** — built-in sharded data parallelism; Megatron-LM offers finer-grained control over parallelism
- **LlamaFactory** — high-level fine-tuning tool; Megatron-LM targets large-scale pretraining from scratch
- **Axolotl** — fine-tuning focused; Megatron-LM is designed for multi-node pretraining workloads

## FAQ

**Q: Can I use Megatron-LM for fine-tuning?**
A: Yes, but it is primarily designed for pretraining. For fine-tuning, tools like NeMo or LlamaFactory may be more convenient.

**Q: How many GPUs do I need?**
A: Megatron-LM scales from a single GPU to thousands, but its parallelism features shine at 8+ GPUs.

**Q: Does it support non-NVIDIA hardware?**
A: No. It requires NVIDIA GPUs and CUDA. AMD and other accelerators are not supported.

**Q: What is Megatron-Core?**
A: A modular library extracted from Megatron-LM that provides reusable distributed transformer building blocks for custom training pipelines. A minimal usage sketch appears after the Sources list below.

## Sources

- https://github.com/NVIDIA/Megatron-LM
- https://docs.nvidia.com/megatron-core/
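## Appendix: Minimal Megatron-Core Sketch

The sketch below, written in the spirit of the Megatron-Core quickstart, builds a deliberately tiny GPT model from Megatron-Core's reusable blocks. `TransformerConfig`, `GPTModel`, and `get_gpt_layer_local_spec` are Megatron-Core components, but constructor arguments can shift between releases; the sizes and the script name are placeholders.

```python
# Sketch: assemble a tiny GPT model from Megatron-Core building blocks.
# Assumed launch: torchrun --nproc_per_node=1 core_gpt_sketch.py
import os

import torch
from megatron.core import parallel_state
from megatron.core.tensor_parallel.random import model_parallel_cuda_manual_seed
from megatron.core.transformer.transformer_config import TransformerConfig
from megatron.core.models.gpt.gpt_model import GPTModel
from megatron.core.models.gpt.gpt_layer_specs import get_gpt_layer_local_spec


def main():
    # Single-GPU setup; Megatron-Core still requires model-parallel state.
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", "0")))
    torch.distributed.init_process_group(backend="nccl")
    parallel_state.initialize_model_parallel(
        tensor_model_parallel_size=1,
        pipeline_model_parallel_size=1,
    )
    model_parallel_cuda_manual_seed(123)

    # Deliberately tiny configuration, just to exercise the building blocks.
    config = TransformerConfig(
        num_layers=2,
        hidden_size=64,
        num_attention_heads=4,
        use_cpu_initialization=True,
        pipeline_dtype=torch.float32,
    )
    model = GPTModel(
        config=config,
        transformer_layer_spec=get_gpt_layer_local_spec(),
        vocab_size=128,
        max_sequence_length=64,
    ).cuda()
    print(sum(p.numel() for p in model.parameters()), "parameters")

    parallel_state.destroy_model_parallel()
    torch.distributed.destroy_process_group()


if __name__ == "__main__":
    main()
```

The same building blocks scale out unchanged once the tensor- and pipeline-parallel sizes are raised and the job is launched across more GPUs.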