# Megatron-LM — Train Transformer Models at Scale by NVIDIA

> NVIDIA's research framework for efficient large-scale training of transformer models with tensor, pipeline, and sequence parallelism.

## Quick Use

```bash
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM && pip install -e .

# Launch distributed training (example with 8 GPUs)
torchrun --nproc_per_node=8 pretrain_gpt.py --config-file examples/gpt3/config.yaml
```

## Introduction

Megatron-LM is NVIDIA's open-source framework for training large transformer models across hundreds or thousands of GPUs. It pioneered the tensor-parallel and pipeline-parallel techniques that are now standard in large-scale LLM training, and its Megatron-Core library underpins many LLM training pipelines.

## What Megatron-LM Does

- Implements tensor parallelism to split individual transformer layers across GPUs
- Provides pipeline parallelism to distribute model stages across groups of GPUs
- Supports sequence parallelism for long-context training efficiency
- Includes context parallelism for training with very long sequences (100K+ tokens)
- Offers Megatron-Core as a composable library for building custom training loops

## Architecture Overview

Megatron-LM splits model computation using a 3D parallelism strategy: tensor parallelism within a node, pipeline parallelism across nodes, and data parallelism across pipeline replicas. Megatron-Core provides modular components (attention, MLP, embeddings) that handle the distributed communication internally.
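To make that decomposition concrete, here is a minimal sketch that initializes Megatron-Core's parallel groups for a hypothetical 8-GPU job split as tensor parallel 2 x pipeline parallel 2, which leaves data parallel 2. `parallel_state.initialize_model_parallel` and the rank getters are part of Megatron-Core's API; the script name and the sizes are illustrative only, not a tested recipe.

```python
# Sketch: carve 8 GPUs into TP=2 x PP=2 x DP=2 process groups.
# Assumed launch: torchrun --nproc_per_node=8 parallel_groups_sketch.py
import os

import torch
from megatron.core import parallel_state


def main():
    # One process per GPU; torchrun provides RANK / WORLD_SIZE / LOCAL_RANK.
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    torch.distributed.init_process_group(backend="nccl")

    # Tensor parallel x pipeline parallel; the remaining factor of the world
    # size (here 8 / (2 * 2) = 2) becomes the data-parallel dimension.
    parallel_state.initialize_model_parallel(
        tensor_model_parallel_size=2,
        pipeline_model_parallel_size=2,
    )

    print(
        f"rank {torch.distributed.get_rank()}: "
        f"TP rank {parallel_state.get_tensor_model_parallel_rank()}, "
        f"PP rank {parallel_state.get_pipeline_model_parallel_rank()}, "
        f"DP rank {parallel_state.get_data_parallel_rank()}"
    )

    parallel_state.destroy_model_parallel()
    torch.distributed.destroy_process_group()


if __name__ == "__main__":
    main()
```

Each rank prints its coordinates in the TP/PP/DP grid, which is the same bookkeeping `pretrain_gpt.py` performs when the corresponding parallelism flags are passed on the command line.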
## Self-Hosting & Configuration

- Requires NVIDIA GPUs with NCCL for inter-GPU communication
- Installed via pip from the repository, with PyTorch and CUDA as dependencies
- Parallelism dimensions are configured through command-line arguments such as `--tensor-model-parallel-size` and `--pipeline-model-parallel-size`
- Integrates with NVIDIA's NeMo framework for higher-level training workflows
- Supports mixed-precision training (BF16, FP8) through Transformer Engine

## Key Features

- Battle-tested training framework used to pretrain GPT-3-class and Llama-class foundation models
- Achieves near-linear scaling efficiency across thousands of GPUs
- FP8 training support via Transformer Engine on Hopper-generation GPUs
- Built-in data pipeline with efficient dataset sharding and tokenization
- Distributed checkpointing with automatic resharding across parallelism configurations

## Comparison with Similar Tools

- **DeepSpeed** — Microsoft's distributed training library; Megatron-LM focuses on a parallelism-first design for very large models
- **Colossal-AI** — community alternative with similar parallelism support; Megatron-LM is more battle-tested at extreme scale
- **FSDP (PyTorch)** — built-in sharded data parallelism; Megatron-LM offers finer-grained control over parallelism
- **LlamaFactory** — high-level fine-tuning tool; Megatron-LM targets large-scale pretraining from scratch
- **Axolotl** — fine-tuning focused; Megatron-LM is designed for multi-node pretraining workloads

## FAQ

**Q: Can I use Megatron-LM for fine-tuning?**
A: Yes, but it is primarily designed for pretraining. For fine-tuning, tools like NeMo or LlamaFactory may be more convenient.

**Q: How many GPUs do I need?**
A: Megatron-LM scales from a single GPU to thousands, but its parallelism features shine at 8+ GPUs.

**Q: Does it support non-NVIDIA hardware?**
A: No. It requires NVIDIA GPUs and CUDA. AMD and other accelerators are not supported.

**Q: What is Megatron-Core?**
A: A modular library extracted from Megatron-LM that provides reusable distributed transformer building blocks for custom training pipelines. A minimal usage sketch appears after the Sources list below.

## Sources

- https://github.com/NVIDIA/Megatron-LM
- https://docs.nvidia.com/megatron-core/
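## Appendix: Minimal Megatron-Core Sketch

The sketch below, written in the spirit of the Megatron-Core quickstart, builds a deliberately tiny GPT model from Megatron-Core's reusable blocks. `TransformerConfig`, `GPTModel`, and `get_gpt_layer_local_spec` are Megatron-Core components, but constructor arguments can shift between releases; the sizes and the script name are placeholders.

```python
# Sketch: assemble a tiny GPT model from Megatron-Core building blocks.
# Assumed launch: torchrun --nproc_per_node=1 core_gpt_sketch.py
import os

import torch
from megatron.core import parallel_state
from megatron.core.tensor_parallel.random import model_parallel_cuda_manual_seed
from megatron.core.transformer.transformer_config import TransformerConfig
from megatron.core.models.gpt.gpt_model import GPTModel
from megatron.core.models.gpt.gpt_layer_specs import get_gpt_layer_local_spec


def main():
    # Single-GPU setup; Megatron-Core still requires model-parallel state.
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", "0")))
    torch.distributed.init_process_group(backend="nccl")
    parallel_state.initialize_model_parallel(
        tensor_model_parallel_size=1,
        pipeline_model_parallel_size=1,
    )
    model_parallel_cuda_manual_seed(123)

    # Deliberately tiny configuration, just to exercise the building blocks.
    config = TransformerConfig(
        num_layers=2,
        hidden_size=64,
        num_attention_heads=4,
        use_cpu_initialization=True,
        pipeline_dtype=torch.float32,
    )
    model = GPTModel(
        config=config,
        transformer_layer_spec=get_gpt_layer_local_spec(),
        vocab_size=128,
        max_sequence_length=64,
    ).cuda()
    print(sum(p.numel() for p in model.parameters()), "parameters")

    parallel_state.destroy_model_parallel()
    torch.distributed.destroy_process_group()


if __name__ == "__main__":
    main()
```

The same building blocks scale out unchanged once the tensor- and pipeline-parallel sizes are raised and the job is launched across more GPUs.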