# Horovod — Distributed Deep Learning Training Framework

> A distributed training framework for TensorFlow, Keras, PyTorch, and MXNet that scales model training across multiple GPUs and nodes with minimal code changes.

## Quick Use

```bash
pip install horovod
horovodrun -np 4 python train.py
```

```python
import horovod.torch as hvd

hvd.init()
optimizer = hvd.DistributedOptimizer(optimizer)
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
```

A fuller version of this snippet appears under the worked example below.

## Introduction

Horovod is a distributed deep learning training framework originally developed at Uber. It uses ring-allreduce and MPI concepts to scale training across multiple GPUs and machines with only a few lines of additional code, regardless of the deep learning framework in use.

## What Horovod Does

- Distributes training across GPUs with ring-allreduce gradient aggregation
- Supports PyTorch, TensorFlow, Keras, and Apache MXNet
- Scales from a single machine to hundreds of nodes with near-linear speedup
- Provides Elastic Horovod for fault-tolerant training on preemptible instances
- Works with Spark for data-parallel training on existing Spark clusters

## Architecture Overview

Horovod wraps the training optimizer to intercept gradient tensors and perform allreduce across all workers. It uses NCCL for GPU-to-GPU communication and MPI or Gloo for coordination. The ring-allreduce algorithm divides gradient tensors into chunks and pipelines them around a logical ring, achieving bandwidth-optimal communication (a single-process simulation appears in the ring-allreduce sketch below). Tensor Fusion batches small tensors together to reduce per-operation overhead.

## Self-Hosting & Configuration

- Requires MPI (Open MPI recommended) or Gloo as the communication backend
- Install with framework flags, e.g. `HOROVOD_WITH_PYTORCH=1 pip install horovod`
- NCCL is required for multi-GPU training; set `HOROVOD_GPU_OPERATIONS=NCCL`
- Use `horovodrun` or `mpirun` to launch distributed jobs (combined build and launch commands appear below)
- Supports deployment on Kubernetes, Spark, Ray, and bare-metal clusters

## Key Features

- Near-linear scaling efficiency with ring-allreduce and tensor fusion
- Elastic training that adapts to node additions and removals at runtime (see the elastic training sketch below)
- Timeline profiling for debugging communication bottlenecks
- Auto-tuning of fusion buffer size and cycle time (see Profiling and Auto-Tuning below)
- Integration with Spark MLlib for unified data and training pipelines

## Comparison with Similar Tools

- **PyTorch DDP** — Native PyTorch solution; Horovod offers multi-framework support
- **DeepSpeed** — Focuses on ZeRO memory optimization; Horovod is simpler for basic data parallelism
- **Ray Train** — Higher-level API with fault tolerance; Horovod provides lower-level MPI control
- **tf.distribute** — TensorFlow-only; Horovod gives a consistent API across frameworks

## FAQ

**Q: How many lines of code does it take to distribute training?**
A: Typically 5–10 lines: init, wrap the optimizer, broadcast initial parameters, and adjust the data sampler (see the worked example below).

**Q: Does Horovod support elastic scaling?**
A: Yes. Elastic Horovod allows workers to join or leave during training, which is useful on spot/preemptible instances.

**Q: What hardware is required?**
A: Any machine with NVIDIA GPUs and NCCL, or CPUs with Gloo. InfiniBand is supported for high-bandwidth clusters.

**Q: Is Horovod still actively maintained?**
A: The project is in maintenance mode with community contributions. For new projects, consider PyTorch DDP or DeepSpeed.
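## Worked Example: Minimal PyTorch Training Script

The Quick Use snippet elides the data-sharding and state-broadcast steps mentioned in the FAQ. Below is a minimal, self-contained sketch following the steps documented for PyTorch (init, shard data, scale the learning rate, wrap the optimizer, broadcast initial state); the dataset, model, and hyperparameters are synthetic placeholders for illustration.

```python
import torch
import torch.nn as nn
import torch.utils.data as data
import horovod.torch as hvd

hvd.init()
# Pin each process to one GPU, if available.
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())

# Synthetic regression data, purely for illustration.
features = torch.randn(1024, 10)
targets = torch.randn(1024, 1)
dataset = data.TensorDataset(features, targets)

# Shard the dataset so each worker sees a distinct partition.
sampler = data.DistributedSampler(
    dataset, num_replicas=hvd.size(), rank=hvd.rank())
loader = data.DataLoader(dataset, batch_size=32, sampler=sampler)

model = nn.Linear(10, 1)
if torch.cuda.is_available():
    model.cuda()

# Scale the learning rate by the number of workers (the common convention).
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are allreduced across workers.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# Start all workers from identical model and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

loss_fn = nn.MSELoss()
for epoch in range(2):
    sampler.set_epoch(epoch)  # reshuffle shards each epoch
    for x, y in loader:
        if torch.cuda.is_available():
            x, y = x.cuda(), y.cuda()
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    if hvd.rank() == 0:
        print(f"epoch {epoch}: loss {loss.item():.4f}")
```

Run it with `horovodrun -np 4 python train.py`. The Horovod-specific calls here (`init`, the sharded sampler, `DistributedOptimizer`, and the two broadcasts) are the "5–10 lines" the FAQ refers to.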
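## Ring-Allreduce Sketch

To make the bandwidth-optimality claim in the Architecture Overview concrete, here is a single-process NumPy simulation of ring-allreduce. This is not Horovod code, just an illustrative sketch: across the N−1 reduce-scatter steps and N−1 allgather steps, every worker transmits 1/N of the data per step, so each sends 2(N−1)/N ≈ 2 units of data in total, independent of the number of workers.

```python
import numpy as np

def ring_allreduce(worker_data):
    """Simulate ring-allreduce among N workers.

    worker_data: list of equal-length 1-D arrays, one per worker.
    Returns the per-worker results, all equal to the elementwise sum.
    """
    n = len(worker_data)
    # Each worker's private copy, split into n chunks.
    chunks = [np.array_split(d.astype(float), n) for d in worker_data]

    # Phase 1: reduce-scatter. In step s, worker i sends chunk (i - s) % n
    # to its right neighbour, which accumulates it. After n - 1 steps,
    # worker i holds the fully reduced chunk (i + 1) % n.
    for s in range(n - 1):
        sends = [(i, (i - s) % n, chunks[i][(i - s) % n].copy())
                 for i in range(n)]
        for i, c, payload in sends:
            chunks[(i + 1) % n][c] += payload

    # Phase 2: allgather. Each reduced chunk circulates around the ring
    # until every worker holds every fully reduced chunk.
    for s in range(n - 1):
        sends = [(i, (i + 1 - s) % n, chunks[i][(i + 1 - s) % n].copy())
                 for i in range(n)]
        for i, c, payload in sends:
            chunks[(i + 1) % n][c] = payload

    return [np.concatenate(c) for c in chunks]

# Four workers, each with distinct data; all end up with the same sum.
workers = [np.arange(12, dtype=float) * (r + 1) for r in range(4)]
results = ring_allreduce(workers)
assert all(np.allclose(r, sum(workers)) for r in results)
```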
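## Build and Launch Commands

The configuration flags listed under Self-Hosting & Configuration combine as follows. These commands mirror Horovod's installation and MPI documentation; the host names and slot counts are placeholders for your own cluster.

```bash
# Build Horovod against PyTorch with NCCL for GPU allreduce.
HOROVOD_WITH_PYTORCH=1 HOROVOD_GPU_OPERATIONS=NCCL \
    pip install --no-cache-dir horovod

# Launch 8 workers across two 4-GPU hosts with horovodrun.
horovodrun -np 8 -H server1:4,server2:4 python train.py

# Equivalent mpirun invocation (Open MPI; exact flags vary by cluster).
mpirun -np 8 -H server1:4,server2:4 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    python train.py
```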
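## Profiling and Auto-Tuning

The timeline profiling and auto-tuning mentioned under Key Features are enabled through environment variables documented by Horovod; the output path below is a placeholder.

```bash
# Record a timeline of communication operations for later inspection.
HOROVOD_TIMELINE=/tmp/timeline.json horovodrun -np 4 python train.py

# Let Horovod search for good fusion-buffer and cycle-time settings.
HOROVOD_AUTOTUNE=1 horovodrun -np 4 python train.py
```

The resulting timeline file can be loaded in a Chrome-trace viewer (e.g. `chrome://tracing`) to spot communication bottlenecks.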
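## Elastic Training Sketch

Elastic Horovod wraps the training loop in a function that is restarted when workers join or leave. A minimal sketch of the documented pattern follows; the model, optimizer, and placeholder batches are illustrative stand-ins.

```python
import torch
import torch.nn.functional as F
import horovod.torch as hvd

hvd.init()

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

@hvd.elastic.run
def train(state):
    # After a worker joins or leaves, training resumes here from the
    # last committed state rather than from scratch.
    for state.epoch in range(state.epoch, 10):
        x, y = torch.randn(32, 10), torch.randn(32, 1)  # placeholder batch
        optimizer.zero_grad()
        F.mse_loss(model(x), y).backward()
        optimizer.step()
        state.commit()  # checkpoint; a restart loses at most one epoch

state = hvd.elastic.TorchState(model=model, optimizer=optimizer, epoch=0)
train(state)
```

Elastic jobs are launched with worker bounds and a site-specific host discovery script, along the lines of `horovodrun -np 4 --min-np 2 --max-np 8 --host-discovery-script ./discover_hosts.sh python train.py` (flag spellings vary by release; newer versions also accept `--min-num-proc`/`--max-num-proc`).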
## Sources

- https://github.com/horovod/horovod
- https://horovod.readthedocs.io