Introduction
Horovod is a distributed deep learning training framework originally developed at Uber. It combines the ring-allreduce algorithm with MPI concepts such as size, rank, and allreduce to scale training across multiple GPUs and machines, and it requires only a few lines of additional code regardless of the deep learning framework being used.
What Horovod Does
- Distributes training across GPUs with ring-allreduce gradient aggregation
- Supports PyTorch, TensorFlow, Keras, and Apache MXNet
- Scales from a single machine to hundreds of nodes with near-linear speedup
- Provides Elastic Horovod for fault-tolerant training on preemptible instances
- Works with Spark for data-parallel training on existing Spark clusters
Architecture Overview
Horovod wraps the training optimizer in a DistributedOptimizer that intercepts gradient tensors during the backward pass and averages them across all workers with allreduce. It uses NCCL for GPU-to-GPU communication and MPI or Gloo for coordination. The ring-allreduce algorithm splits gradient tensors into chunks and pipelines them around a logical ring, achieving bandwidth-optimal communication. Tensor Fusion batches small tensors into larger allreduce operations to reduce per-message overhead.
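Wiring this into a training script takes only a handful of calls. The following minimal PyTorch sketch follows Horovod's documented pattern; the model, data, and hyperparameters are placeholder stand-ins:

```python
import torch
import horovod.torch as hvd

hvd.init()                                           # start Horovod; establishes rank/size

# Pin each worker to one GPU if GPUs are present (falls back to CPU otherwise).
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(784, 10).to(device)          # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())  # scale LR by worker count

# Shard the (placeholder) dataset so each worker trains on a distinct slice.
dataset = torch.utils.data.TensorDataset(
    torch.randn(1024, 784), torch.randint(0, 10, (1024,)))
sampler = torch.utils.data.distributed.DistributedSampler(
    dataset, num_replicas=hvd.size(), rank=hvd.rank())
loader = torch.utils.data.DataLoader(dataset, batch_size=32, sampler=sampler)

# Wrap the optimizer so gradients are averaged across workers with allreduce,
# and start every worker from identical parameters and optimizer state.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

loss_fn = torch.nn.CrossEntropyLoss()
for epoch in range(2):
    sampler.set_epoch(epoch)                         # reshuffle shards each epoch
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()                             # gradients are allreduced before the update
```

A script like this is typically launched with horovodrun, for example horovodrun -np 4 python train.py for four local GPUs.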
Self-Hosting & Configuration
- Requires MPI (Open MPI recommended) or Gloo as the communication backend
- Install with framework flags: HOROVOD_WITH_PYTORCH=1 pip install horovod
- NCCL is recommended for multi-GPU training; build with HOROVOD_GPU_OPERATIONS=NCCL
- Use horovodrun or mpirun to launch distributed jobs (a build-and-launch sketch follows this list)
- Supports deployment on Kubernetes, Spark, Ray, and bare-metal clusters
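Putting the configuration bullets together, a typical build-and-launch sequence looks roughly like the following; the host names and GPU counts are placeholders:

```sh
# Build Horovod against PyTorch with NCCL-backed GPU operations
# (these are build-time environment variables read during pip install).
HOROVOD_WITH_PYTORCH=1 HOROVOD_GPU_OPERATIONS=NCCL pip install --no-cache-dir horovod

# Launch on a single machine with 4 GPUs.
horovodrun -np 4 python train.py

# Launch across two machines with 4 GPUs each (placeholder host names).
horovodrun -np 8 -H server1:4,server2:4 python train.py
```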
Key Features
- Near-linear scaling efficiency with ring-allreduce and tensor fusion
- Elastic training that adapts to worker additions and removals at runtime (see the elastic sketch after this list)
- Timeline profiling for debugging communication bottlenecks
- Auto-tuning for fusion buffer size and cycle time
- Integration with Spark MLlib for unified data and training pipelines
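Elastic Horovod wraps the training loop in a function decorated with hvd.elastic.run and keeps all trainable state in a state object (for PyTorch, hvd.elastic.TorchState) that is re-synchronized whenever the worker set changes. A rough sketch along the lines of the documented PyTorch pattern, with a placeholder model, random data, and an arbitrary commit interval:

```python
import torch
import horovod.torch as hvd

hvd.init()

model = torch.nn.Linear(784, 10)                     # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

@hvd.elastic.run
def train(state):
    # After a membership change, Horovod restores the last committed state
    # and re-enters this function, so the loops resume where they left off.
    for state.epoch in range(state.epoch, 3):
        for state.batch in range(state.batch, 20):
            x = torch.randn(32, 784)                 # placeholder batch
            y = torch.randint(0, 10, (32,))
            optimizer.zero_grad()
            loss = torch.nn.functional.cross_entropy(model(x), y)
            loss.backward()
            optimizer.step()
            if state.batch % 10 == 0:
                state.commit()                       # snapshot model/optimizer/progress
        state.batch = 0

# TorchState tracks model and optimizer tensors plus custom scalars like epoch/batch.
state = hvd.elastic.TorchState(model, optimizer, epoch=0, batch=0)
train(state)
```

Elastic jobs are launched with host-discovery options, for example horovodrun -np 4 --min-np 2 --max-np 8 --host-discovery-script ./discover_hosts.sh python train.py, where the discovery script prints the currently available hosts.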
Comparison with Similar Tools
- PyTorch DDP — Native PyTorch solution; Horovod offers multi-framework support
- DeepSpeed — Focuses on ZeRO memory optimization; Horovod is simpler for basic distribution
- Ray Train — Higher-level API with fault tolerance; Horovod provides lower-level MPI control
- tf.distribute — TensorFlow-only; Horovod gives a consistent API across frameworks
FAQ
Q: How many lines of code to distribute training? A: Typically 5-10 lines: initialize Horovod, wrap the optimizer, broadcast initial parameters from rank 0, and use a distributed data sampler (see the sketch under Architecture Overview).
Q: Does Horovod support elastic scaling? A: Yes. Elastic Horovod allows workers to join or leave during training, useful for spot/preemptible instances.
Q: What hardware is required? A: Any machine with NVIDIA GPUs and NCCL, or CPUs with Gloo. InfiniBand is supported for high-bandwidth clusters.
Q: Is Horovod still actively maintained? A: The project is in maintenance mode with community contributions. For new projects, consider PyTorch DDP or DeepSpeed.