
Horovod — Distributed Deep Learning Training Framework

A distributed training framework for TensorFlow, Keras, PyTorch, and MXNet that scales model training across multiple GPUs and nodes with minimal code changes.

Introduction

Horovod is a distributed deep learning training framework originally developed at Uber. It uses ring-allreduce and MPI concepts to scale training across multiple GPUs and machines with just a few lines of additional code, regardless of the deep learning framework being used.

What Horovod Does

  • Distributes training across GPUs with ring-allreduce gradient aggregation
  • Supports PyTorch, TensorFlow, Keras, and Apache MXNet
  • Scales from a single machine to hundreds of nodes with near-linear speedup
  • Provides Elastic Horovod for fault-tolerant training on preemptible instances
  • Works with Spark for data-parallel training on existing Spark clusters

Architecture Overview

Horovod wraps the training optimizer to intercept gradient tensors and perform allreduce across all workers. It uses NCCL for GPU-to-GPU communication and MPI or Gloo for coordination. The ring-allreduce algorithm divides gradient tensors into chunks and pipelines them around a logical ring, achieving bandwidth-optimal communication. Tensor Fusion batches small tensors together to reduce overhead.
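
In code, the wrapping described above amounts to a handful of calls. Below is a minimal PyTorch sketch following Horovod's documented pattern; MyModel, train_dataset, and the hyperparameters are placeholders, not part of Horovod:

    import torch
    import horovod.torch as hvd

    hvd.init()                               # one process per GPU
    torch.cuda.set_device(hvd.local_rank())  # pin this worker to its local GPU

    model = MyModel().cuda()                 # MyModel is a placeholder
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

    # Shard the dataset so each worker sees a distinct slice
    sampler = torch.utils.data.distributed.DistributedSampler(
        train_dataset, num_replicas=hvd.size(), rank=hvd.rank())
    loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=64, sampler=sampler)

    # Wrap the optimizer: gradients are allreduced across workers on step()
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters())

    # Start every worker from identical weights and optimizer state
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)

Scaling the learning rate by hvd.size() is the common convention for keeping the effective per-sample learning rate constant as the global batch size grows with the worker count.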

Self-Hosting & Configuration

  • Requires MPI (OpenMPI recommended) or Gloo as the communication backend
  • Install with framework flags: HOROVOD_WITH_PYTORCH=1 pip install horovod
  • NCCL required for multi-GPU training; set HOROVOD_GPU_OPERATIONS=NCCL
  • Use horovodrun or mpirun to launch distributed jobs (examples after this list)
  • Supports deployment on Kubernetes, Spark, Ray, and bare-metal clusters
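
A few illustrative launch commands; the hostnames, slot counts, and train.py are placeholders:

    # Verify which frameworks and controllers this build supports
    horovodrun --check-build

    # 4 GPUs on the local machine
    horovodrun -np 4 -H localhost:4 python train.py

    # 8 GPUs across two hosts, 4 slots each
    horovodrun -np 8 -H server1:4,server2:4 python train.py

    # CPU-only run using the Gloo backend
    horovodrun --gloo -np 2 python train.py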

Key Features

  • Near-linear scaling efficiency with ring-allreduce and tensor fusion
  • Elastic training that adapts to node additions and removals at runtime (sketched after this list)
  • Timeline profiling for debugging communication bottlenecks
  • Auto-tuning for fusion buffer size and cycle time
  • Integration with Spark MLlib for unified data and training pipelines
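
A minimal sketch of the elastic pattern using Horovod's PyTorch elastic API; model, optimizer, num_epochs, and train_one_epoch are placeholders:

    import horovod.torch as hvd

    hvd.init()

    # Wrap mutable training state so it can be restored if workers join or leave
    state = hvd.elastic.TorchState(model, optimizer, epoch=0)

    @hvd.elastic.run
    def train(state):
        # On a membership change, Horovod rolls back to the last commit
        # and re-enters this function with the restored state
        for state.epoch in range(state.epoch, num_epochs):
            train_one_epoch(model, optimizer)
            state.commit()  # mark a safe rollback point

    train(state)

state.commit() trades some throughput for safety: committing less often is faster, but more work is lost when a worker disappears.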

Comparison with Similar Tools

  • PyTorch DDP — Native PyTorch solution; Horovod offers multi-framework support
  • DeepSpeed — Focuses on ZeRO memory optimization; Horovod is simpler for basic distribution
  • Ray Train — Higher-level API with fault tolerance; Horovod provides lower-level MPI control
  • tf.distribute — TensorFlow-only; Horovod gives a consistent API across frameworks

FAQ

Q: How many lines of code does it take to distribute training? A: Typically 5-10 lines: initialize Horovod, wrap the optimizer, broadcast initial parameters, and adjust the data sampler.

Q: Does Horovod support elastic scaling? A: Yes. Elastic Horovod allows workers to join or leave during training, useful for spot/preemptible instances.

Q: What hardware is required? A: Any machine with NVIDIA GPUs and NCCL, or CPUs with Gloo. InfiniBand is supported for high-bandwidth clusters.

Q: Is Horovod still actively maintained? A: The project is in maintenance mode with community contributions. For new projects, consider PyTorch DDP or DeepSpeed.
