
Horovod — Distributed Deep Learning Training Framework

A distributed training framework for TensorFlow, Keras, PyTorch, and MXNet that scales model training across multiple GPUs and nodes with minimal code changes.

Introduction

Horovod is a distributed deep learning training framework originally developed at Uber. It uses ring-allreduce and MPI concepts to scale training across multiple GPUs and machines with just a few lines of additional code, regardless of the deep learning framework being used.

What Horovod Does

  • Distributes training across GPUs with ring-allreduce gradient aggregation
  • Supports PyTorch, TensorFlow, Keras, and Apache MXNet
  • Scales from a single machine to hundreds of nodes with near-linear speedup
  • Provides Elastic Horovod for fault-tolerant training on preemptible instances
  • Works with Spark for data-parallel training on existing Spark clusters

Architecture Overview

Horovod wraps the training optimizer to intercept gradient tensors and perform allreduce across all workers. It uses NCCL for GPU-to-GPU communication and MPI or Gloo for coordination. The ring-allreduce algorithm divides gradient tensors into chunks and pipelines them around a logical ring, achieving bandwidth-optimal communication. Tensor Fusion batches small tensors together to reduce overhead.
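
In code, the wrapping described above amounts to a handful of calls. Below is a minimal PyTorch sketch following Horovod's documented pattern; MyModel, train_dataset, and the hyperparameters are placeholders, not part of Horovod:

    import torch
    import horovod.torch as hvd

    hvd.init()                               # one process per GPU
    torch.cuda.set_device(hvd.local_rank())  # pin this worker to its local GPU

    model = MyModel().cuda()                 # MyModel is a placeholder
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

    # Shard the dataset so each worker sees a distinct slice
    sampler = torch.utils.data.distributed.DistributedSampler(
        train_dataset, num_replicas=hvd.size(), rank=hvd.rank())
    loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=64, sampler=sampler)

    # Wrap the optimizer: gradients are allreduced across workers on step()
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters())

    # Start every worker from identical weights and optimizer state
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)

Scaling the learning rate by hvd.size() is the common convention for keeping the effective per-sample learning rate constant as the global batch size grows with the worker count.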

Self-Hosting & Configuration

  • Requires MPI (OpenMPI recommended) or Gloo as the communication backend
  • Install with framework flags: HOROVOD_WITH_PYTORCH=1 pip install horovod
  • NCCL required for multi-GPU training; set HOROVOD_GPU_OPERATIONS=NCCL
  • Use horovodrun or mpirun to launch distributed jobs (examples after this list)
  • Supports deployment on Kubernetes, Spark, Ray, and bare-metal clusters
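
A few illustrative launch commands; the hostnames, slot counts, and train.py are placeholders:

    # Verify which frameworks and controllers this build supports
    horovodrun --check-build

    # 4 GPUs on the local machine
    horovodrun -np 4 -H localhost:4 python train.py

    # 8 GPUs across two hosts, 4 slots each
    horovodrun -np 8 -H server1:4,server2:4 python train.py

    # CPU-only run using the Gloo backend
    horovodrun --gloo -np 2 python train.py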

Key Features

  • Near-linear scaling efficiency with ring-allreduce and tensor fusion
  • Elastic training that adapts to node additions and removals at runtime (sketched after this list)
  • Timeline profiling for debugging communication bottlenecks
  • Auto-tuning for fusion buffer size and cycle time
  • Integration with Spark MLlib for unified data and training pipelines
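
A minimal sketch of the elastic pattern using Horovod's PyTorch elastic API; model, optimizer, num_epochs, and train_one_epoch are placeholders:

    import horovod.torch as hvd

    hvd.init()

    # Wrap mutable training state so it can be restored if workers join or leave
    state = hvd.elastic.TorchState(model, optimizer, epoch=0)

    @hvd.elastic.run
    def train(state):
        # On a membership change, Horovod rolls back to the last commit
        # and re-enters this function with the restored state
        for state.epoch in range(state.epoch, num_epochs):
            train_one_epoch(model, optimizer)
            state.commit()  # mark a safe rollback point

    train(state)

state.commit() trades some throughput for safety: committing less often is faster, but more work is lost when a worker disappears.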

Comparison with Similar Tools

  • PyTorch DDP — Native PyTorch solution; Horovod offers multi-framework support
  • DeepSpeed — Focuses on ZeRO memory optimization; Horovod is simpler for basic distribution
  • Ray Train — Higher-level API with fault tolerance; Horovod provides lower-level MPI control
  • tf.distribute — TensorFlow-only; Horovod gives a consistent API across frameworks

FAQ

Q: How many lines of code does it take to distribute training? A: Typically 5-10 lines: initialize Horovod, wrap the optimizer, broadcast initial parameters, and adjust the data sampler.

Q: Does Horovod support elastic scaling? A: Yes. Elastic Horovod allows workers to join or leave during training, useful for spot/preemptible instances.

Q: What hardware is required? A: Any machine with NVIDIA GPUs and NCCL, or CPUs with Gloo. InfiniBand is supported for high-bandwidth clusters.

Q: Is Horovod still actively maintained? A: The project is in maintenance mode with community contributions. For new projects, consider PyTorch DDP or DeepSpeed.
