# Horovod — Distributed Deep Learning Training Framework

> A distributed training framework for TensorFlow, Keras, PyTorch, and MXNet that scales model training across multiple GPUs and nodes with minimal code changes.

## Quick Use

```bash
pip install horovod
horovodrun -np 4 python train.py
```

```python
import horovod.torch as hvd

hvd.init()
optimizer = hvd.DistributedOptimizer(optimizer)
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
```

A fuller version of this snippet appears under the worked example below.

## Introduction

Horovod is a distributed deep learning training framework originally developed at Uber. It uses ring-allreduce and MPI concepts to scale training across multiple GPUs and machines with only a few lines of additional code, regardless of the deep learning framework in use.

## What Horovod Does

- Distributes training across GPUs with ring-allreduce gradient aggregation
- Supports PyTorch, TensorFlow, Keras, and Apache MXNet
- Scales from a single machine to hundreds of nodes with near-linear speedup
- Provides Elastic Horovod for fault-tolerant training on preemptible instances
- Works with Spark for data-parallel training on existing Spark clusters

## Architecture Overview

Horovod wraps the training optimizer to intercept gradient tensors and perform allreduce across all workers. It uses NCCL for GPU-to-GPU communication and MPI or Gloo for coordination. The ring-allreduce algorithm divides gradient tensors into chunks and pipelines them around a logical ring, achieving bandwidth-optimal communication (a single-process simulation appears in the ring-allreduce sketch below). Tensor Fusion batches small tensors together to reduce per-operation overhead.

## Self-Hosting & Configuration

- Requires MPI (Open MPI recommended) or Gloo as the communication backend
- Install with framework flags, e.g. `HOROVOD_WITH_PYTORCH=1 pip install horovod`
- NCCL is required for multi-GPU training; set `HOROVOD_GPU_OPERATIONS=NCCL`
- Use `horovodrun` or `mpirun` to launch distributed jobs (combined build and launch commands appear below)
- Supports deployment on Kubernetes, Spark, Ray, and bare-metal clusters

## Key Features

- Near-linear scaling efficiency with ring-allreduce and tensor fusion
- Elastic training that adapts to node additions and removals at runtime (see the elastic training sketch below)
- Timeline profiling for debugging communication bottlenecks
- Auto-tuning of fusion buffer size and cycle time (see Profiling and Auto-Tuning below)
- Integration with Spark MLlib for unified data and training pipelines

## Comparison with Similar Tools

- **PyTorch DDP** — Native PyTorch solution; Horovod offers multi-framework support
- **DeepSpeed** — Focuses on ZeRO memory optimization; Horovod is simpler for basic data parallelism
- **Ray Train** — Higher-level API with fault tolerance; Horovod provides lower-level MPI control
- **tf.distribute** — TensorFlow-only; Horovod gives a consistent API across frameworks

## FAQ

**Q: How many lines of code does it take to distribute training?**
A: Typically 5–10 lines: init, wrap the optimizer, broadcast initial parameters, and adjust the data sampler (see the worked example below).

**Q: Does Horovod support elastic scaling?**
A: Yes. Elastic Horovod allows workers to join or leave during training, which is useful on spot/preemptible instances.

**Q: What hardware is required?**
A: Any machine with NVIDIA GPUs and NCCL, or CPUs with Gloo. InfiniBand is supported for high-bandwidth clusters.

**Q: Is Horovod still actively maintained?**
A: The project is in maintenance mode with community contributions. For new projects, consider PyTorch DDP or DeepSpeed.
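## Worked Example: Minimal PyTorch Training Script

The Quick Use snippet elides the data-sharding and state-broadcast steps mentioned in the FAQ. Below is a minimal, self-contained sketch following the steps documented for PyTorch (init, shard data, scale the learning rate, wrap the optimizer, broadcast initial state); the dataset, model, and hyperparameters are synthetic placeholders for illustration.

```python
import torch
import torch.nn as nn
import torch.utils.data as data
import horovod.torch as hvd

hvd.init()
# Pin each process to one GPU, if available.
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())

# Synthetic regression data, purely for illustration.
features = torch.randn(1024, 10)
targets = torch.randn(1024, 1)
dataset = data.TensorDataset(features, targets)

# Shard the dataset so each worker sees a distinct partition.
sampler = data.DistributedSampler(
    dataset, num_replicas=hvd.size(), rank=hvd.rank())
loader = data.DataLoader(dataset, batch_size=32, sampler=sampler)

model = nn.Linear(10, 1)
if torch.cuda.is_available():
    model.cuda()

# Scale the learning rate by the number of workers (the common convention).
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are allreduced across workers.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# Start all workers from identical model and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

loss_fn = nn.MSELoss()
for epoch in range(2):
    sampler.set_epoch(epoch)  # reshuffle shards each epoch
    for x, y in loader:
        if torch.cuda.is_available():
            x, y = x.cuda(), y.cuda()
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    if hvd.rank() == 0:
        print(f"epoch {epoch}: loss {loss.item():.4f}")
```

Run it with `horovodrun -np 4 python train.py`. The Horovod-specific calls here (`init`, the sharded sampler, `DistributedOptimizer`, and the two broadcasts) are the "5–10 lines" the FAQ refers to.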
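## Ring-Allreduce Sketch

To make the bandwidth-optimality claim in the Architecture Overview concrete, here is a single-process NumPy simulation of ring-allreduce. This is not Horovod code, just an illustrative sketch: across the N−1 reduce-scatter steps and N−1 allgather steps, every worker transmits 1/N of the data per step, so each sends 2(N−1)/N ≈ 2 units of data in total, independent of the number of workers.

```python
import numpy as np

def ring_allreduce(worker_data):
    """Simulate ring-allreduce among N workers.

    worker_data: list of equal-length 1-D arrays, one per worker.
    Returns the per-worker results, all equal to the elementwise sum.
    """
    n = len(worker_data)
    # Each worker's private copy, split into n chunks.
    chunks = [np.array_split(d.astype(float), n) for d in worker_data]

    # Phase 1: reduce-scatter. In step s, worker i sends chunk (i - s) % n
    # to its right neighbour, which accumulates it. After n - 1 steps,
    # worker i holds the fully reduced chunk (i + 1) % n.
    for s in range(n - 1):
        sends = [(i, (i - s) % n, chunks[i][(i - s) % n].copy())
                 for i in range(n)]
        for i, c, payload in sends:
            chunks[(i + 1) % n][c] += payload

    # Phase 2: allgather. Each reduced chunk circulates around the ring
    # until every worker holds every fully reduced chunk.
    for s in range(n - 1):
        sends = [(i, (i + 1 - s) % n, chunks[i][(i + 1 - s) % n].copy())
                 for i in range(n)]
        for i, c, payload in sends:
            chunks[(i + 1) % n][c] = payload

    return [np.concatenate(c) for c in chunks]

# Four workers, each with distinct data; all end up with the same sum.
workers = [np.arange(12, dtype=float) * (r + 1) for r in range(4)]
results = ring_allreduce(workers)
assert all(np.allclose(r, sum(workers)) for r in results)
```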
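## Build and Launch Commands

The configuration flags listed under Self-Hosting & Configuration combine as follows. These commands mirror Horovod's installation and MPI documentation; the host names and slot counts are placeholders for your own cluster.

```bash
# Build Horovod against PyTorch with NCCL for GPU allreduce.
HOROVOD_WITH_PYTORCH=1 HOROVOD_GPU_OPERATIONS=NCCL \
    pip install --no-cache-dir horovod

# Launch 8 workers across two 4-GPU hosts with horovodrun.
horovodrun -np 8 -H server1:4,server2:4 python train.py

# Equivalent mpirun invocation (Open MPI; exact flags vary by cluster).
mpirun -np 8 -H server1:4,server2:4 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    python train.py
```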
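## Profiling and Auto-Tuning

The timeline profiling and auto-tuning mentioned under Key Features are enabled through environment variables documented by Horovod; the output path below is a placeholder.

```bash
# Record a timeline of communication operations for later inspection.
HOROVOD_TIMELINE=/tmp/timeline.json horovodrun -np 4 python train.py

# Let Horovod search for good fusion-buffer and cycle-time settings.
HOROVOD_AUTOTUNE=1 horovodrun -np 4 python train.py
```

The resulting timeline file can be loaded in a Chrome-trace viewer (e.g. `chrome://tracing`) to spot communication bottlenecks.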
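## Elastic Training Sketch

Elastic Horovod wraps the training loop in a function that is restarted when workers join or leave. A minimal sketch of the documented pattern follows; the model, optimizer, and placeholder batches are illustrative stand-ins.

```python
import torch
import torch.nn.functional as F
import horovod.torch as hvd

hvd.init()

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

@hvd.elastic.run
def train(state):
    # After a worker joins or leaves, training resumes here from the
    # last committed state rather than from scratch.
    for state.epoch in range(state.epoch, 10):
        x, y = torch.randn(32, 10), torch.randn(32, 1)  # placeholder batch
        optimizer.zero_grad()
        F.mse_loss(model(x), y).backward()
        optimizer.step()
        state.commit()  # checkpoint; a restart loses at most one epoch

state = hvd.elastic.TorchState(model=model, optimizer=optimizer, epoch=0)
train(state)
```

Elastic jobs are launched with worker bounds and a site-specific host discovery script, along the lines of `horovodrun -np 4 --min-np 2 --max-np 8 --host-discovery-script ./discover_hosts.sh python train.py` (flag spellings vary by release; newer versions also accept `--min-num-proc`/`--max-num-proc`).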
## Sources

- https://github.com/horovod/horovod
- https://horovod.readthedocs.io