
Horovod — Distributed Deep Learning Training Framework

A distributed training framework for TensorFlow, Keras, PyTorch, and MXNet that scales model training across multiple GPUs and nodes with minimal code changes.

Introduction

Horovod is a distributed deep learning training framework originally developed at Uber. It uses ring-allreduce and MPI concepts to scale training across multiple GPUs and machines with just a few lines of additional code, regardless of the deep learning framework being used.
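
The few extra lines look like this for PyTorch. This is a minimal sketch using a trivial stand-in model; the Horovod calls (hvd.init, DistributedOptimizer, broadcast_parameters) are the standard integration points, while the model, learning rate, and data handling are placeholders:

    import torch
    import horovod.torch as hvd

    hvd.init()                                   # 1. initialize Horovod
    if torch.cuda.is_available():
        torch.cuda.set_device(hvd.local_rank())  # 2. pin each process to one GPU

    model = torch.nn.Linear(10, 1)               # stand-in for a real model
    # scale the learning rate by the number of workers (a common convention)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

    # 3. wrap the optimizer so gradients are allreduced across all workers
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters())

    # 4. broadcast initial parameters and optimizer state from rank 0
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)

    # ... the training loop itself is unchanged; each worker trains on its own shard

Launched with horovodrun (see Self-Hosting & Configuration below), every process runs this same script against its own slice of the data.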

What Horovod Does

  • Distributes training across GPUs with ring-allreduce gradient aggregation (see the allreduce sketch after this list)
  • Supports PyTorch, TensorFlow, Keras, and Apache MXNet
  • Scales from a single machine to hundreds of nodes with near-linear speedup
  • Provides Elastic Horovod for fault-tolerant training on preemptible instances
  • Works with Spark for data-parallel training on existing Spark clusters
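
Beyond the automatic gradient aggregation, the same allreduce collective can be called directly, for example to average a validation metric across workers. A short sketch assuming horovod.torch and a made-up local_accuracy value:

    import torch
    import horovod.torch as hvd

    hvd.init()

    local_accuracy = 0.87  # placeholder: computed on this worker's data shard

    # allreduce averages the tensor across all workers by default
    avg_accuracy = hvd.allreduce(torch.tensor(local_accuracy), name='val_accuracy')
    print(f"rank {hvd.rank()}: global accuracy {avg_accuracy.item():.4f}")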

Architecture Overview

Horovod wraps the training optimizer to intercept gradient tensors and perform allreduce across all workers. It uses NCCL for GPU-to-GPU communication and MPI or Gloo for coordination. The ring-allreduce algorithm divides gradient tensors into chunks and pipelines them around a logical ring, achieving bandwidth-optimal communication. Tensor Fusion batches small tensors together to reduce overhead.
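
To make the bandwidth-optimal claim concrete, the following is a small, single-process simulation of the ring-allreduce idea (a reduce-scatter pass followed by an allgather pass over N chunks). It illustrates the algorithm only and is not Horovod's actual NCCL/MPI implementation:

    import numpy as np

    def ring_allreduce(grads):
        """Simulate ring-allreduce over equal-length gradient vectors, one per worker."""
        n = len(grads)
        chunks = [list(np.array_split(np.asarray(g, dtype=float), n)) for g in grads]

        # Phase 1: reduce-scatter. At step s, worker i sends chunk (i - s) % n to
        # worker (i + 1) % n, which adds it to its own copy of that chunk.
        for s in range(n - 1):
            for i in range(n):
                c = (i - s) % n
                chunks[(i + 1) % n][c] = chunks[(i + 1) % n][c] + chunks[i][c]

        # After reduce-scatter, worker i holds the fully summed chunk (i + 1) % n.
        # Phase 2: allgather. At step s, worker i forwards chunk (i + 1 - s) % n
        # around the ring, and the receiver simply copies it.
        for s in range(n - 1):
            for i in range(n):
                c = (i + 1 - s) % n
                chunks[(i + 1) % n][c] = chunks[i][c]

        return [np.concatenate(w) for w in chunks]

    # Four "workers", each holding a local gradient vector of length 8.
    grads = [np.arange(8) * (w + 1) for w in range(4)]
    reduced = ring_allreduce(grads)
    assert all(np.allclose(r, sum(grads)) for r in reduced)  # every worker has the sum

Each worker sends roughly twice its gradient size in total, independent of the number of workers, which is why the scheme is bandwidth-optimal.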

Self-Hosting & Configuration

  • Requires MPI (OpenMPI recommended) or Gloo as the communication backend
  • Install with framework flags: HOROVOD_WITH_PYTORCH=1 pip install horovod
  • NCCL required for multi-GPU training; set HOROVOD_GPU_OPERATIONS=NCCL (a build-check sketch follows this list)
  • Use horovodrun or mpirun to launch distributed jobs
  • Supports deployment on Kubernetes, Spark, Ray, and bare-metal clusters
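
Once installed, a quick way to confirm which backends the wheel was actually built with is to query Horovod's introspection helpers (horovodrun --check-build should report the same information from the command line). A short sketch assuming the horovod.torch module:

    import horovod.torch as hvd

    # Each reports whether the corresponding support was compiled into this install.
    print("NCCL built:", hvd.nccl_built())
    print("Gloo built:", hvd.gloo_built())
    print("MPI built: ", hvd.mpi_built())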

Key Features

  • Near-linear scaling efficiency with ring-allreduce and tensor fusion
  • Elastic training that adapts to node additions and removals at runtime
  • Timeline profiling for debugging communication bottlenecks
  • Auto-tuning for fusion buffer size and cycle time
  • Integration with Spark MLlib for unified data and training pipelines

Comparison with Similar Tools

  • PyTorch DDP — Native PyTorch solution; Horovod offers multi-framework support
  • DeepSpeed — Focuses on ZeRO memory optimization; Horovod is simpler for basic distribution
  • Ray Train — Higher-level API with fault tolerance; Horovod provides lower-level MPI control
  • tf.distribute — TensorFlow-only; Horovod gives a consistent API across frameworks

FAQ

Q: How many lines of code does it take to distribute training? A: Typically 5-10: initialize Horovod, wrap the optimizer, broadcast the initial parameters, and adjust the data sampler (as in the Introduction sketch above).

Q: Does Horovod support elastic scaling? A: Yes. Elastic Horovod allows workers to join or leave during training, useful for spot/preemptible instances.
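
A hedged sketch of the Elastic Horovod pattern for PyTorch: the training loop runs inside a function decorated with hvd.elastic.run, and a TorchState object is committed periodically so progress can be restored when workers join or leave. The model and optimizer here are trivial stand-ins:

    import torch
    import horovod.torch as hvd

    hvd.init()
    model = torch.nn.Linear(10, 1)
    optimizer = hvd.DistributedOptimizer(
        torch.optim.SGD(model.parameters(), lr=0.01),
        named_parameters=model.named_parameters())

    # epoch is kept in the state object so a rejoining worker resumes correctly
    state = hvd.elastic.TorchState(model=model, optimizer=optimizer, epoch=0)

    @hvd.elastic.run
    def train(state):
        for epoch in range(state.epoch, 10):
            # ... one epoch of training on this worker's shard ...
            state.epoch = epoch + 1
            state.commit()  # in-memory checkpoint; rolled back if a worker drops out

    train(state)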

Q: What hardware is required? A: Any machine with NVIDIA GPUs and NCCL, or CPU-only machines using Gloo. InfiniBand is supported for high-bandwidth clusters.

Q: Is Horovod still actively maintained? A: The project is in maintenance mode with community contributions. For new projects, consider PyTorch DDP or DeepSpeed.
