Scripts · Jul 1, 2026 · 3 min read

KubeRay — Run Ray Distributed Computing on Kubernetes

KubeRay is a Kubernetes operator that manages Ray clusters on Kubernetes, enabling distributed AI training, serving, and data processing workloads with automatic scaling and lifecycle management.

Direct install command
npx -y tokrepo@latest install 7f97f9e3-7520-11f1-9bc6-00163e2b0d79 --target codex

Run after confirming the plan with a dry run.

Introduction

KubeRay brings the Ray distributed computing framework to Kubernetes as a first-class citizen. Ray is widely used for distributed training, model serving (Ray Serve), and data processing, but managing Ray clusters manually is complex. KubeRay automates cluster provisioning, scaling, fault recovery, and upgrades through Kubernetes-native custom resources.

What KubeRay Does

  • Deploys and manages Ray clusters on Kubernetes via CRDs
  • Provides RayCluster, RayJob, and RayService custom resources
  • Auto-scales Ray worker nodes based on workload demand
  • Recreates failed head and worker pods automatically
  • Integrates with Kubernetes scheduling, RBAC, and resource quotas

Architecture Overview

KubeRay consists of four parts: the KubeRay Operator, a controller that watches the custom resources and reconciles cluster state; the RayCluster CRD, which declares a Ray head and one or more worker groups; the RayJob CRD, which submits a one-off job to a managed cluster; and the RayService CRD, which deploys a long-running Ray Serve application with rolling upgrades. The operator creates pods, services, and ingress resources to match the desired state. When autoscaling is enabled, the Ray autoscaler adjusts the desired worker replica counts and the operator reconciles the worker pods to match.
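
The reconcile idea described above can be illustrated with a small Python sketch. This is a conceptual toy, not KubeRay's actual controller code (the real operator is a Kubernetes controller and is far more involved); the `WorkerGroup` type and the `reconcile` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerGroup:
    """Desired state for one worker group (hypothetical, simplified)."""
    name: str
    replicas: int


def reconcile(desired: list[WorkerGroup], observed: dict[str, int]) -> list[str]:
    """Return the actions that move observed pod counts toward the desired ones."""
    actions = []
    for group in desired:
        running = observed.get(group.name, 0)
        diff = group.replicas - running
        if diff > 0:
            actions.append(f"create {diff} pod(s) in {group.name}")
        elif diff < 0:
            actions.append(f"delete {-diff} pod(s) in {group.name}")
    # Pods whose group no longer appears in the spec are removed entirely.
    desired_names = {group.name for group in desired}
    for name, running in observed.items():
        if running > 0 and name not in desired_names:
            actions.append(f"delete {running} pod(s) in {name}")
    return actions


if __name__ == "__main__":
    desired = [WorkerGroup("cpu-workers", 3), WorkerGroup("gpu-workers", 1)]
    observed = {"cpu-workers": 1, "gpu-workers": 2}
    for action in reconcile(desired, observed):
        print(action)
```

A real controller runs this kind of comparison repeatedly, triggered by changes to the watched resources, so that the cluster converges on the declared state even after failures.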

Self-Hosting & Configuration

  • Deploy the KubeRay operator via Helm into a dedicated namespace
  • Define RayCluster resources with head node and worker group specs including GPU requests
  • Configure Ray autoscaler parameters for dynamic worker scaling
  • Set resource limits and node affinity for GPU and CPU worker pools
  • Use RayService for production serving with zero-downtime upgrades
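
As a rough sketch of what a RayCluster definition with autoscaling might look like, the Python below builds the manifest as a plain dict and prints it as JSON, which `kubectl apply -f -` accepts alongside YAML. The cluster name, image tag, resource sizes, and replica bounds are illustrative assumptions, and the `ray.io/v1` API version assumes a reasonably recent KubeRay release; check the fields against the CRD version installed in your cluster.

```python
import json


def ray_container(name, image, cpu, memory):
    """Build one container spec with equal requests and limits."""
    resources = {"cpu": cpu, "memory": memory}
    return {
        "name": name,
        "image": image,
        "resources": {"requests": dict(resources), "limits": dict(resources)},
    }


def build_ray_cluster(name, ray_version="2.9.0", min_workers=0, max_workers=5):
    """Return a RayCluster manifest as a plain dict (illustrative values)."""
    image = f"rayproject/ray:{ray_version}"  # assumed image tag
    head = ray_container("ray-head", image, "2", "4Gi")
    worker = ray_container("ray-worker", image, "1", "2Gi")
    return {
        "apiVersion": "ray.io/v1",
        "kind": "RayCluster",
        "metadata": {"name": name},
        "spec": {
            "rayVersion": ray_version,
            # Lets the Ray autoscaler drive the worker replica counts.
            "enableInTreeAutoscaling": True,
            "headGroupSpec": {
                "rayStartParams": {},
                "template": {"spec": {"containers": [head]}},
            },
            "workerGroupSpecs": [
                {
                    "groupName": "cpu-workers",  # hypothetical group name
                    "replicas": min_workers,
                    "minReplicas": min_workers,
                    "maxReplicas": max_workers,
                    "rayStartParams": {},
                    "template": {"spec": {"containers": [worker]}},
                }
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_ray_cluster("demo-cluster"), indent=2))
```

Generating the manifest from code is only one option; the same structure can be written directly as YAML and kept in version control.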

Key Features

  • Three CRDs cover clusters, batch jobs, and serving workloads
  • Autoscaling integrates Ray's built-in autoscaler with Kubernetes pod scheduling
  • Rolling upgrades for Ray Serve applications with zero-downtime deployments
  • GPU scheduling support for distributed training and inference workloads
  • Compatible with cloud-managed Kubernetes and bare-metal clusters

Comparison with Similar Tools

  • Manual Ray deployment: requires hand-managed VMs or containers and offers no automatic recovery
  • Ray on Spark: runs Ray inside Spark clusters, with a different resource model
  • Kubeflow: a broader ML platform with its own training operators; KubeRay focuses specifically on Ray
  • Volcano: a batch scheduler that can coexist with KubeRay to gang-schedule Ray jobs

FAQ

Q: Do I need to modify my Ray code to use KubeRay? A: No. Your existing Ray scripts run unchanged. KubeRay handles the infrastructure; Ray code connects to the head node as usual.
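
One way to run an unmodified script is to wrap it in a RayJob whose `entrypoint` is the usual command line. The sketch below builds such a manifest in Python; the job name, script path, and image tag are hypothetical, and the script is assumed to be present in the container image or mounted into the pods.

```python
import json


def pod_template(container_name, image):
    """Minimal pod template with a single Ray container."""
    return {"spec": {"containers": [{"name": container_name, "image": image}]}}


def build_ray_job(name, entrypoint, ray_version="2.9.0"):
    """Return a RayJob manifest that runs `entrypoint` on a fresh cluster."""
    image = f"rayproject/ray:{ray_version}"  # assumed image tag
    return {
        "apiVersion": "ray.io/v1",
        "kind": "RayJob",
        "metadata": {"name": name},
        "spec": {
            # The same command you would run by hand against a Ray cluster.
            "entrypoint": entrypoint,
            # Tear the cluster down once the job has finished.
            "shutdownAfterJobFinishes": True,
            "rayClusterSpec": {
                "rayVersion": ray_version,
                "headGroupSpec": {
                    "rayStartParams": {},
                    "template": pod_template("ray-head", image),
                },
                "workerGroupSpecs": [
                    {
                        "groupName": "workers",  # hypothetical group name
                        "replicas": 2,
                        "rayStartParams": {},
                        "template": pod_template("ray-worker", image),
                    }
                ],
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical script path inside the container image.
    job = build_ray_job("train-job", "python /home/ray/app/train.py")
    print(json.dumps(job, indent=2))
```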

Q: How does KubeRay handle GPU scheduling? A: Worker group specs accept standard Kubernetes resource requests including nvidia.com/gpu. The operator creates pods with GPU requests, and Kubernetes schedules them onto GPU nodes.
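
A GPU worker group might be described along these lines. The helper below is a sketch: the group name, the `-gpu` image tag, and the node selector label are assumptions for illustration, and the `nvidia.com/gpu` resource is only schedulable when the NVIDIA device plugin is running on the GPU nodes.

```python
def gpu_worker_group(name, gpus_per_worker, max_replicas,
                     image="rayproject/ray:2.9.0-gpu", node_selector=None):
    """Return one workerGroupSpecs entry that requests GPUs per worker pod."""
    if gpus_per_worker < 1:
        raise ValueError("gpus_per_worker must be at least 1")
    pod_spec = {
        "containers": [
            {
                "name": "ray-worker",
                "image": image,  # assumed GPU-enabled image tag
                # Kubernetes expects extended resources such as GPUs in limits.
                "resources": {"limits": {"nvidia.com/gpu": gpus_per_worker}},
            }
        ]
    }
    if node_selector:
        # Optional: pin the group to a specific GPU node pool.
        pod_spec["nodeSelector"] = dict(node_selector)
    return {
        "groupName": name,
        "replicas": 0,
        "minReplicas": 0,
        "maxReplicas": max_replicas,
        "rayStartParams": {},
        "template": {"spec": pod_spec},
    }


if __name__ == "__main__":
    # The label key and value here are hypothetical.
    group = gpu_worker_group("gpu-workers", 1, 4, node_selector={"pool": "gpu"})
    print(group["template"]["spec"]["containers"][0]["resources"])
```

The returned dict is meant to be appended to the `workerGroupSpecs` list of a RayCluster manifest.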

Q: Can I run Ray Serve behind an ingress? A: Yes. KubeRay creates a head service that you can expose via Ingress or Gateway API for external traffic to Ray Serve endpoints.

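Q: What happens when the Ray head node crashes? A: KubeRay detects the failure and recreates the head pod. With GCS fault tolerance enabled, which requires explicit configuration and an external Redis instance, workers can reconnect to the new head without restarting. Without it, a head failure takes the whole Ray cluster down with it.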
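
Where GCS fault tolerance has to be configured explicitly, the change to a RayCluster manifest might look like the sketch below. It assumes a KubeRay release whose RayCluster CRD includes the `gcsFaultToleranceOptions` field, and the Redis address is a placeholder for a Redis instance you run yourself; older releases are configured differently, so verify the field against your installed CRD before relying on it.

```python
import copy


def with_gcs_fault_tolerance(ray_cluster, redis_address):
    """Return a copy of a RayCluster manifest with GCS fault tolerance set.

    Assumes the installed CRD supports `spec.gcsFaultToleranceOptions`.
    """
    patched = copy.deepcopy(ray_cluster)  # leave the caller's dict untouched
    spec = patched.setdefault("spec", {})
    spec["gcsFaultToleranceOptions"] = {"redisAddress": redis_address}
    return patched


if __name__ == "__main__":
    base = {
        "apiVersion": "ray.io/v1",
        "kind": "RayCluster",
        "metadata": {"name": "demo-cluster"},
        "spec": {"rayVersion": "2.9.0"},
    }
    # "redis:6379" is a placeholder address for an external Redis service.
    patched = with_gcs_fault_tolerance(base, "redis:6379")
    print(patched["spec"]["gcsFaultToleranceOptions"])
```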
