Introduction
Ray is the open-source distributed computing framework that powers OpenAI's training infrastructure, Uber's ML platform, and Pinterest's ranking systems. Created at UC Berkeley (RISELab) and developed by Anyscale, Ray scales Python from a single laptop to a multi-thousand-node cluster with the same code.
With over 42,000 GitHub stars, Ray is among the most general-purpose AI infrastructure projects: distributed Python tasks (Ray Core), training (Ray Train), tuning (Ray Tune), serving (Ray Serve), and data processing (Ray Data), all on one unified runtime.
What Ray Does
Ray provides a @ray.remote decorator that turns Python functions and classes into distributed tasks/actors. The Ray runtime handles scheduling, fault tolerance, and inter-process communication. Higher-level libraries (Train, Tune, Serve, Data, RLlib) sit on top of this core for ML-specific workflows.
Architecture Overview
[Driver Process]
  Your Python script
        |
[Ray Cluster Runtime]
  Head node + worker nodes
        |
    +----------+----------+----------+----------+
    |          |          |          |          |
 Ray Core    Train      Tune       Serve      Data
 tasks /     distrib.   hyper-     online     distrib.
 actors      training   param      model      data
             (PyTorch,  search     serving    processing
             Lightning, ...)
        |
[Object Store (Plasma)]
  zero-copy shared memory between tasks
        |
[Autoscaler]
  add/remove EC2/GCE/Kubernetes nodes

Self-Hosting & Configuration
# Distributed actors
import ray

@ray.remote
class Counter:
    def __init__(self):
        self.n = 0

    def add(self, x):
        self.n += x
        return self.n

c = Counter.remote()
# calls on one actor execute in order, so the counter accumulates
print(ray.get([c.add.remote(i) for i in range(10)]))  # [0, 1, 3, 6, 10, 15, 21, 28, 36, 45]
# Distributed training with Ray Train (PyTorch DDP)
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_func():
    import torch.nn as nn
    model = nn.Linear(10, 1)
    # ... full DDP setup (model wrapping, data sharding) handled by Ray Train

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)
trainer.fit()
# Online model serving with Ray Serve
from ray import serve

@serve.deployment(num_replicas=4)
class Model:
    def __call__(self, request):
        return {"result": "hello"}

# in Ray 2.x, route_prefix is passed to serve.run, not the decorator
serve.run(Model.bind(), route_prefix="/predict")

Key Features
- Ray Core — distributed tasks/actors with @ray.remote
- Ray Train — distributed training (PyTorch, TensorFlow, XGBoost)
- Ray Tune — hyperparameter search at scale (ASHA, BOHB, HyperOpt)
- Ray Serve — production model serving with autoscaling
- Ray Data — distributed data preprocessing for ML pipelines
- RLlib — reinforcement learning library at industrial scale
- Autoscaler — managed cluster expansion on AWS/GCP/Azure/Kubernetes
- Object store — zero-copy shared memory for fast intra-cluster transfers
Comparison with Similar Tools
| Feature | Ray | Dask | Spark | Modal | Celery |
|---|---|---|---|---|---|
| Python-native | Yes | Yes | Wrapper | Yes | Yes |
| ML libraries | Train/Tune/Serve/RLlib | dask-ml | MLlib | Custom | None |
| Online serving | Yes (Ray Serve) | No | No | Yes | No |
| Stateful actors | Yes | Limited | Limited | Limited | No |
| Cluster management | Built-in autoscaler | Limited | Yarn/K8s | Managed | None |
| Best For | AI/ML at scale | Pythonic data science | Big data ETL | Serverless GPUs | Background jobs |
FAQ
Q: Ray vs Dask? A: Dask is great for parallelizing pandas/NumPy workflows. Ray is broader: actors, distributed RL, online serving, training. For pure DataFrame work, Dask is simpler; for ML platforms, Ray is the standard.
Q: Ray vs Spark? A: Spark dominates traditional big-data ETL. Ray dominates ML training and serving. Many platforms run both: Spark for upstream data prep, Ray for downstream training.
Q: Do I need Anyscale? A: No — Ray is fully open source and runs on your own infrastructure (laptop, EC2, Kubernetes, KubeRay). Anyscale offers a managed service if you don't want to operate clusters.
Q: How does it scale? A: From single-machine multiprocessing to thousands of nodes with the same code. The autoscaler talks to AWS/GCP/Azure or KubeRay (Kubernetes operator) to add/remove workers based on demand.
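For self-hosted clusters, the autoscaler is driven by a cluster YAML launched with `ray up`. A minimal illustrative sketch (instance types, region, and node-type names are placeholders; consult the cluster launcher docs for your provider):

```yaml
cluster_name: demo
max_workers: 10
provider:
  type: aws
  region: us-west-2
available_node_types:
  head:
    node_config:
      InstanceType: m5.large
  gpu_workers:
    min_workers: 0
    max_workers: 10
    node_config:
      InstanceType: g4dn.xlarge
head_node_type: head
```

With `min_workers: 0`, GPU workers exist only while there is demand; the autoscaler adds and removes them as tasks queue up and drain.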
Sources
- GitHub: https://github.com/ray-project/ray
- Docs: https://docs.ray.io
- Company: Anyscale
- License: Apache-2.0