Scripts · April 14, 2026 · 1 min read

Ray — Distributed Computing for Python and AI Workloads

Ray is a unified framework for scaling Python and AI applications. From distributed training and hyperparameter search to large-scale data processing and model serving — Ray powers the infrastructure behind ChatGPT, Uber, and Pinterest.


Introduction

Ray is the open-source distributed computing framework that powers OpenAI's training infrastructure, Uber's ML platform, and Pinterest's ranking systems. Created at UC Berkeley (RISELab) and developed by Anyscale, Ray scales Python from a single laptop to a multi-thousand-node cluster with the same code.

With over 42,000 GitHub stars, Ray is among the most general-purpose AI infrastructure projects: distributed Python tasks (Ray Core), training (Ray Train), tuning (Ray Tune), serving (Ray Serve), and data processing (Ray Data) — all on one unified runtime.

What Ray Does

Ray provides a @ray.remote decorator that turns Python functions and classes into distributed tasks/actors. The Ray runtime handles scheduling, fault tolerance, and inter-process communication. Higher-level libraries (Train, Tune, Serve, Data, RLlib) sit on top of this core for ML-specific workflows.
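Conceptually, `@ray.remote` turns a call into a future that you resolve later with `ray.get`. The sketch below mimics that call shape in plain standard-library Python — a thread pool on one machine, not Ray's API or its cross-machine scheduling — just to show the pattern Ray generalizes:

```python
from concurrent.futures import ThreadPoolExecutor, Future

# Toy stand-in for Ray's call shape: fn.remote(...) returns a
# future immediately, and get(...) blocks until results are ready.
# Ray schedules these across processes and machines; this sketch
# uses a local thread pool purely to illustrate the pattern.
_pool = ThreadPoolExecutor(max_workers=4)

def remote(fn):
    class Handle:
        @staticmethod
        def remote(*args, **kwargs):
            return _pool.submit(fn, *args, **kwargs)
    return Handle

def get(refs):
    if isinstance(refs, Future):
        return refs.result()
    return [r.result() for r in refs]

@remote
def square(x):
    return x * x

refs = [square.remote(i) for i in range(4)]  # returns immediately
print(get(refs))  # [0, 1, 4, 9]
```

With real Ray, the decorator and `get` come from the library (`import ray`), and the same two-step call shape dispatches work to any node in the cluster.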

Architecture Overview

[Driver Process]
  Your Python script
      |
[Ray Cluster Runtime]
  Head node + worker nodes
      |
  +---------+---------+---------+---------+
  |         |         |         |         |
Ray Core   Train     Tune      Serve     Data
 tasks /   distrib.  hyper-    online    distrib.
 actors    training  param     model     data
           (PyTorch, search    serving   processing
           Lightning, ...)
      |
[Object Store + Plasma]
  zero-copy shared memory between tasks
      |
[Autoscaler]
  add/remove EC2/GCE/Kubernetes nodes
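The object store in the diagram is what lets large objects pass between tasks without repeated serialization. A rough single-machine illustration of the zero-copy idea, using the standard library's shared memory (Ray's Plasma-based store is far more sophisticated; this only shows the concept of attaching to a shared segment rather than copying the payload):

```python
from multiprocessing import shared_memory

# A "producer" writes a payload into a named shared-memory segment.
payload = b"x" * 1_000_000
shm = shared_memory.SharedMemory(create=True, size=len(payload))
shm.buf[: len(payload)] = payload  # the one copy in

# A "consumer" attaches to the same segment by name -- attaching
# copies no data; reads go straight to the shared pages.
view = shared_memory.SharedMemory(name=shm.name)
first_byte = bytes(view.buf[:1])
print(first_byte)  # b'x'

view.close()
shm.close()
shm.unlink()
```

In Ray, `ray.put(obj)` places an object in the store once, and tasks on the same node read it through shared memory instead of each receiving a copy.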

Self-Hosting & Configuration

# Distributed actors
import ray

ray.init()  # start a local Ray runtime (or connect to an existing cluster)

@ray.remote
class Counter:
    def __init__(self):
        self.n = 0

    def add(self, x):
        self.n += x
        return self.n

c = Counter.remote()
# Method calls execute serially on the actor, so this prints the
# running totals: [0, 1, 3, 6, 10, 15, 21, 28, 36, 45]
print(ray.get([c.add.remote(i) for i in range(10)]))

# Distributed training with Ray Train (PyTorch DDP)
from ray.train.torch import TorchTrainer
from ray.train import ScalingConfig

def train_func():
    import torch.nn as nn
    model = nn.Linear(10, 1)
    # ... full DDP setup handled by Ray Train

trainer = TorchTrainer(train_func, scaling_config=ScalingConfig(num_workers=4, use_gpu=True))
trainer.fit()

# Online model serving with Ray Serve
from ray import serve

@serve.deployment(num_replicas=4)
class Model:
    def __call__(self, request):
        return {"result": "hello"}

# In Ray 2.x the HTTP route is set at run time, not in the decorator
serve.run(Model.bind(), route_prefix="/predict")

Key Features

  • Ray Core — distributed tasks/actors with @ray.remote
  • Ray Train — distributed training (PyTorch, TensorFlow, XGBoost)
  • Ray Tune — hyperparameter search at scale (ASHA, BOHB, HyperOpt)
  • Ray Serve — production model serving with autoscaling
  • Ray Data — distributed data preprocessing for ML pipelines
  • RLlib — reinforcement learning library at industrial scale
  • Autoscaler — managed cluster expansion on AWS/GCP/Azure/Kubernetes
  • Object store — zero-copy shared memory for fast intra-cluster transfers
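Ray Tune's ASHA scheduler (listed above) is built on successive halving: train many configurations briefly, keep the best fraction, and give the survivors a larger budget. The loop below is a stripped-down plain-Python sketch of that idea — the `score` function is a hypothetical stand-in for "validation metric after training with this budget", and nothing here is Tune's actual API:

```python
import random

def score(lr, epochs):
    # Hypothetical stand-in for a validation metric: learning rates
    # near 0.1 do better, and more epochs help uniformly.
    return epochs * (1.0 - abs(lr - 0.1))

def successive_halving(configs, budget=1, reduction=2, rounds=3):
    for _ in range(rounds):
        # Evaluate every surviving config at the current budget.
        ranked = sorted(configs, key=lambda lr: score(lr, budget), reverse=True)
        # Keep the top 1/reduction fraction ...
        configs = ranked[: max(1, len(ranked) // reduction)]
        # ... and let the survivors train `reduction` times longer.
        budget *= reduction
    return configs[0]

random.seed(0)
candidates = [random.uniform(0.001, 1.0) for _ in range(16)]
best = successive_halving(candidates)
print(best)  # the sampled learning rate closest to 0.1
```

ASHA adds asynchrony on top of this (promoting trials as soon as rungs fill, rather than waiting for a whole round), which is what lets Tune keep a large cluster busy.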

Comparison with Similar Tools

Feature             Ray                      Dask                   Spark          Modal            Celery
Python-native       Yes                      Yes                    Wrapper        Yes              Yes
ML libraries        Train/Tune/Serve/RLlib   dask-ml                MLlib          Custom           None
Online serving      Yes (Ray Serve)          No                     No             Yes              No
Stateful actors     Yes                      Limited                Limited        Limited          No
Cluster management  Built-in autoscaler      Limited                YARN/K8s       Managed          None
Best for            AI/ML at scale           Pythonic data science  Big data ETL   Serverless GPUs  Background jobs

FAQ

Q: Ray vs Dask? A: Dask is great for parallelizing pandas/NumPy workflows. Ray is broader: actors, distributed RL, online serving, training. For pure DataFrame work, Dask is simpler; for ML platforms, Ray is the standard.

Q: Ray vs Spark? A: Spark dominates traditional big-data ETL. Ray dominates ML training and serving. Many platforms run both: Spark for upstream data prep, Ray for downstream training.

Q: Do I need Anyscale? A: No — Ray is fully open source and runs on your own infrastructure (laptop, EC2, Kubernetes, KubeRay). Anyscale offers a managed service if you don't want to operate clusters.

Q: How does it scale? A: From single-machine multiprocessing to thousands of nodes with the same code. The autoscaler talks to AWS/GCP/Azure or KubeRay (Kubernetes operator) to add/remove workers based on demand.
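For the autoscaler path, a self-hosted cluster is typically described in a YAML file and launched with `ray up`. A minimal sketch for AWS — instance types, region, and limits here are placeholder values, and the full schema is in the Ray cluster-launcher docs:

```yaml
cluster_name: demo
max_workers: 10
provider:
  type: aws
  region: us-west-2
available_node_types:
  head_node:
    node_config:
      InstanceType: m5.large
    max_workers: 0        # head node runs no extra workers
  worker_node:
    node_config:
      InstanceType: m5.2xlarge
    min_workers: 0
    max_workers: 10       # autoscaler ceiling for this node type
head_node_type: head_node
```

`ray up cluster.yaml` starts the head node; the autoscaler then adds and removes workers between `min_workers` and `max_workers` as the task queue grows and drains.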
