Scripts · April 14, 2026 · 1 min read

Ray — Distributed Computing for Python and AI Workloads

Ray is a unified framework for scaling Python and AI applications. From distributed training and hyperparameter search to large-scale data processing and model serving — Ray powers the infrastructure behind ChatGPT, Uber, and Pinterest.


Introduction

Ray is the open-source distributed computing framework that powers OpenAI's training infrastructure, Uber's ML platform, and Pinterest's ranking systems. Created at UC Berkeley (RISELab) and developed by Anyscale, Ray scales Python from a single laptop to a multi-thousand-node cluster with the same code.

With over 42,000 GitHub stars, Ray is among the most general-purpose AI infrastructure projects: distributed Python tasks (Ray Core), training (Ray Train), tuning (Ray Tune), serving (Ray Serve), and data processing (Ray Data) — all on one unified runtime.

What Ray Does

Ray provides a @ray.remote decorator that turns Python functions and classes into distributed tasks/actors. The Ray runtime handles scheduling, fault tolerance, and inter-process communication. Higher-level libraries (Train, Tune, Serve, Data, RLlib) sit on top of this core for ML-specific workflows.
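Conceptually, `@ray.remote` turns a call into a future that you resolve later with `ray.get`. The sketch below mimics that call shape in plain standard-library Python — a thread pool on one machine, not Ray's API or its cross-machine scheduling — just to show the pattern Ray generalizes:

```python
from concurrent.futures import ThreadPoolExecutor, Future

# Toy stand-in for Ray's call shape: fn.remote(...) returns a
# future immediately, and get(...) blocks until results are ready.
# Ray schedules these across processes and machines; this sketch
# uses a local thread pool purely to illustrate the pattern.
_pool = ThreadPoolExecutor(max_workers=4)

def remote(fn):
    class Handle:
        @staticmethod
        def remote(*args, **kwargs):
            return _pool.submit(fn, *args, **kwargs)
    return Handle

def get(refs):
    if isinstance(refs, Future):
        return refs.result()
    return [r.result() for r in refs]

@remote
def square(x):
    return x * x

refs = [square.remote(i) for i in range(4)]  # returns immediately
print(get(refs))  # [0, 1, 4, 9]
```

With real Ray, the decorator and `get` come from the library (`import ray`), and the same two-step call shape dispatches work to any node in the cluster.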

Architecture Overview

[Driver Process]
  Your Python script
      |
[Ray Cluster Runtime]
  Head node + worker nodes
      |
  +---------+---------+---------+---------+
  |         |         |         |         |
Ray Core   Train     Tune      Serve     Data
 tasks /   distrib.  hyper-    online    distrib.
 actors    training  param     model     data
           (PyTorch, search    serving   processing
           Lightning, ...)
      |
[Object Store + Plasma]
  zero-copy shared memory between tasks
      |
[Autoscaler]
  add/remove EC2/GCE/Kubernetes nodes
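The object store in the diagram is what lets large objects pass between tasks without repeated serialization. A rough single-machine illustration of the zero-copy idea, using the standard library's shared memory (Ray's Plasma-based store is far more sophisticated; this only shows the concept of attaching to a shared segment rather than copying the payload):

```python
from multiprocessing import shared_memory

# A "producer" writes a payload into a named shared-memory segment.
payload = b"x" * 1_000_000
shm = shared_memory.SharedMemory(create=True, size=len(payload))
shm.buf[: len(payload)] = payload  # the one copy in

# A "consumer" attaches to the same segment by name -- attaching
# copies no data; reads go straight to the shared pages.
view = shared_memory.SharedMemory(name=shm.name)
first_byte = bytes(view.buf[:1])
print(first_byte)  # b'x'

view.close()
shm.close()
shm.unlink()
```

In Ray, `ray.put(obj)` places an object in the store once, and tasks on the same node read it through shared memory instead of each receiving a copy.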

Self-Hosting & Configuration

# Distributed actors
import ray

ray.init()  # start a local Ray runtime (or connect to an existing cluster)

@ray.remote
class Counter:
    def __init__(self):
        self.n = 0

    def add(self, x):
        self.n += x
        return self.n

c = Counter.remote()
# Method calls execute serially on the actor, so this prints the
# running totals: [0, 1, 3, 6, 10, 15, 21, 28, 36, 45]
print(ray.get([c.add.remote(i) for i in range(10)]))

# Distributed training with Ray Train (PyTorch DDP)
from ray.train.torch import TorchTrainer
from ray.train import ScalingConfig

def train_func():
    import torch.nn as nn
    model = nn.Linear(10, 1)
    # ... full DDP setup handled by Ray Train

trainer = TorchTrainer(train_func, scaling_config=ScalingConfig(num_workers=4, use_gpu=True))
trainer.fit()

# Online model serving with Ray Serve
from ray import serve

@serve.deployment(num_replicas=4)
class Model:
    def __call__(self, request):
        return {"result": "hello"}

# In Ray 2.x the HTTP route is set at run time, not in the decorator
serve.run(Model.bind(), route_prefix="/predict")

Key Features

  • Ray Core — distributed tasks/actors with @ray.remote
  • Ray Train — distributed training (PyTorch, TensorFlow, XGBoost)
  • Ray Tune — hyperparameter search at scale (ASHA, BOHB, HyperOpt)
  • Ray Serve — production model serving with autoscaling
  • Ray Data — distributed data preprocessing for ML pipelines
  • RLlib — reinforcement learning library at industrial scale
  • Autoscaler — managed cluster expansion on AWS/GCP/Azure/Kubernetes
  • Object store — zero-copy shared memory for fast intra-cluster transfers
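Ray Tune's ASHA scheduler (listed above) is built on successive halving: train many configurations briefly, keep the best fraction, and give the survivors a larger budget. The loop below is a stripped-down plain-Python sketch of that idea — the `score` function is a hypothetical stand-in for "validation metric after training with this budget", and nothing here is Tune's actual API:

```python
import random

def score(lr, epochs):
    # Hypothetical stand-in for a validation metric: learning rates
    # near 0.1 do better, and more epochs help uniformly.
    return epochs * (1.0 - abs(lr - 0.1))

def successive_halving(configs, budget=1, reduction=2, rounds=3):
    for _ in range(rounds):
        # Evaluate every surviving config at the current budget.
        ranked = sorted(configs, key=lambda lr: score(lr, budget), reverse=True)
        # Keep the top 1/reduction fraction ...
        configs = ranked[: max(1, len(ranked) // reduction)]
        # ... and let the survivors train `reduction` times longer.
        budget *= reduction
    return configs[0]

random.seed(0)
candidates = [random.uniform(0.001, 1.0) for _ in range(16)]
best = successive_halving(candidates)
print(best)  # the sampled learning rate closest to 0.1
```

ASHA adds asynchrony on top of this (promoting trials as soon as rungs fill, rather than waiting for a whole round), which is what lets Tune keep a large cluster busy.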

Comparison with Similar Tools

Feature             Ray                      Dask                   Spark          Modal            Celery
Python-native       Yes                      Yes                    Wrapper        Yes              Yes
ML libraries        Train/Tune/Serve/RLlib   dask-ml                MLlib          Custom           None
Online serving      Yes (Ray Serve)          No                     No             Yes              No
Stateful actors     Yes                      Limited                Limited        Limited          No
Cluster management  Built-in autoscaler      Limited                YARN/K8s       Managed          None
Best for            AI/ML at scale           Pythonic data science  Big data ETL   Serverless GPUs  Background jobs

FAQ

Q: Ray vs Dask? A: Dask is great for parallelizing pandas/NumPy workflows. Ray is broader: actors, distributed RL, online serving, training. For pure DataFrame work, Dask is simpler; for ML platforms, Ray is the standard.

Q: Ray vs Spark? A: Spark dominates traditional big-data ETL. Ray dominates ML training and serving. Many platforms run both: Spark for upstream data prep, Ray for downstream training.

Q: Do I need Anyscale? A: No — Ray is fully open source and runs on your own infrastructure (laptop, EC2, Kubernetes, KubeRay). Anyscale offers a managed service if you don't want to operate clusters.

Q: How does it scale? A: From single-machine multiprocessing to thousands of nodes with the same code. The autoscaler talks to AWS/GCP/Azure or KubeRay (Kubernetes operator) to add/remove workers based on demand.
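For the autoscaler path, a self-hosted cluster is typically described in a YAML file and launched with `ray up`. A minimal sketch for AWS — instance types, region, and limits here are placeholder values, and the full schema is in the Ray cluster-launcher docs:

```yaml
cluster_name: demo
max_workers: 10
provider:
  type: aws
  region: us-west-2
available_node_types:
  head_node:
    node_config:
      InstanceType: m5.large
    max_workers: 0        # head node runs no extra workers
  worker_node:
    node_config:
      InstanceType: m5.2xlarge
    min_workers: 0
    max_workers: 10       # autoscaler ceiling for this node type
head_node_type: head_node
```

`ray up cluster.yaml` starts the head node; the autoscaler then adds and removes workers between `min_workers` and `max_workers` as the task queue grows and drains.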
