Ray
Ray is an open source distributed computing framework for scaling Python and AI applications from a local program to a cluster of workers, with libraries for data processing, training, tuning, serving, and reinforcement learning.
Definition
Ray is an open source framework for scaling AI and Python applications. The Ray documentation describes it as a unified framework that provides a compute layer for parallel processing, while the upstream repository frames Ray as a way to scale Python and AI applications from a laptop to a cluster. The original Ray paper, "Ray: A Distributed Framework for Emerging AI Applications," introduced Ray as a distributed system for AI workloads that mix task-parallel and actor-based computation.
Ray matters because modern AI work often outgrows a single Python process long before it justifies a purpose-built platform. Data preprocessing, embedding generation, evaluation sweeps, simulation, reinforcement learning, fine-tuning, batch inference, and model-serving backends can all require distributed execution while keeping a developer-facing Python interface.
How It Works
Ray Core supplies the lower-level programming model. Its key concepts documentation describes three primitives: tasks, actors, and objects. Tasks are functions executed asynchronously on remote workers. Actors extend the API from functions to stateful classes: each actor runs in its own worker process, and its methods are scheduled on that worker, where they can read and mutate the actor's state. Objects are remote values referenced by object refs and stored in Ray's distributed shared-memory object store.
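As a minimal sketch of tasks, actors, and objects, assuming Ray is installed and a local Ray instance can be started, the snippet below wraps ordinary Python code with `ray.remote` at call time. The names `square`, `total_of`, `Counter`, and `run_on_ray` are hypothetical and exist only for illustration.

```python
def square(x):
    """Plain function; becomes a Ray task once wrapped with ray.remote."""
    return x * x


def total_of(values):
    """Plain function that sums an iterable of numbers."""
    return sum(values)


class Counter:
    """Plain class; becomes a Ray actor once wrapped with ray.remote."""

    def __init__(self):
        self.value = 0

    def increment(self, amount=1):
        self.value += amount
        return self.value


def run_on_ray():
    """Illustrative driver. Requires Ray; imported lazily on purpose."""
    import ray

    ray.init(ignore_reinit_error=True)

    # Tasks: each .remote() call returns an object ref immediately.
    remote_square = ray.remote(square)
    squares = ray.get([remote_square.remote(i) for i in range(4)])

    # Actor: one worker process holds the Counter state between calls.
    counter = ray.remote(Counter).remote()
    running_totals = ray.get([counter.increment.remote(s) for s in squares])

    # Object: put a value in the object store and pass its ref to a task.
    values_ref = ray.put(list(range(1000)))
    total = ray.get(ray.remote(total_of).remote(values_ref))

    return squares, running_totals, total
```

Keeping the function and class plain and wrapping them inside `run_on_ray` is a choice made for this sketch so the logic can be tested without a cluster; applying `@ray.remote` directly as a decorator is the more common style.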
A Ray cluster has a head node and worker nodes. The cluster overview says a cluster is a set of worker nodes connected to a common head node, and that clusters can be fixed-size or autoscale according to application resource demand. The cluster key concepts page says the head node runs management processes such as the autoscaler and the Global Control Service (GCS, called the Global Control Store in the original Ray paper), while worker nodes run user code in tasks and actors.
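One way to see the head-and-worker layout from a driver is to list the nodes Ray knows about. The sketch below is hedged: it assumes a cluster is already running and reachable with `ray.init(address="auto")`, and it assumes each entry returned by `ray.nodes()` is a dictionary with `Alive` and `Resources` keys, which should be checked against the installed Ray version. The helper names `summarize_nodes` and `inspect_cluster` are hypothetical.

```python
def summarize_nodes(nodes):
    """Summarize node records shaped like the dictionaries from ray.nodes().

    Assumes each record may carry an "Alive" flag and a "Resources" mapping.
    """
    alive = [node for node in nodes if node.get("Alive")]
    totals = {}
    for node in alive:
        for name, amount in node.get("Resources", {}).items():
            totals[name] = totals.get(name, 0) + amount
    return {
        "alive_nodes": len(alive),
        "dead_nodes": len(nodes) - len(alive),
        "resources": totals,
    }


def inspect_cluster():
    """Connect to an existing cluster and summarize it. Requires Ray."""
    import ray

    ray.init(address="auto")  # attach to a running cluster, do not start one
    return summarize_nodes(ray.nodes())
```

A summary like this can be attached to a job record so that reviewers can see how many nodes and which resources were visible when the work was launched.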
Ray also provides higher-level libraries. Ray Data is documented as a scalable data processing library for AI workloads such as batch inference, preprocessing, and data loading. Ray Train is documented as a library for distributed training and fine-tuning. Ray Serve is documented as a scalable model-serving library for online inference APIs, including multi-model services and LLM-oriented serving features.
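To give a feel for the higher-level libraries, here is a small Ray Data sketch for a preprocessing step. It assumes Ray is installed with the Data extras, and it covers only Ray Data, not Ray Train or Ray Serve. The names `add_length` and `preprocess_with_ray_data` are hypothetical.

```python
def add_length(row):
    """Row-level transform: copy the row and add the length of its text."""
    return {**row, "length": len(row["text"])}


def preprocess_with_ray_data(texts):
    """Apply add_length to every row through Ray Data. Requires Ray Data."""
    import ray.data

    dataset = ray.data.from_items([{"text": text} for text in texts])
    # map() applies the function to each row; Ray decides where the
    # underlying work runs.
    return dataset.map(add_length).take_all()
```

The transform itself is an ordinary function over a dictionary, so the same code can be exercised locally before it is run over a larger dataset.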
Agent Context
Agent systems create bursty parallel work. A single user request may fan out into browser tasks, retrieval jobs, code execution, simulations, unit tests, scoring calls, and safety checks. Ray can make that fan-out look like Python code while the runtime schedules work across a cluster.
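The fan-out described above can be sketched as follows, assuming Ray is installed. The subtask kinds, the `request_id` field, and the function names are all hypothetical, and `run_subtask` is a stand-in for real browser, retrieval, or scoring work.

```python
def plan_subtasks(request_id, query):
    """Expand one user request into several independent subtasks."""
    kinds = ["retrieval", "code_execution", "safety_check"]
    return [
        {"request_id": request_id, "kind": kind, "payload": query}
        for kind in kinds
    ]


def run_subtask(subtask):
    """Stand-in for real work; echoes identifying fields with a status."""
    return {
        "request_id": subtask["request_id"],
        "kind": subtask["kind"],
        "status": "done",
    }


def fan_out_on_ray(subtasks):
    """Submit every subtask as a Ray task and collect results as they finish."""
    import ray

    ray.init(ignore_reinit_error=True)
    remote_run = ray.remote(run_subtask)
    pending = [remote_run.remote(subtask) for subtask in subtasks]

    results = []
    while pending:
        # ray.wait returns the refs that are ready and the refs still pending.
        ready, pending = ray.wait(pending, num_returns=1)
        results.extend(ray.get(ready))
    return results
```

Carrying the `request_id` through every subtask is a deliberate choice in this sketch: it keeps each unit of distributed work traceable to the request that caused it.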
That convenience changes the risk surface. A decorated function or actor can become a distributed job touching secrets, datasets, GPUs, vector stores, APIs, and logs. The governance question is not whether Ray is a model or an agent. It is whether distributed execution hides who launched work, what data moved, what endpoints were exposed, what resources were consumed, and which artifacts survived after cleanup.
Governance Use
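A governance record for Ray should preserve enough to reconstruct what ran, where, and for whom. For the software environment, that means the Ray version, Python environment, dependency lockfiles, container image, and runtime environment. For the cluster, it means the cluster configuration, head and worker resource settings, autoscaling rules, object store assumptions, dashboard and Jobs API exposure, and network policy. For each job, it means the entrypoint, references to the secrets used (not the secret values), input data paths, output locations, logs, and metrics. It should also record ownership and the cleanup policy.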
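A record like this can be kept as structured data rather than free text. The sketch below uses only the Python standard library; the class name `RayGovernanceRecord`, its field names, and the example values are hypothetical and cover only a subset of the items listed above.

```python
import json
import platform
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class RayGovernanceRecord:
    """Illustrative subset of a governance record for one Ray job."""

    owner: str
    ray_version: str
    container_image: str
    cluster_config_path: str
    job_entrypoint: str
    input_paths: List[str] = field(default_factory=list)
    output_paths: List[str] = field(default_factory=list)
    # Names of secrets only; secret values should never be stored here.
    secret_names: List[str] = field(default_factory=list)
    dashboard_exposed: bool = False
    cleanup_policy: str = "unspecified"
    python_version: str = field(default_factory=platform.python_version)

    def to_json(self):
        """Serialize with sorted keys so records diff cleanly."""
        return json.dumps(asdict(self), sort_keys=True, indent=2)


# Placeholder values for illustration only.
example_record = RayGovernanceRecord(
    owner="example-team",
    ray_version="<ray version>",
    container_image="<image reference>",
    cluster_config_path="<path to cluster config>",
    job_entrypoint="python job.py",
    input_paths=["<input path>"],
    output_paths=["<output path>"],
    secret_names=["<secret name>"],
)
```

Calling `example_record.to_json()` yields a JSON document that can be stored next to job logs or attached to a change request.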
Ray should also be tied to the surrounding platform. On Kubernetes, KubeRay turns Ray clusters, jobs, and services into Kubernetes custom resources. For AI compute governance, that means Ray manifests, queue policy, GPU allocation, service accounts, admission controls, and cost reporting should be reviewed together rather than as separate layers.
Limits
Ray is not, by itself, a model-governance process, data catalog, scheduling policy, access-control system, or safety evaluator. It helps execute distributed Python and AI workloads; it does not decide whether a dataset is lawful, whether a model should be deployed, whether an agent tool call is safe, or whether a cluster should receive scarce accelerators.
It also does not erase distributed-systems failure modes. Operators still need to plan for retries, idempotency, backpressure, object-store pressure, autoscaler lag, noisy neighbors, log retention, endpoint exposure, job cleanup, and provenance gaps. The easier it is to scatter computation, the more important it is to keep an account of where it went.
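Retries and idempotency interact: if a task can be retried, running it twice should not corrupt its output. The sketch below shows one hedged pattern, assuming outputs go to a directory that every node can reach (for example shared storage). The output path is derived from a hash of the input, so a retried task finds the finished file and skips the work. The function names and the `values` payload field are hypothetical; `max_retries` and `retry_exceptions` are Ray task options.

```python
import hashlib
import json
import os
import tempfile


def output_path(out_dir, payload):
    """Derive a deterministic file path from the payload contents."""
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return os.path.join(out_dir, hashlib.sha256(encoded).hexdigest() + ".json")


def process_once(out_dir, payload):
    """Idempotent unit of work. Returns (path, did_work)."""
    path = output_path(out_dir, payload)
    if os.path.exists(path):
        return path, False  # an earlier attempt already finished

    result = {"input": payload, "total": sum(payload["values"])}

    # Write to a temporary file, then rename, so a crash cannot leave a
    # half-written file at the final path.
    fd, tmp_path = tempfile.mkstemp(dir=out_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(result, handle)
    os.replace(tmp_path, path)
    return path, True


def submit_with_retries(out_dir, payloads):
    """Run process_once as retryable Ray tasks. Requires Ray."""
    import ray

    ray.init(ignore_reinit_error=True)
    task = ray.remote(process_once).options(max_retries=3, retry_exceptions=True)
    return ray.get([task.remote(out_dir, payload) for payload in payloads])
```

This covers only repeated execution of the same input; backpressure, object-store pressure, and cleanup still need their own handling.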
Source Discipline
Claims about Ray Core, tasks, actors, objects, clusters, autoscaling, and Ray libraries should cite Ray's documentation or upstream repository. Claims about historical design should cite the Ray paper. Claims about Kubernetes deployment should cite KubeRay or a cloud provider's Ray operator documentation, not only generic Ray material.
Spiralist Reading
Spiralism reads Ray as a liturgy of distributed intent.
The developer writes a function. The cluster turns it into many acts. Between those two facts lies the institutional question: whether the system still remembers the difference between a local experiment and a distributed claim on shared compute.
Related Pages
- KubeRay
- Distributed AI Training
- Volcano Scheduler
- Kubernetes Kueue
- Kubernetes JobSet
- Kubernetes Dynamic Resource Allocation
- Kubernetes Device Plugins
- vLLM
- PyTorch
- TensorFlow
- AI Compute
- Compute Governance
Sources
- Ray, Overview, reviewed June 25, 2026.
- Ray Project, Ray upstream repository, reviewed June 25, 2026.
- Ray, Ray Core key concepts, reviewed June 25, 2026.
- Ray, Actors, reviewed June 25, 2026.
- Ray, Ray Clusters Overview, reviewed June 25, 2026.
- Ray, Cluster key concepts, reviewed June 25, 2026.
- Ray, Ray Data, reviewed June 25, 2026.
- Ray, Ray Train, reviewed June 25, 2026.
- Ray, Ray Serve, reviewed June 25, 2026.
- Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica, Ray: A Distributed Framework for Emerging AI Applications, arXiv:1712.05889, reviewed June 25, 2026.