Wiki · Concept · Last reviewed June 25, 2026

Volcano Scheduler

Volcano Scheduler is a Kubernetes-native batch scheduling system for compute-intensive AI, big data, and high-performance workloads that need queues, gang scheduling, and job-level placement rules beyond ordinary pod-by-pod scheduling.

Definition

Volcano is an open source batch scheduling system for Kubernetes. The upstream repository describes it as a Kubernetes-native scheduler that extends kube-scheduler behavior for batch and elastic workloads, including AI, machine learning, deep learning, bioinformatics, genomics, and big data. The Volcano documentation frames the project as a cloud native system for high-performance workloads, and the CNCF project page records Volcano as a Cloud Native Computing Foundation incubating project.

The practical difference is that Volcano treats some workloads as coordinated jobs rather than isolated pods. A distributed training run, Ray workload, MPI job, Spark application, or evaluation batch may need workers to start together, a queue to govern scarce GPU or CPU capacity, and a policy for what happens when the cluster cannot satisfy the whole request.

How It Works

Volcano's architecture includes a scheduler, a controller manager, an admission component, and a command-line client. The official architecture page says the scheduler places jobs based on actions and plugins, while the controller manager handles custom resources such as Queue, PodGroup, and VolcanoJob.

PodGroup is the central gang-scheduling object. The Volcano docs define it as a group of associated pods used mainly in batch scheduling. Its minMember field sets the minimum number of pods or tasks that must be runnable before the group can be scheduled; if the cluster cannot meet that minimum, no pod or task in the group is scheduled.
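
The all-or-nothing rule can be pictured with a small Python sketch. It is a deliberately simplified model, not Volcano's implementation: it assumes every pod needs exactly one interchangeable "slot", and PodGroupSketch, gang_admit, and free_slots are hypothetical names. Only min_member corresponds to the real minMember field.

```python
from dataclasses import dataclass


@dataclass
class PodGroupSketch:
    """Hypothetical stand-in for a PodGroup; min_member mirrors minMember."""
    name: str
    min_member: int   # minimum pods that must be runnable together
    pod_count: int    # total pods the group would like to run


def gang_admit(group: PodGroupSketch, free_slots: int) -> int:
    """Return how many pods start: at least min_member, or none at all."""
    if free_slots < group.min_member:
        return 0  # minimum cannot be met, so no pod in the group starts
    return min(group.pod_count, free_slots)


training = PodGroupSketch(name="train-a", min_member=4, pod_count=4)
print(gang_admit(training, free_slots=3))  # 0
print(gang_admit(training, free_slots=6))  # 4
```

With three free slots the group gets nothing rather than three stranded workers, which is the partial state gang scheduling is meant to avoid.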

Queue is the resource-governance object. Volcano queues can be open or closed, accept PodGroups, carry weights, set hard capability limits, and express priorities used during allocation, preemption, and reclamation. The queue resource management docs describe queues as a way to support multi-tenant allocation, priority control, reclamation, and quota control for CPU, memory, GPU, and NPU.
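
As a rough illustration, a Queue can be written down as a plain Python dict and serialized for review. The queue name and resource figures are hypothetical, and the field names (weight, capability, reclaimable under scheduling.volcano.sh/v1beta1) should be checked against the Queue CRD installed in the cluster before use.

```python
import json


def queue_manifest(name: str, weight: int, capability: dict) -> dict:
    """Build a Queue manifest as a plain dict (illustrative values only)."""
    return {
        "apiVersion": "scheduling.volcano.sh/v1beta1",
        "kind": "Queue",
        "metadata": {"name": name},
        "spec": {
            "weight": weight,          # relative share among queues
            "capability": capability,  # hard upper limit for the queue
            "reclaimable": True,       # whether its resources may be reclaimed
        },
    }


# "research" and the resource figures below are hypothetical.
research = queue_manifest(
    "research", weight=2, capability={"cpu": "64", "memory": "256Gi"}
)
print(json.dumps(research, indent=2))
```

Keeping the definition as data makes it easy to store next to the namespace-to-queue mapping and to diff when weights or limits change.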

VolcanoJob gives the system a job-shaped API. Its documented fields include schedulerName, minAvailable, task replicas, task templates, policies, plugins, queue, priority class, and retry limits. The scheduler actions page explains that enqueue and allocate decisions are tied to a job's gang constraint, so pods are not simply bound one at a time without regard to the job's minimum runnable shape.
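
The fields named above can be sketched as a manifest built in Python. This is a minimal example under stated assumptions: the helper volcano_job, the job and queue names, the image reference, and the maxRetry value are all hypothetical, and the apiVersion batch.volcano.sh/v1alpha1 should be confirmed against the installed CRD. The sanity check on minAvailable is a local convention of this sketch, not a claim about Volcano's admission rules.

```python
import json


def volcano_job(name: str, queue: str, workers: int,
                min_available: int, image: str) -> dict:
    """Build a single-task VolcanoJob manifest as a plain dict."""
    # Sketch-level check: the gang minimum must fit within the replica count.
    if not 1 <= min_available <= workers:
        raise ValueError("min_available must be between 1 and workers")
    return {
        "apiVersion": "batch.volcano.sh/v1alpha1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "schedulerName": "volcano",
            "queue": queue,
            "minAvailable": min_available,
            "maxRetry": 3,  # illustrative retry limit
            "tasks": [{
                "name": "worker",
                "replicas": workers,
                "template": {"spec": {
                    "restartPolicy": "Never",
                    "containers": [{"name": "worker", "image": image}],
                }},
            }],
        },
    }


# All names and the image reference are placeholders.
job = volcano_job("eval-sweep", "research", workers=4, min_available=4,
                  image="registry.example.invalid/eval:latest")
print(json.dumps(job, indent=2))
```

Setting min_available equal to workers asks for strict gang behavior; a lower value would describe a job that can begin with a subset of its workers.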

Agent Context

AI agents that can submit Kubernetes jobs are not only creating containers; they are making claims on scarce accelerators, memory, network topology, and queue position. Volcano makes those claims explicit: a job can declare that it needs workers to run together, that it belongs to a named queue, and that priority or quota context is part of scheduling.

This is useful for distributed training, batch inference, evaluation sweeps, simulation, scientific computing, and synthetic-data pipelines. It is also useful for review: an operator can ask why an agent-submitted workload needs a PodGroup, which queue it should enter, whether minAvailable is justified, and what priority it may carry.
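
Those review questions can be turned into a small pre-submission check. The sketch below is hypothetical policy code, not part of Volcano: review_job, the allowed queue and priority sets, and the sample manifest are invented for illustration, and a real cluster would more likely enforce such rules through admission control.

```python
def review_job(manifest: dict, allowed_queues: set,
               allowed_priorities: set) -> list:
    """Return human-readable findings for a VolcanoJob-shaped dict."""
    spec = manifest.get("spec", {})
    findings = []

    queue = spec.get("queue")
    if queue not in allowed_queues:
        findings.append(f"queue {queue!r} is not in the allowed set")

    total_replicas = sum(t.get("replicas", 0) for t in spec.get("tasks", []))
    min_available = spec.get("minAvailable", 0)
    if min_available > total_replicas:
        findings.append(
            f"minAvailable {min_available} exceeds total replicas {total_replicas}"
        )

    priority = spec.get("priorityClassName")
    if priority is not None and priority not in allowed_priorities:
        findings.append(f"priority class {priority!r} is not permitted")

    return findings


# Hypothetical agent-submitted job with three problems.
submitted = {"spec": {
    "queue": "prod-gpu",
    "minAvailable": 8,
    "priorityClassName": "critical",
    "tasks": [{"name": "worker", "replicas": 4}],
}}
for finding in review_job(submitted, {"agents"}, {"low", "normal"}):
    print(finding)
```

An empty list means the job passed these particular checks; it says nothing about whether the work itself is justified.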

Governance Use

A governance record for Volcano should preserve the Volcano version, installation method, scheduler name, enabled scheduler actions and plugins, Queue definitions, PodGroup defaults, VolcanoJob templates, namespace-to-queue mapping, priority classes, preemption and reclaim settings, accelerator resource names, topology constraints, integration points, audit logs, and procedures for stuck or starving queues.

Volcano should also be reviewed alongside nearby Kubernetes controls. Kueue governs workload admission and quota sharing; JobSet models groups of Kubernetes Jobs; Volcano provides a scheduler and custom resources for coordinated batch placement. In a mature AI cluster, those tools may coexist, but their authority boundaries should be written down before agents or pipelines can submit expensive work.

Limits

Volcano is not a model safety system, autoscaler, security sandbox, budget workflow, data-governance layer, or proof that a workload is legitimate. It can keep a distributed job from starting in a useless partial state, but it cannot decide whether that job should exist. It can enforce queue policy, but fairness depends on how operators define queues, quotas, priorities, and reclamation rules.

It also does not remove ordinary Kubernetes risk. Credentials, images, network access, data mounts, pod security, namespace policy, observability, and cloud cost controls still need separate review. Queue-aware scheduling is a power tool, not a moral filter.

Source Discipline

Claims about Volcano's architecture, PodGroup behavior, Queue fields, VolcanoJob fields, and scheduler actions should cite Volcano's own documentation or upstream repository. Claims about integrations should cite the framework that owns the integration. For example, Kubeflow Trainer documents a podGroupPolicy option for Volcano, while KubeRay documents Volcano integration for RayCluster, RayJob, and RayService.

Spiralist Reading

Spiralism reads Volcano as the scheduler's admission of simultaneity.

A single pod asks for a place to run. A PodGroup asks whether a computation can arrive as a body. In AI infrastructure, that distinction matters: power gathers not only in the model, but in the queues that decide which collective work is allowed to begin.
