Wiki · Concept · Last reviewed June 25, 2026

Kubernetes Vertical Pod Autoscaler

Kubernetes Vertical Pod Autoscaler is an add-on autoscaling system that recommends and, depending on policy, applies CPU and memory request changes to pods so workloads can be right-sized from observed use.

Category: Concept · Published: June 25, 2026 · Modified: June 25, 2026 · Last reviewed: June 25, 2026 · Tags: Kubernetes, Vertical Pod Autoscaler, resource requests, AI infrastructure, governance

Definition

Kubernetes Vertical Pod Autoscaler, or VPA, is a system for adjusting pod resource requests over time. Kubernetes documentation defines vertical scaling as assigning more resources, such as CPU or memory, to pods that are already running for a workload. This differs from horizontal scaling, where Kubernetes deploys more pods to distribute demand.

The VPA is implemented as a Kubernetes API resource and a set of controllers, but it is not part of the core Kubernetes API in the way HorizontalPodAutoscaler is. Kubernetes documentation says the VerticalPodAutoscaler resource is defined by a Custom Resource Definition, with stable API version autoscaling.k8s.io/v1, and that VPA must be installed separately.
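As a minimal sketch of what such a custom resource looks like, the Python below builds a VerticalPodAutoscaler object as a plain dict and prints it as JSON, which kubectl accepts alongside YAML. The helper name build_vpa_manifest and the workload name retrieval-worker are hypothetical; the field names (targetRef, updatePolicy.updateMode) follow the autoscaling.k8s.io/v1 schema as commonly documented and should be checked against the installed CRD.

```python
import json


def build_vpa_manifest(name: str, target_kind: str, target_name: str,
                       target_api_version: str = "apps/v1",
                       update_mode: str = "Off") -> dict:
    """Return a VerticalPodAutoscaler object as a plain dict (illustrative helper)."""
    return {
        # CRD group and version, not a core Kubernetes API group.
        "apiVersion": "autoscaling.k8s.io/v1",
        "kind": "VerticalPodAutoscaler",
        "metadata": {"name": name},
        "spec": {
            # The workload whose pods the recommender should observe.
            "targetRef": {
                "apiVersion": target_api_version,
                "kind": target_kind,
                "name": target_name,
            },
            # "Off" records recommendations without applying them.
            "updatePolicy": {"updateMode": update_mode},
        },
    }


if __name__ == "__main__":
    # Hypothetical workload name, used only for illustration.
    manifest = build_vpa_manifest("retrieval-worker-vpa", "Deployment", "retrieval-worker")
    print(json.dumps(manifest, indent=2))
```

Applying the printed object only works on a cluster where the VPA CRD and components have already been installed.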

How It Works

The VPA has three cooperating components. The recommender analyzes current and historical CPU and memory usage and writes recommendations into the VPA resource status. The updater compares running pods with those recommendations and, if policy allows, either evicts pods so controllers recreate them with new requests or performs in-place updates when supported. The admission controller is a mutating webhook that applies recommendations to new or recreated pods.

Kubernetes documentation says the recommender uses the targetRef to find the target workload, selects pods through that workload's selector labels, reads metrics through the resource metrics API, and stores target, lower-bound, and upper-bound recommendations in .status.recommendation. The VPA therefore depends on a metrics source such as Metrics Server, which is not deployed by default in many clusters.
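The recommendation fields described above can be read with a few lines of Python. This is a sketch under stated assumptions: the VPA object has already been fetched as a dict (for example from kubectl output), the per-container entries live under containerRecommendations with lowerBound, target, and upperBound keys, and parse_quantity handles only a simplified subset of Kubernetes quantity notation. The sample values are invented for illustration.

```python
from decimal import Decimal

# Multipliers for a simplified subset of Kubernetes quantity suffixes.
_SUFFIXES = {
    "Ki": Decimal(2) ** 10, "Mi": Decimal(2) ** 20,
    "Gi": Decimal(2) ** 30, "Ti": Decimal(2) ** 40,
    "m": Decimal("0.001"),
    "k": Decimal(10) ** 3, "M": Decimal(10) ** 6,
    "G": Decimal(10) ** 9, "T": Decimal(10) ** 12,
}


def parse_quantity(text: str) -> Decimal:
    """Convert a quantity such as '250m' or '512Mi' to a base-unit Decimal."""
    # Try two-character suffixes before one-character ones.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * _SUFFIXES[suffix]
    return Decimal(text)


def summarize_recommendation(vpa: dict) -> dict:
    """Map container name -> bound name -> resource -> parsed value."""
    status = vpa.get("status", {})
    recs = status.get("recommendation", {}).get("containerRecommendations", [])
    summary = {}
    for rec in recs:
        summary[rec["containerName"]] = {
            bound: {res: parse_quantity(q) for res, q in rec.get(bound, {}).items()}
            for bound in ("lowerBound", "target", "upperBound")
        }
    return summary


if __name__ == "__main__":
    # Invented sample status, shaped like .status.recommendation.
    sample = {"status": {"recommendation": {"containerRecommendations": [{
        "containerName": "worker",
        "lowerBound": {"cpu": "100m", "memory": "256Mi"},
        "target": {"cpu": "250m", "memory": "512Mi"},
        "upperBound": {"cpu": "1", "memory": "1Gi"},
    }]}}}
    for container, bounds in summarize_recommendation(sample).items():
        print(container, bounds["target"])
```

CPU values come out in cores and memory values in bytes, so the three bounds can be compared directly against the requests in a manifest.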

The update mode determines the disruption surface. Off records recommendations without applying them. Initial applies recommendations only when pods are first created. Recreate evicts pods whose requests differ significantly from recommendations so replacements can be admitted with new requests. InPlaceOrRecreate attempts an in-place update and falls back to eviction when needed. InPlace is documented as an alpha mode in VPA 1.7.0 that requires Kubernetes 1.33 or later plus feature gates. The older Auto mode has been deprecated since VPA 1.4.0 and is currently an alias for Recreate.
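For review tooling, the modes above can be encoded as a small lookup table. The sketch below is an illustration of the descriptions in this section, not VPA's own code: the ModeBehavior type and behavior_for function are hypothetical names, and InPlace is deliberately left out because the text above states its requirements but not its behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModeBehavior:
    applies_at_admission: bool   # recommendations set on new or recreated pods
    may_evict: bool              # updater may evict running pods
    may_resize_in_place: bool    # updater may attempt an in-place update


# Behavior per mode, as described in the paragraph above.
_BEHAVIOR = {
    "Off": ModeBehavior(False, False, False),
    "Initial": ModeBehavior(True, False, False),
    "Recreate": ModeBehavior(True, True, False),
    "InPlaceOrRecreate": ModeBehavior(True, True, True),
}

# Deprecated names that resolve to a current mode.
_ALIASES = {"Auto": "Recreate"}


def behavior_for(mode: str) -> ModeBehavior:
    """Look up the disruption profile for an update mode name."""
    canonical = _ALIASES.get(mode, mode)
    try:
        return _BEHAVIOR[canonical]
    except KeyError:
        raise ValueError(f"update mode not covered by this sketch: {mode!r}") from None


if __name__ == "__main__":
    for mode in ("Off", "Initial", "Recreate", "InPlaceOrRecreate", "Auto"):
        print(mode, behavior_for(mode))
```

A reviewer could use may_evict as the trigger for requiring a PodDisruptionBudget check before a mode change is approved.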

Agent Context

AI workloads often have resource profiles that are hard to size by hand. A model-serving sidecar may need more memory after a tokenizer change. A retrieval worker may keep a CPU request sized for an index rebuild long after the rebuild finishes. An evaluation job may show peak memory only during a narrow phase. VPA makes those patterns visible as recommended requests rather than leaving them as tribal knowledge inside deployment manifests.

For agent infrastructure, VPA is especially relevant where many workloads are generated by automation. A deployment bot that copies old requests into every Job or Deployment can create waste, pending pods, or evictions elsewhere in the cluster. VPA recommendations can become a feedback loop from observed runtime behavior back into admission, scheduling, quota, and cost review.

Governance Use

A governance record should preserve the VPA version, installation source, target workload, update mode, resource policies, min and max allowed values, controlled resources, recommendation history, metrics source, PodDisruptionBudgets, namespace quotas, and observed evictions or in-place resizes. If VPA affects model-serving or agent workloads, the record should also identify the workload owner and the service-level impact of request changes.
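One way to make such a record checkable is a small typed structure. The sketch below is hypothetical throughout: the class name, field names, and the choice of which fields count as required are assumptions made for illustration, not part of VPA or of any Kubernetes API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VpaGovernanceRecord:
    # Scalar facts about the installation and the target.
    vpa_version: str = ""
    installation_source: str = ""
    target_workload: str = ""
    update_mode: str = ""
    metrics_source: str = ""
    workload_owner: str = ""
    # Resource policy as configured on the VPA object.
    min_allowed: dict = field(default_factory=dict)
    max_allowed: dict = field(default_factory=dict)
    controlled_resources: list = field(default_factory=list)
    # Surrounding controls and observed history.
    pod_disruption_budgets: list = field(default_factory=list)
    namespace_quotas: list = field(default_factory=list)
    recommendation_history: list = field(default_factory=list)
    observed_disruptions: list = field(default_factory=list)  # evictions or in-place resizes


# Assumption: these fields must be non-empty for a record to be reviewable.
REQUIRED_FIELDS = (
    "vpa_version",
    "installation_source",
    "target_workload",
    "update_mode",
    "metrics_source",
)


def missing_fields(record: VpaGovernanceRecord) -> List[str]:
    """Return the required fields that are still empty, in declaration order."""
    return [name for name in REQUIRED_FIELDS if not getattr(record, name)]


if __name__ == "__main__":
    draft = VpaGovernanceRecord(vpa_version="1.4.0", update_mode="Off")
    print(missing_fields(draft))
```

Keeping the required list separate from the dataclass lets a team tighten or relax it without changing stored records.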

Review should distinguish recommendation from enforcement. Running VPA in Off mode can support sizing analysis without runtime disruption. Running Recreate, or InPlaceOrRecreate when it falls back to eviction, affects availability because pods may be terminated and recreated. That choice belongs beside PodDisruptionBudget, Cluster Autoscaler, ResourceQuota, Kueue, and PriorityClass policy.

Limits

VPA does not evaluate model quality, safety, fairness, or data rights. It does not know whether a workload is an AI model, an agent worker, or an ordinary service. Its evidence is resource usage and policy configuration.

It also has Kubernetes-specific boundaries. The upstream VPA README says VPA is not currently compatible with workloads that define pod-level resources stanzas, and describes failure cases where admission would exceed pod-level limits or requests. VPA can bring requests closer to observed usage, but it can also trigger disruption, raise requests until they collide with a namespace ResourceQuota, or resize a workload beyond available node capacity unless surrounding controls such as min and max allowed values are maintained.
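A review-time guard for the capacity risk can be sketched as follows. This is not VPA's own logic: check_recommendation is a hypothetical helper that caps a single container's recommended target to configured min and max allowed values and flags any result larger than the biggest node's allocatable resources. It assumes CPU is given in millicores and memory in bytes, and it ignores other containers and per-node overhead.

```python
def check_recommendation(target: dict, min_allowed: dict, max_allowed: dict,
                         node_allocatable: dict):
    """Cap a recommendation to policy bounds and flag capacity problems.

    All arguments map resource name -> integer amount in the same unit
    (assumed here: CPU in millicores, memory in bytes).
    Returns (applied_values, findings).
    """
    applied = {}
    findings = []
    for resource, wanted in target.items():
        low = min_allowed.get(resource, 0)
        high = max_allowed.get(resource, wanted)  # no cap configured: keep as-is
        value = max(low, min(wanted, high))
        applied[resource] = value
        if value != wanted:
            findings.append(f"{resource}: recommendation {wanted} capped to {value}")
        if value > node_allocatable.get(resource, float("inf")):
            findings.append(f"{resource}: request {value} exceeds largest node allocatable")
    return applied, findings


if __name__ == "__main__":
    applied, findings = check_recommendation(
        target={"cpu": 6000, "memory": 8 * 2**30},
        min_allowed={"cpu": 100},
        max_allowed={"cpu": 4000},
        node_allocatable={"cpu": 3900, "memory": 16 * 2**30},
    )
    print(applied)
    for line in findings:
        print(line)
```

In this invented example the CPU cap itself is larger than any node can offer, which is the kind of policy gap the paragraph above warns about.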

Source Discipline

Claims about vertical scaling, CRD status, component roles, update modes, metrics, and PodDisruptionBudget handling should cite Kubernetes vertical pod autoscaling documentation and the upstream VPA README or components documentation. Claims about AI systems should be framed as deployment governance inferences, not as documented VPA behavior.

Spiralist Reading

Spiralism reads VPA as a confession that manifests are guesses.

A resource request pretends to know the appetite of a future process. VPA watches the process eat and asks whether the promise should be rewritten.

Sources

Kubernetes Documentation, Vertical Pod Autoscaling, reviewed June 25, 2026.
Kubernetes Autoscaler, Vertical Pod Autoscaler README, reviewed June 25, 2026.
Kubernetes Autoscaler, Vertical Pod Autoscaler Components, reviewed June 25, 2026.
Kubernetes Autoscaler, Autoscaling components for Kubernetes, reviewed June 25, 2026.
