Wiki · Concept · Last reviewed June 25, 2026

Kubernetes HorizontalPodAutoscaler

Kubernetes HorizontalPodAutoscaler automatically changes replica counts for scalable workloads based on observed metrics.

Definition

Kubernetes HorizontalPodAutoscaler, often shortened to HPA, is a namespaced autoscaling/v2 API object and controller that automatically manages the replica count of a workload implementing the Kubernetes scale subresource. Horizontal scaling means adding or removing pod replicas; it differs from vertical scaling, which changes resources assigned to existing pods.

An HPA points at a target workload through scaleTargetRef, sets minReplicas and maxReplicas, and defines one or more metrics. Kubernetes documentation describes the HPA controller as periodically adjusting replicas to match observed resource utilization such as CPU or memory usage. Where autoscaling/v1 exposes only a CPU utilization target, autoscaling/v2 adds memory, custom metrics, and external metrics.
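The fields named above can be seen together in a small manifest. The following is a minimal sketch, not a recommended configuration: the workload name "embedding-api", the namespace "ml-serving", and the replica and utilization numbers are all hypothetical placeholders. It builds the object as a Python dictionary and prints it as JSON, which kubectl accepts alongside YAML.

```python
import json

# Hypothetical example values: the names, namespace, replica bounds, and the
# 60% CPU target are placeholders, not recommendations.
hpa_manifest = {
    "apiVersion": "autoscaling/v2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "embedding-api", "namespace": "ml-serving"},
    "spec": {
        # The workload whose scale subresource the HPA will write to.
        "scaleTargetRef": {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "name": "embedding-api",
        },
        "minReplicas": 2,
        "maxReplicas": 10,
        # One resource metric: average CPU utilization across pods.
        "metrics": [
            {
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    "target": {"type": "Utilization", "averageUtilization": 60},
                },
            }
        ],
    },
}

print(json.dumps(hpa_manifest, indent=2))
```

Additional entries in the metrics list would each produce their own replica recommendation.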

How It Works

The HPA control loop observes metrics, computes a desired replica count, and writes that count to the target workload's scale subresource. The core formula multiplies the current replica count by the ratio of the current metric value to the desired metric value and rounds the result up; a tolerance band around a ratio of 1 suppresses small oscillations. When multiple metrics are configured, the controller evaluates each metric and chooses the largest recommended replica count, bounded by the configured minimum and maximum.
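The calculation can be sketched in a few lines. This is a simplified illustration, not the controller's implementation: it ignores unready pods, missing metrics, and behavior policies, and the function names are hypothetical. The tolerance of 0.1 is used here as an assumed default; clusters may configure a different value.

```python
import math

def desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """Single-metric recommendation: scale replicas by the metric ratio, rounding up."""
    ratio = current_value / target_value
    # Inside the tolerance band the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

def recommend(current_replicas, metrics, min_replicas, max_replicas, tolerance=0.1):
    """Multi-metric recommendation: take the largest proposal, then clamp to bounds.

    metrics is a list of (current_value, target_value) pairs.
    """
    proposals = [
        desired_replicas(current_replicas, current, target, tolerance)
        for current, target in metrics
    ]
    return max(min_replicas, min(max_replicas, max(proposals)))

print(desired_replicas(4, 200, 100))                     # 8
print(desired_replicas(4, 105, 100))                     # 4 (within tolerance)
print(recommend(4, [(200, 100), (50, 100)], 2, 6))       # 6 (capped by max_replicas)
```

In the last call the first metric proposes 8 replicas and the second proposes 2; the larger proposal wins and is then capped at the configured maximum of 6.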

For CPU and memory utilization targets, pods need resource requests. Kubernetes calculates utilization as current usage divided by requested resources. The HPA walkthrough also notes that a cluster needs Metrics Server for resource metrics; Metrics Server collects resource metrics from kubelets and exposes them through the Kubernetes API. Custom and external metrics require the relevant metrics APIs to be available.
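A short sketch shows why missing requests break utilization targets. It assumes utilization is aggregated as total usage over total requests across the pods being considered, which is a simplification; the function name and the millicore figures are hypothetical.

```python
def cpu_utilization_percent(pods):
    """Return CPU utilization as a percentage of requested CPU.

    pods is a list of (usage_millicores, request_millicores) pairs.
    Aggregating as total usage over total requests is an assumption of this sketch.
    """
    total_usage = sum(usage for usage, _ in pods)
    total_request = sum(request for _, request in pods)
    if total_request == 0:
        # Without requests there is no denominator, so utilization is undefined.
        raise ValueError("utilization is undefined without resource requests")
    return 100 * total_usage / total_request

# Three pods, each requesting 500m CPU.
print(cpu_utilization_percent([(300, 500), (450, 500), (150, 500)]))  # 60.0
```

A pod that requests nothing contributes usage with no matching denominator, which is why a utilization target cannot be evaluated for workloads that omit requests.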

The behavior field can tune scale-up and scale-down with policies and stabilization windows. Kubernetes documentation also describes startup-related CPU metric handling through controller-manager flags such as --horizontal-pod-autoscaler-cpu-initialization-period and --horizontal-pod-autoscaler-initial-readiness-delay. Those settings help avoid treating warmup behavior as steady workload demand.
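The effect of a scale-down stabilization window can be sketched as follows. This is an illustration under stated assumptions: the function name and data layout are hypothetical, rate-limiting policies are ignored, and the 300-second window is an assumed default that a behavior field could override. The idea is that a scale-down uses the highest recommendation seen inside the window, so a brief dip in load does not immediately remove replicas.

```python
def stabilized_scale_down(history, current, now, window_seconds=300):
    """Pick the highest recommendation seen within the stabilization window.

    history is a list of (timestamp_seconds, recommended_replicas) pairs;
    current is the recommendation computed at time `now`.
    """
    recent = [replicas for ts, replicas in history if now - ts <= window_seconds]
    return max(recent + [current])

history = [(0, 10), (400, 8), (550, 6)]
# At t=600 the raw recommendation is 3, but 8 was recommended 200s ago.
print(stabilized_scale_down(history, current=3, now=600))                      # 8
# A shorter window forgets the older recommendation sooner.
print(stabilized_scale_down(history, current=3, now=600, window_seconds=100))  # 6
```

A longer window trades slower cost recovery for fewer replica flaps.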

Agent Context

For AI infrastructure, HPA is a demand governor. Model endpoints, embedding services, retrieval APIs, queue consumers, browser-agent workers, safety classifiers, and tool servers may all experience bursty load. An HPA can add replicas when CPU, memory, request pressure, queue depth, or another exposed metric suggests that the current fleet is not enough.

The governance question is metric choice. Scaling on CPU may be reasonable for a CPU-bound service, but a model-serving endpoint may bottleneck on GPU memory, token generation latency, queue age, external rate limits, or downstream retrieval. If the metric is wrong, autoscaling can spend more money without improving service quality, or scale down while human-facing work still waits.

Governance Use

A governance-grade HPA record should preserve the target workload, owner, namespace, metric sources, thresholds, minimum and maximum replicas, scale-up and scale-down behavior, metrics provider, expected cold-start time, capacity assumptions, and rollback path. For AI services, it should identify whether the target supports production inference, evaluation, content review, data processing, or experimental agent execution.
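One way to keep such a record reviewable is to give it a fixed shape. The dataclass below is a hypothetical sketch: the class name, field names, and the missing_fields helper are illustrative and not part of any Kubernetes API. It mirrors the items listed above and flags entries left empty.

```python
from dataclasses import dataclass, fields
from typing import Dict, List

@dataclass
class HpaReviewRecord:
    """Hypothetical review record; field names are illustrative only."""
    target_workload: str
    owner: str
    namespace: str
    metric_sources: List[str]
    thresholds: Dict[str, float]
    min_replicas: int
    max_replicas: int
    scaling_behavior: str
    metrics_provider: str
    expected_cold_start_seconds: float
    capacity_assumptions: str
    rollback_path: str
    service_role: str  # e.g. production inference, evaluation, experimental agents

    def missing_fields(self):
        """Return the names of fields left as None or empty."""
        return [
            f.name for f in fields(self)
            if getattr(self, f.name) in (None, "", [], {})
        ]

record = HpaReviewRecord(
    target_workload="Deployment/embedding-api",  # placeholder values throughout
    owner="ml-platform",
    namespace="ml-serving",
    metric_sources=["cpu"],
    thresholds={"cpu_utilization_percent": 60},
    min_replicas=2,
    max_replicas=10,
    scaling_behavior="default",
    metrics_provider="metrics-server",
    expected_cold_start_seconds=45.0,
    capacity_assumptions="one replica handles the measured steady-state load",
    rollback_path="",
    service_role="production inference",
)
print(record.missing_fields())  # ['rollback_path']
```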

Review should also name the cost and safety boundaries. A high maximum replica count can absorb traffic, but it can also multiply cloud spend, API calls, model invocations, and downstream data access. A low maximum can protect budgets while creating queues or degraded service. HPA policy belongs in the same review packet as ResourceQuota, PodDisruptionBudget, and incident runbooks.

Limits

HPA does not understand whether an AI request is important, abusive, lawful, or safe. It does not verify model outputs, authorize tools, inspect prompts, or allocate new nodes by itself. It changes desired replica counts for a target workload based on the metrics it can see.

It also depends on measurement. Missing or delayed metrics, absent resource requests, bad readiness probes, noisy custom metrics, or overloaded metric adapters can produce poor scaling decisions. Horizontal scaling may not help a workload constrained by a single database, rate-limited provider, shared GPU, license cap, or serialized task queue.

Source Discipline

Claims about HPA behavior should cite the Kubernetes Horizontal Pod Autoscaling concept page, the autoscaling/v2 API reference, the Autoscaling Workloads overview, and the HPA walkthrough. Claims about AI service impact should be labeled as deployment analysis, not as claims that Kubernetes evaluates model risk or social priority.

A serious HPA review should include observed metrics, target values, scale events, saturation signals, cost effects, and failure cases. The YAML alone is not evidence that scaling is useful.

Spiralist Reading

Spiralism reads HorizontalPodAutoscaler as an automated appetite. It watches pressure and asks the system to grow or shrink.

For agent fleets, that appetite needs policy supplied by operators: which demand deserves more machines, which demand should be refused, and which metric quietly becomes command.
