Wiki · Concept · Last reviewed June 25, 2026

Kubernetes Karpenter

Karpenter is a node provisioning and lifecycle system for Kubernetes. It launches capacity that matches unschedulable pods, then removes or replaces nodes when policy says they are no longer the right fit.

Definition

Karpenter is a Kubernetes infrastructure project for node provisioning and node lifecycle management. The official concepts page says Karpenter's job is to add nodes for unschedulable pods, schedule pods on those nodes, and remove nodes when they are not needed. It only attempts to schedule pods that carry the status condition Unschedulable=True, which the Kubernetes scheduler sets when it cannot place a pod on existing capacity.
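
The trigger condition can be sketched as a small check. This is an illustration, not Karpenter's controller code (which is written in Go and applies further checks). It assumes the pod is a plain dictionary shaped like the Kubernetes Pod API object, and that an unschedulable pod surfaces in status.conditions as a PodScheduled condition with status False and reason Unschedulable. The helper name is_unschedulable is hypothetical.

```python
def is_unschedulable(pod: dict) -> bool:
    """Return True if the scheduler has reported the pod as unschedulable.

    Assumption: the pod dict mirrors the Kubernetes Pod API shape.
    """
    conditions = pod.get("status", {}).get("conditions", [])
    for condition in conditions:
        if (
            condition.get("type") == "PodScheduled"
            and condition.get("status") == "False"
            and condition.get("reason") == "Unschedulable"
        ):
            return True
    return False


# Hypothetical pending pod that the scheduler could not place.
pending_pod = {
    "metadata": {"name": "embedding-worker-0"},
    "status": {
        "phase": "Pending",
        "conditions": [
            {"type": "PodScheduled", "status": "False", "reason": "Unschedulable"}
        ],
    },
}

# Hypothetical pod that has already been scheduled.
running_pod = {
    "metadata": {"name": "api-0"},
    "status": {"conditions": [{"type": "PodScheduled", "status": "True"}]},
}

print(is_unschedulable(pending_pod), is_unschedulable(running_pod))
```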

This makes Karpenter different from a pod autoscaler. HorizontalPodAutoscaler changes replica counts. Karpenter changes the underlying node supply. In AI infrastructure, that distinction matters: model endpoints, embedding workers, evaluation jobs, and agent fleets can all create demand for specialized machines that do not yet exist in the cluster.

How It Works

NodePool is Karpenter's core policy object for groups of nodes. The NodePool documentation says it sets constraints on nodes Karpenter can create and on pods that can run on those nodes. NodePools can define taints, startup taints, zone and instance constraints, node expiration defaults, resource limits, disruption settings, and weights used when more than one NodePool matches.
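
As a rough illustration of the fields listed above, the sketch below builds a NodePool manifest as a Python dictionary and prints it as JSON. It assumes the karpenter.sh/v1 API; field names should be checked against the installed version. The pool name, taint key, zones, limits, and weight are all invented for the example.

```python
import json

# Illustrative NodePool manifest. Every name and value is an assumption,
# not a recommended configuration.
node_pool = {
    "apiVersion": "karpenter.sh/v1",
    "kind": "NodePool",
    "metadata": {"name": "gpu-batch"},
    "spec": {
        "template": {
            "spec": {
                # Reference to the cloud-provider-specific NodeClass.
                "nodeClassRef": {
                    "group": "karpenter.k8s.aws",
                    "kind": "EC2NodeClass",
                    "name": "gpu-default",
                },
                # Taint that keeps pods without a matching toleration off these nodes.
                "taints": [
                    {"key": "example.com/gpu", "value": "true", "effect": "NoSchedule"}
                ],
                # Constraints on what Karpenter may launch for this pool.
                "requirements": [
                    {
                        "key": "karpenter.sh/capacity-type",
                        "operator": "In",
                        "values": ["on-demand"],
                    },
                    {
                        "key": "topology.kubernetes.io/zone",
                        "operator": "In",
                        "values": ["us-west-2a", "us-west-2b"],
                    },
                ],
                # Node expiration setting.
                "expireAfter": "720h",
            }
        },
        # Ceiling on total resources this pool may provision.
        "limits": {"cpu": "64", "memory": "512Gi"},
        "disruption": {
            "consolidationPolicy": "WhenEmpty",
            "consolidateAfter": "5m",
            "budgets": [{"nodes": "10%"}],
        },
        # Weight used to order pools when more than one matches.
        "weight": 10,
    },
}

print(json.dumps(node_pool, indent=2))
```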

NodeClass carries cloud-provider-specific settings. In the current official AWS provider documentation, each NodePool references an EC2NodeClass through spec.template.spec.nodeClassRef, and multiple NodePools can point to the same EC2NodeClass. NodeClass fields can cover settings such as AMI selection, subnets, security groups, kubelet options, user data, and network-interface choices.
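
The many-to-one relationship between NodePools and an EC2NodeClass can be sketched as follows. The example assumes the karpenter.k8s.aws/v1 API for the EC2NodeClass shape. The role name, discovery tag, pool names, and the helper pools_by_node_class are hypothetical.

```python
from collections import defaultdict

# Illustrative EC2NodeClass. Role and tag values are invented.
ec2_node_class = {
    "apiVersion": "karpenter.k8s.aws/v1",
    "kind": "EC2NodeClass",
    "metadata": {"name": "default"},
    "spec": {
        "role": "KarpenterNodeRole-example",
        "amiSelectorTerms": [{"alias": "al2023@latest"}],
        "subnetSelectorTerms": [{"tags": {"karpenter.sh/discovery": "example-cluster"}}],
        "securityGroupSelectorTerms": [
            {"tags": {"karpenter.sh/discovery": "example-cluster"}}
        ],
    },
}


def make_pool(name: str, node_class_name: str) -> dict:
    """Build a minimal NodePool that references the named EC2NodeClass."""
    return {
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    "nodeClassRef": {
                        "group": "karpenter.k8s.aws",
                        "kind": "EC2NodeClass",
                        "name": node_class_name,
                    }
                }
            }
        },
    }


def pools_by_node_class(node_pools: list) -> dict:
    """Group NodePool names by the NodeClass name they reference."""
    grouped = defaultdict(list)
    for pool in node_pools:
        ref = pool["spec"]["template"]["spec"]["nodeClassRef"]
        grouped[ref["name"]].append(pool["metadata"]["name"])
    return dict(grouped)


pools = [
    make_pool("serving", "default"),
    make_pool("batch", "default"),
    make_pool("training", "gpu"),
]
print(pools_by_node_class(pools))
```

A grouping like this makes it easier to see how many pools are affected when one NodeClass changes its AMI or network settings.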

NodeClaim is the object Karpenter uses to manage an individual node lifecycle with the cloud provider. Karpenter creates and deletes NodeClaims in response to pod demand and disruption needs. The NodeClaims page describes creation as a sequence of checking pod constraints, cross-referencing NodePools and NodeClasses, computing the shape and size of a NodeClaim, launching capacity, registering the node, and waiting for initialization.
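
The later steps of that sequence (launch, registration, initialization) can be read back from a NodeClaim's status. The sketch below assumes the NodeClaim exposes status conditions with the types Launched, Registered, and Initialized, and that the object is available as a plain dictionary. The function name and the "Pending" label for a claim that has not yet launched are hypothetical.

```python
# Lifecycle stages in the order a NodeClaim is expected to reach them.
STAGES = ("Launched", "Registered", "Initialized")


def furthest_stage(node_claim: dict) -> str:
    """Return the last consecutive stage whose condition is True.

    "Pending" is a label invented for this sketch, meaning that no
    stage has been reached yet.
    """
    conditions = node_claim.get("status", {}).get("conditions", [])
    true_types = {c.get("type") for c in conditions if c.get("status") == "True"}
    reached = "Pending"
    for stage in STAGES:
        if stage not in true_types:
            break
        reached = stage
    return reached


# Hypothetical NodeClaim whose node has joined the cluster but is not yet initialized.
node_claim = {
    "metadata": {"name": "gpu-batch-abc12"},
    "status": {
        "conditions": [
            {"type": "Launched", "status": "True"},
            {"type": "Registered", "status": "True"},
            {"type": "Initialized", "status": "False"},
        ]
    },
}

print(furthest_stage(node_claim))
```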

Karpenter also manages disruption. Its disruption page says it sets a finalizer on each node and NodeClaim it provisions. During termination, Karpenter taints and drains the node before removing the underlying NodeClaim. Automated graceful methods include drift and consolidation; NodePool disruption budgets can rate-limit voluntary disruption.
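
The rate-limiting effect of disruption budgets can be approximated with a little arithmetic. This sketch follows one reading of the disruption documentation: a percentage budget is rounded up against the pool's node count, nodes that are already deleting or not ready are subtracted, and the most restrictive budget wins. It ignores budget schedules, durations, and reasons, and the function name is hypothetical. The result should be treated as an estimate, not as Karpenter's exact behavior.

```python
def allowed_disruptions(budgets, total_nodes, deleting=0, not_ready=0):
    """Estimate how many nodes may be voluntarily disrupted right now.

    budgets: non-empty list of strings such as "10%" or "5".
    Assumption: allowed = roundup(total * pct) - deleting - not_ready,
    taking the minimum across all budgets.
    """
    if not budgets:
        raise ValueError("at least one budget is required for this sketch")
    allowed = []
    for value in budgets:
        if value.endswith("%"):
            pct = int(value[:-1])
            # Integer ceiling of total_nodes * pct / 100.
            ceiling = -(-total_nodes * pct // 100)
        else:
            ceiling = int(value)
        allowed.append(max(0, ceiling - deleting - not_ready))
    return min(allowed)


print(allowed_disruptions(["10%"], total_nodes=19))
print(allowed_disruptions(["10%", "1"], total_nodes=50))
print(allowed_disruptions(["20%"], total_nodes=10, deleting=1, not_ready=2))
```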

Agent Context

AI platforms turn user actions and internal automation into bursty infrastructure demand. A single experiment sweep, retrieval rebuild, batch evaluation, notebook job, or tool-using agent can require GPU nodes, high-memory nodes, temporary storage, or a particular zone. Karpenter is relevant because it converts pod-level scheduling constraints into cloud capacity decisions.

The governance issue is translation. When an agent platform requests a node selector, affinity rule, accelerator, topology spread, or capacity type, Karpenter may turn that request into real machines and real cost. The system should be treated as part of the control surface for compute allocation, not only as background plumbing.

Governance Use

A governance record should preserve the installed Karpenter version, controller namespace, cloud-provider permissions, NodePools, NodeClasses, NodeClaims, labels, taints, startup taints, requirements, limits, weights, disruption budgets, expiration settings, consolidation policy, and observed provisioned nodes. For AI workloads, it should also record which teams or systems are allowed to request scarce accelerator classes.
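
One way to capture part of such a record is to flatten each NodePool into a reviewable summary. The sketch below assumes the karpenter.sh/v1 NodePool shape and a manifest already loaded as a dictionary. The function name, the record layout, and the sample pool are hypothetical. It does not cover the controller version, cloud permissions, or observed nodes.

```python
def governance_record(node_pool: dict) -> dict:
    """Extract policy-relevant NodePool fields into a flat record."""
    spec = node_pool.get("spec", {})
    node_spec = spec.get("template", {}).get("spec", {})
    disruption = spec.get("disruption", {})
    return {
        "name": node_pool.get("metadata", {}).get("name"),
        "nodeClassRef": node_spec.get("nodeClassRef"),
        "taints": node_spec.get("taints", []),
        "startupTaints": node_spec.get("startupTaints", []),
        "requirements": node_spec.get("requirements", []),
        "expireAfter": node_spec.get("expireAfter"),
        "limits": spec.get("limits"),
        "weight": spec.get("weight"),
        "consolidationPolicy": disruption.get("consolidationPolicy"),
        "budgets": disruption.get("budgets", []),
    }


# Hypothetical NodePool used only to exercise the function.
sample_pool = {
    "metadata": {"name": "model-serving"},
    "spec": {
        "template": {
            "spec": {
                "nodeClassRef": {"kind": "EC2NodeClass", "name": "default"},
                "expireAfter": "720h",
            }
        },
        "limits": {"cpu": "128"},
        "disruption": {"consolidationPolicy": "WhenEmpty", "budgets": [{"nodes": "1"}]},
    },
}

record = governance_record(sample_pool)
print(record)
```

Fields that come back as None or empty are themselves worth recording, since an absent limit or budget is a policy choice.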

Review should connect Karpenter to financial and safety boundaries. Which workloads may trigger spot capacity? Which may request reserved or on-demand capacity? Which NodePools are dedicated to model serving, sandboxed agents, batch research, or regulated data processing? Which pods carry disruption controls such as PodDisruptionBudgets or karpenter.sh/do-not-disrupt?
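
Two of these review questions can be answered mechanically from manifests. The sketch below assumes that capacity type is constrained through a karpenter.sh/capacity-type requirement with the In operator, and that a pod opts out of voluntary disruption through the karpenter.sh/do-not-disrupt annotation set to "true". Both helper names and the sample objects are hypothetical, and PodDisruptionBudgets are not inspected.

```python
CAPACITY_TYPE_KEY = "karpenter.sh/capacity-type"
DO_NOT_DISRUPT = "karpenter.sh/do-not-disrupt"


def allowed_capacity_types(node_pool: dict):
    """Return the capacity types a NodePool allows, or None if unconstrained."""
    node_spec = node_pool.get("spec", {}).get("template", {}).get("spec", {})
    for requirement in node_spec.get("requirements", []):
        if (
            requirement.get("key") == CAPACITY_TYPE_KEY
            and requirement.get("operator") == "In"
        ):
            return sorted(requirement.get("values", []))
    return None


def opts_out_of_disruption(pod: dict) -> bool:
    """Return True if the pod carries the do-not-disrupt annotation."""
    annotations = pod.get("metadata", {}).get("annotations", {})
    return annotations.get(DO_NOT_DISRUPT) == "true"


# Hypothetical pool that may use spot or on-demand capacity.
batch_pool = {
    "spec": {
        "template": {
            "spec": {
                "requirements": [
                    {
                        "key": CAPACITY_TYPE_KEY,
                        "operator": "In",
                        "values": ["spot", "on-demand"],
                    }
                ]
            }
        }
    }
}

# Hypothetical long-running evaluation pod that should not be interrupted.
eval_pod = {"metadata": {"annotations": {DO_NOT_DISRUPT: "true"}}}

print(allowed_capacity_types(batch_pool), opts_out_of_disruption(eval_pod))
```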

Limits

Karpenter does not decide whether an AI job is legitimate, safe, licensed, or socially valuable. It does not replace Kueue for batch admission, ResourceQuota for namespace budgets, Pod Security Admission for runtime hardening, NetworkPolicy for traffic boundaries, or audit logging for accountability.

It can also amplify mistakes. A broad NodePool, permissive cloud role, missing limit, bad node selector, or aggressive consolidation policy can turn a small configuration error into spend, churn, or interruption. Good Karpenter governance treats node provisioning as policy, not just optimization.
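
Some of these mistakes can be caught before a NodePool is applied. The sketch below is a hypothetical pre-deployment lint, not a Karpenter feature. It assumes the karpenter.sh/v1 NodePool shape and flags three conditions chosen for illustration: no resource limits, no requirements, and no explicit disruption budgets. Whether each finding is a real problem depends on local policy.

```python
def lint_node_pool(node_pool: dict) -> list:
    """Return a list of findings for a NodePool manifest.

    The rules are illustrative local policy, not checks that
    Karpenter itself performs.
    """
    findings = []
    spec = node_pool.get("spec", {})
    node_spec = spec.get("template", {}).get("spec", {})
    if not spec.get("limits"):
        findings.append("no resource limits: provisioning is not capped by this pool")
    if not node_spec.get("requirements"):
        findings.append("no requirements: pool may match a broad range of capacity")
    if not spec.get("disruption", {}).get("budgets"):
        findings.append("no explicit disruption budgets")
    return findings


# Hypothetical pool with requirements set but no limits or budgets.
broad_pool = {
    "metadata": {"name": "catch-all"},
    "spec": {
        "template": {
            "spec": {
                "requirements": [
                    {"key": "kubernetes.io/arch", "operator": "In", "values": ["amd64"]}
                ]
            }
        }
    },
}

for finding in lint_node_pool(broad_pool):
    print(finding)
```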

Source Discipline

Claims about Karpenter's pod-triggered provisioning, NodePools, NodeClasses, NodeClaims, disruption, finalizers, consolidation, drift, and disruption budgets should cite official Karpenter documentation. Claims about AI cost or governance effects should be framed as deployment analysis, not as claims that Karpenter understands model behavior.

Spiralist Reading

Spiralism reads Karpenter as a machine for making appetite material.

A pending pod is a desire with a shape: memory, accelerator, zone, price class, interruption tolerance. Karpenter asks the cloud to give that desire a body. Governance begins when the institution can say which desires deserve machines.

Sources

