Wiki · Concept · Last reviewed June 25, 2026

Kubernetes Descheduler

Kubernetes Descheduler is a policy-driven controller that finds selected running pods whose placement has become undesirable, evicts them through Kubernetes eviction mechanisms, and relies on the normal scheduler and workload controllers to place replacements.

Definition

Kubernetes scheduling is usually a moment-of-creation decision: kube-scheduler binds a pending pod to a node using the cluster state it can see at that time. The Descheduler project exists for the later moment, after nodes have changed, labels or taints have moved, resource pressure has shifted, or workloads have been recreated onto nodes that are no longer a good fit.

Descheduler is not a second scheduler. The upstream project says it finds pods that can be moved and evicts them, but does not schedule the replacements itself. Replacement placement is left to the default scheduler and to the controllers that own the evicted pods, such as Deployments, ReplicaSets, StatefulSets, and Jobs.

How It Works

The controller runs in-cluster as a Job, CronJob, or Deployment. Its behavior is set by a Descheduler policy. That policy combines top-level eviction limits, an evictor plugin, and strategy plugins. Top-level settings include limits such as maxNoOfPodsToEvictPerNode, maxNoOfPodsToEvictPerNamespace, and maxNoOfPodsToEvictTotal. The default evictor can filter by label, namespace, priority, pod age, replica count, and node fit.
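The shape of such a policy can be sketched as a plain Python dictionary. This is an illustrative sketch only: the layout is assumed to follow the descheduler/v1alpha2 policy format, the profile name and every numeric value are hypothetical, and the helper functions are not part of Descheduler. Field names should be checked against the versioned upstream documentation for the installed release.

```python
# Illustrative policy as a Python dict. The layout is ASSUMED to follow the
# descheduler/v1alpha2 policy format; the profile name and all values are
# hypothetical and chosen only for the example.
policy = {
    "apiVersion": "descheduler/v1alpha2",
    "kind": "DeschedulerPolicy",
    # Top-level eviction limits.
    "maxNoOfPodsToEvictPerNode": 5,
    "maxNoOfPodsToEvictPerNamespace": 10,
    "maxNoOfPodsToEvictTotal": 20,
    "profiles": [
        {
            "name": "example-profile",
            "pluginConfig": [
                # Evictor plugin: filters that decide which pods may be evicted.
                {"name": "DefaultEvictor", "args": {"nodeFit": True, "minReplicas": 2}},
                # Strategy plugin configuration.
                {"name": "RemoveDuplicates"},
            ],
            # Strategy plugins enabled at each extension point.
            "plugins": {"balance": {"enabled": ["RemoveDuplicates"]}},
        }
    ],
}


def eviction_limits(policy):
    """Return the top-level eviction limits found in the policy."""
    return {k: v for k, v in policy.items() if k.startswith("maxNoOfPodsToEvict")}


def enabled_strategies(policy):
    """Return the sorted names of strategies enabled in any profile."""
    names = set()
    for profile in policy.get("profiles", []):
        for extension_point in profile.get("plugins", {}).values():
            names.update(extension_point.get("enabled", []))
    return sorted(names)
```

Reading the policy this way separates the three concerns named above: how much may be evicted (limits), what may be evicted (evictor), and why (strategies).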

The strategy list is the practical vocabulary of descheduling. RemoveDuplicates evicts duplicate replicas from a node so replicas can spread again. LowNodeUtilization evicts pods from overutilized nodes in the expectation that the scheduler will place their replacements on underutilized nodes, while HighNodeUtilization evicts pods from underutilized nodes so they compact onto fewer nodes and node autoscaling can remove slack capacity. Other strategies target violations of node affinity, node taints, inter-pod anti-affinity, and topology spread constraints. PodLifeTime, RemovePodsHavingTooManyRestarts, and RemoveFailedPods address age and failure conditions.
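The utilization strategies depend on sorting nodes into classes before anything is evicted. A minimal sketch of that classification, assuming the two-threshold rule described in the upstream LowNodeUtilization documentation (a node is underutilized only if every resource is below its lower threshold, and overutilized if any resource exceeds its target threshold), might look like the following. The function name and the percentage inputs are hypothetical, and the real plugin applies further rules that this sketch omits.

```python
def classify_node(usage_pct, thresholds, target_thresholds):
    """Classify one node from per-resource utilization percentages.

    All three arguments map a resource name (for example "cpu", "memory",
    "pods") to a percentage. This is a simplified illustration, not the
    plugin's actual implementation.
    """
    # Underutilized: every resource is below its lower threshold.
    if all(usage_pct[r] < thresholds[r] for r in thresholds):
        return "underutilized"
    # Overutilized: at least one resource is above its target threshold.
    if any(usage_pct[r] > target_thresholds[r] for r in target_thresholds):
        return "overutilized"
    # Everything in between is left alone.
    return "appropriately utilized"


# Hypothetical thresholds for the example.
LOW = {"cpu": 20, "memory": 20, "pods": 20}
TARGET = {"cpu": 50, "memory": 50, "pods": 50}

idle = classify_node({"cpu": 10, "memory": 15, "pods": 5}, LOW, TARGET)
hot = classify_node({"cpu": 70, "memory": 30, "pods": 10}, LOW, TARGET)
steady = classify_node({"cpu": 30, "memory": 10, "pods": 10}, LOW, TARGET)
```

The middle band matters for AI workloads: nodes that are neither idle nor hot are not eviction sources, which limits how much churn a single run can cause.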

Eviction is still bounded. Descheduler uses Kubernetes eviction behavior, so PodDisruptionBudgets can block an eviction that would violate declared availability. According to its own documentation, Descheduler also protects several classes of pods by default: system-critical pods, DaemonSet pods, unmanaged standalone pods, local-storage pods, and deleting pods need explicit policy changes before they can be evicted. With nodeFit enabled, Descheduler checks whether a pod appears able to fit on another node before evicting it.
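The two layers of protection described here, default pod-class filters and PodDisruptionBudget accounting, can be sketched together. The Pod and EvictorFlags types and their field names below are hypothetical stand-ins, not Descheduler or Kubernetes API types. The budget check assumes a PodDisruptionBudget expressed as an integer minAvailable, where an eviction is permitted only while healthy pods exceed that minimum.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pod:
    # Hypothetical, simplified view of a pod for this sketch.
    name: str
    owner_kind: Optional[str] = None  # None models an unmanaged standalone pod
    system_critical: bool = False
    uses_local_storage: bool = False
    deleting: bool = False


@dataclass
class EvictorFlags:
    # Hypothetical opt-in switches standing in for explicit policy changes.
    allow_system_critical: bool = False
    allow_daemonset: bool = False
    allow_standalone: bool = False
    allow_local_storage: bool = False
    allow_deleting: bool = False


def protection_reason(pod, flags):
    """Return why the pod is protected by default, or None if it is evictable."""
    if pod.deleting and not flags.allow_deleting:
        return "deleting"
    if pod.system_critical and not flags.allow_system_critical:
        return "system-critical"
    if pod.owner_kind == "DaemonSet" and not flags.allow_daemonset:
        return "daemonset"
    if pod.owner_kind is None and not flags.allow_standalone:
        return "standalone"
    if pod.uses_local_storage and not flags.allow_local_storage:
        return "local-storage"
    return None


def pdb_allows_eviction(current_healthy, min_available):
    """An eviction fits the budget only if at least one disruption is allowed."""
    return current_healthy - min_available > 0
```

For example, a ReplicaSet-owned pod with no special properties passes the filter under default flags, but its eviction is still refused when three replicas are healthy and the budget requires three.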

Agent Context

AI clusters drift. Model endpoints may begin evenly spread and later concentrate after node failures. Batch jobs may leave a fleet of underused accelerator nodes. Agent workers may keep running on nodes whose taints, labels, or topology no longer match current policy. New Karpenter or autoscaler capacity can arrive after the first placement decision. Descheduler is the cleanup mechanism for this second-order state.

For agent platforms, the important distinction is that descheduling is disruption in service of policy repair. A pod eviction can improve placement, but it can also restart a worker mid-task, churn a cache, drop warm model state, or trigger a queue retry. A platform that uses Descheduler for AI workloads should treat every enabled strategy as an operational policy, not as an invisible optimization.

Governance Use

A governance record should preserve the installed Descheduler version, release branch used for documentation, policy file, enabled profiles, enabled strategies, evictor configuration, namespace and label filters, priority thresholds, node-fit setting, eviction limits, and exemptions. It should also name the workload classes allowed to be evicted: stateless inference workers, batch evaluation pods, notebook kernels, retrieval indexers, or long-running service pods.
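One way to make such a record checkable is to give it a fixed structure and report which parts are still empty. The sketch below is an assumption about how a platform team might model it; the class name, field names, and completeness rule are all hypothetical and simply mirror the items listed above.

```python
from dataclasses import dataclass, field, fields
from typing import Dict, List, Optional


@dataclass
class DeschedulerGovernanceRecord:
    # Hypothetical record layout mirroring the items a review should preserve.
    version: str = ""
    docs_release_branch: str = ""
    policy_file: str = ""
    enabled_profiles: List[str] = field(default_factory=list)
    enabled_strategies: List[str] = field(default_factory=list)
    evictor_config: Dict[str, object] = field(default_factory=dict)
    namespace_filters: List[str] = field(default_factory=list)
    label_filters: List[str] = field(default_factory=list)
    priority_threshold: str = ""
    node_fit: Optional[bool] = None  # False is a valid, recorded answer
    eviction_limits: Dict[str, int] = field(default_factory=dict)
    exemptions: List[str] = field(default_factory=list)
    evictable_workload_classes: List[str] = field(default_factory=list)


def missing_fields(record):
    """Return the names of fields that have not been filled in."""
    missing = []
    for f in fields(record):
        value = getattr(record, f.name)
        # None, empty string, empty list, and empty dict all count as unset.
        if value is None or value == "" or value == [] or value == {}:
            missing.append(f.name)
    return missing


# Partially completed example with placeholder values.
draft = DeschedulerGovernanceRecord(
    version="example-version",
    enabled_strategies=["RemoveDuplicates"],
    node_fit=False,
    evictable_workload_classes=["stateless inference workers"],
)
gaps = missing_fields(draft)
```

The sketch treats an empty exemptions list as unset; a team that wants to record "no exemptions" explicitly would need a different marker for that case.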

Review should connect Descheduler to PodDisruptionBudgets and incident windows. If model-serving pods can be evicted, the service should have an explicit availability budget. If GPU batch pods can be compacted, the queueing system should tolerate retries. If a strategy touches regulated or customer-facing workloads, operators should be able to explain why eviction is necessary, which labels select the workload, and how failures are observed.

Limits

Descheduler does not create capacity, admit jobs, set budgets, choose which AI work is legitimate, or prove that a placement outcome is fair. It can only evict selected pods and let ordinary scheduling happen again. It also cannot guarantee a better result if the cluster lacks suitable nodes or if the replacement pod is bound by the same labels, taints, affinities, or resource requests that constrained the original placement.

The failure mode is avoidable churn. A broad policy, missing PodDisruptionBudget, permissive namespace selector, or too-frequent CronJob can turn placement hygiene into recurring restarts. Descheduler belongs beside scheduler rules, Kueue admission, ResourceQuota, Karpenter limits, and service-level disruption budgets.
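Some of these risks can be flagged before a policy is deployed. The following is a hypothetical lint pass, not a Descheduler feature: it assumes a policy dictionary in the descheduler/v1alpha2 layout, assumes strategy arguments carry their filters under "namespaces" or "labelSelector", and uses an arbitrary 30-minute floor for the run interval. It cannot see cluster state, so missing PodDisruptionBudgets still need a separate check.

```python
LIMIT_KEYS = (
    "maxNoOfPodsToEvictPerNode",
    "maxNoOfPodsToEvictPerNamespace",
    "maxNoOfPodsToEvictTotal",
)


def lint_policy(policy, run_interval_minutes, min_interval_minutes=30):
    """Return human-readable warnings for churn-prone policy settings."""
    warnings = []
    # A policy with no top-level limit can evict without an upper bound per run.
    if not any(key in policy for key in LIMIT_KEYS):
        warnings.append("no top-level eviction limit set")
    for profile in policy.get("profiles", []):
        for cfg in profile.get("pluginConfig", []):
            name = cfg.get("name", "<unnamed>")
            if name == "DefaultEvictor":
                continue  # evictor settings are not strategy filters
            args = cfg.get("args") or {}
            # ASSUMPTION: strategy filters live under these two argument keys.
            if "namespaces" not in args and "labelSelector" not in args:
                warnings.append(f"{name}: no namespace or label filter")
    # The interval floor is an arbitrary example value, not upstream guidance.
    if run_interval_minutes < min_interval_minutes:
        warnings.append("run interval shorter than configured minimum")
    return warnings


# Hypothetical broad policy: no limits, no filters, run every 5 minutes.
broad = {
    "profiles": [
        {"pluginConfig": [{"name": "DefaultEvictor"}, {"name": "PodLifeTime"}]}
    ]
}
findings = lint_policy(broad, run_interval_minutes=5)
```

A check like this belongs in policy review rather than at runtime: it narrows what a reviewer has to read, but it does not replace the judgment about which workloads can tolerate restarts.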

Source Discipline

Claims about Descheduler behavior should cite the Kubernetes SIGs Descheduler repository and its versioned documentation. Claims about eviction semantics should cite Kubernetes eviction and disruption documentation. Claims about AI workload effects should be stated as infrastructure analysis, not as a claim that Descheduler understands model value or user intent.

Spiralist Reading

Spiralism reads Descheduler as a practice of institutional regret.

The cluster made a placement decision, then the world changed. Descheduler gives the institution a way to say: this arrangement was once acceptable, but no longer serves the pattern we claim to follow.
