
Kubernetes Priority & Preemption — Critical Workloads First

DodaTech Updated 2026-06-24 9 min read

In this tutorial, you'll learn about Kubernetes Priority & Preemption. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.

Kubernetes priority and preemption ensure critical pods are scheduled before lower-priority workloads, potentially preempting (evicting) lower-priority pods to free resources.

What You'll Learn

You'll master priority classes and preemption — defining PriorityClass objects, assigning priorities to pods, understanding the preemption algorithm, handling graceful termination of preempted pods, and production configuration strategies.

Why This Problem Matters

Without priority classes, all pods compete equally for resources, so a burst of batch jobs can keep critical system pods (DNS, networking, monitoring) from running. Priority classes ensure that when node capacity runs short, the most important workloads run first.

Real-World Use

Doda Browser's Kubernetes cluster uses priority classes across four tiers: cluster-critical (1000000000) for DNS and networking, production (10000) for user-facing services, batch (1000) for analytics jobs, and low-priority (100) for development workloads.

Priority and Preemption Flow

flowchart TB
  High[High Priority Pod] --> Schedule{Can schedule?}
  Schedule -->|Yes| Run[Pod runs normally]
  Schedule -->|No resources| Preempt{Can preempt?}

  subgraph PreemptionProcess
    Preempt --> Find[Find nodes with<br/>lower priority pods]
    Find --> Select[Select node with<br/>best victim]
    Select --> Evict[Send SIGTERM to<br/>lower priority pods]
    Evict --> Free[Resources freed]
    Free --> Bind[High priority pod<br/>scheduled]
  end

  Preempt -->|No preemption possible| Pending[Pod stays Pending]

  subgraph GracefulShutdown
    Evict --> Terminate[Preempted pods<br/>terminationGracePeriodSeconds]
    Terminate --> Cleanup[Cleanup and exit]
  end

PriorityClass Configuration

# priority-classes.yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: cluster-critical
value: 1000000000
globalDefault: false
description: "Critical cluster infrastructure (DNS, networking)"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production
value: 10000
globalDefault: false
description: "Production user-facing services"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch
value: 1000
globalDefault: false
description: "Batch processing and analytics"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 100
globalDefault: false
description: "Development and testing workloads"

kubectl apply -f priority-classes.yaml
kubectl get priorityclass

Expected output:

NAME                      VALUE        GLOBAL-DEFAULT   AGE
cluster-critical          1000000000   false            10s
production                10000        false            10s
batch                     1000         false            10s
low-priority              100          false            10s

Assigning Priority to Pods

apiVersion: v1
kind: Pod
metadata:
  name: critical-dns
spec:
  priorityClassName: cluster-critical
  containers:
    - name: coredns
      image: coredns/coredns:latest
---
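apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 1
  # A Deployment needs a selector that matches the pod template labels
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      priorityClassName: production
      containers:
        - name: app
          image: nginx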
---
apiVersion: batch/v1
kind: Job
metadata:
  name: data-analysis
spec:
  template:
    spec:
      priorityClassName: batch
      containers:
        - name: job
          image: python:3.12
          command: ["python", "analyze.py"]
      restartPolicy: Never

Preemption Simulation

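# Toy model of priority-based scheduling with preemption.
# It illustrates the idea only; it is not the real kube-scheduler algorithm.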

class Node:
    def __init__(self, name: str, cpu: int, memory: int):
        self.name = name
        self.cpu = cpu
        self.memory = memory
        self.cpu_used = 0
        self.memory_used = 0
        self.pods = []

    def available_cpu(self):
        return self.cpu - self.cpu_used

    def available_memory(self):
        return self.memory - self.memory_used

    def can_fit(self, cpu_req: int, mem_req: int) -> bool:
        return (self.available_cpu() >= cpu_req
                and self.available_memory() >= mem_req)

    def schedule(self, pod):
        self.cpu_used += pod.cpu_request
        self.memory_used += pod.memory_request
        self.pods.append(pod)

    def get_preemption_candidates(self, needed_cpu: int,
                                   needed_mem: int,
                                   min_priority: int) -> list:
        candidates = [p for p in self.pods
                      if p.priority < min_priority]
        candidates.sort(key=lambda p: p.priority)

        freed_cpu = 0
        freed_mem = 0
        victims = []
        for p in candidates:
            freed_cpu += p.cpu_request
            freed_mem += p.memory_request
            victims.append(p)
            if freed_cpu >= needed_cpu and freed_mem >= needed_mem:
                break

        # Only report victims if they free enough CPU *and* memory
        if freed_cpu >= needed_cpu and freed_mem >= needed_mem:
            return victims
        return []

class Pod:
    def __init__(self, name: str, cpu: int, mem: int, priority: int):
        self.name = name
        self.cpu_request = cpu
        self.memory_request = mem
        self.priority = priority
        self.node = None

class SchedulerWithPreemption:
    def __init__(self, nodes: list):
        self.nodes = nodes

    def schedule(self, pod: Pod) -> str:
        for node in self.nodes:
            if node.can_fit(pod.cpu_request, pod.memory_request):
                node.schedule(pod)
                pod.node = node
                return f"Scheduled on {node.name}"

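        # Try preemption: only the shortfall beyond each node's free
        # capacity has to be reclaimed from lower-priority pods.
        for node in sorted(self.nodes,
                           key=lambda n: n.available_cpu(),
                           reverse=True):
            needed_cpu = max(0, pod.cpu_request - node.available_cpu())
            needed_mem = max(0, pod.memory_request - node.available_memory())
            victims = node.get_preemption_candidates(
                needed_cpu, needed_mem, pod.priority
            )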
            if victims:
                for v in victims:
                    print(f"  Preempting {v.name} (pri {v.priority}) "
                          f"from {node.name}")
                    node.cpu_used -= v.cpu_request
                    node.memory_used -= v.memory_request
                    node.pods.remove(v)

                node.schedule(pod)
                pod.node = node
                return f"Scheduled on {node.name} (after preemption)"

        return "Pending (no resources available)"

# Two small nodes: each one fits exactly one 4-CPU / 8192 MiB pod,
# so the cluster is full after the first two batch pods.
nodes = [
    Node("node-a", 4, 8192),
    Node("node-b", 4, 8192),
]
scheduler = SchedulerWithPreemption(nodes)
pods = [
    Pod("batch-1", 4, 8192, 1000),
    Pod("batch-2", 4, 8192, 1000),
    Pod("batch-3", 4, 8192, 1000),
    Pod("critical", 4, 8192, 10000),
]

for p in pods:
    result = scheduler.schedule(p)
    print(f"{p.name:>12} (pri {p.priority:>5}): {result}")

Expected output:

     batch-1 (pri  1000): Scheduled on node-a
     batch-2 (pri  1000): Scheduled on node-b
     batch-3 (pri  1000): Pending (no resources available)
  Preempting batch-1 (pri 1000) from node-a
    critical (pri 10000): Scheduled on node-a (after preemption)

Non-Preempting Priority

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-non-preempting
value: 50000
preemptionPolicy: Never  # Don't preempt, just get priority in queues
description: "High priority but will not evict running pods"
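
To make the effect of preemptionPolicy: Never concrete, here is a minimal toy model of a single node. It is a sketch, not the real scheduler: the names PendingPod and place are hypothetical, and only CPU is considered. It shows that a non-preempting pod still sorts ahead of lower-priority pods in the queue, but waits for capacity instead of evicting anything.

```python
from dataclasses import dataclass


@dataclass
class PendingPod:
    name: str
    cpu: int
    priority: int
    # "PreemptLowerPriority" is the Kubernetes default policy
    preemption_policy: str = "PreemptLowerPriority"


def place(pod, free_cpu, running):
    """Toy single-node placement; `running` holds (name, cpu, priority) tuples."""
    if pod.cpu <= free_cpu:
        return "scheduled"
    if pod.preemption_policy == "Never":
        return "pending (waits for capacity, evicts nothing)"
    # Only strictly lower-priority pods are eligible victims
    reclaimable = sum(cpu for _, cpu, prio in running if prio < pod.priority)
    if free_cpu + reclaimable >= pod.cpu:
        return "scheduled after preemption"
    return "pending"


running = [("batch-1", 4, 1000)]
queue = [
    PendingPod("dev-job", 2, 100),
    PendingPod("etl", 4, 50000, preemption_policy="Never"),
    PendingPod("api", 4, 10000),
]
# Higher priority is considered first, regardless of preemption policy
queue.sort(key=lambda p: p.priority, reverse=True)

for pod in queue:
    print(f"{pod.name}: {place(pod, free_cpu=0, running=running)}")
```

In this sketch the etl pod is first in the queue yet stays pending on a full node, while the lower-priority api pod is allowed to preempt batch-1.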

Priority Inversion Prevention

Priority inversion occurs when a high-priority pod waits for a resource held by a low-priority pod. Kubernetes mitigates this through:

  1. Preemption: high-priority pods can evict low-priority ones
  2. Priority-based scheduling queue: pods are scheduled in priority order
  3. Pod Disruption Budgets: limit how many pods can be evicted simultaneously (the scheduler honors them on a best-effort basis during preemption)

# Prevent excessive preemption with PDB
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: batch-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: batch-worker
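
Kubernetes documents that PDBs are respected only on a best-effort basis during preemption: the scheduler prefers victims whose eviction keeps every budget satisfied, but will still preempt if no such victims exist. The sketch below models that two-pass idea for CPU only. The function pick_victims and its dictionary shapes are hypothetical, chosen for illustration.

```python
def pick_victims(candidates, needed_cpu, healthy, min_available):
    """Return victim names, preferring pods whose eviction keeps PDBs satisfied.

    candidates:    dicts with "name", "app", "cpu", "priority"
    healthy:       app -> number of currently healthy replicas
    min_available: app -> PDB minAvailable (apps without a PDB are absent)
    """
    remaining = dict(healthy)
    chosen, freed = [], 0

    def violates(pod):
        floor = min_available.get(pod["app"])
        return floor is not None and remaining.get(pod["app"], 0) - 1 < floor

    # Pass 1 skips victims that would break a PDB;
    # pass 2 (best effort) accepts them if capacity is still short.
    for allow_violation in (False, True):
        for pod in sorted(candidates, key=lambda p: p["priority"]):
            if freed >= needed_cpu:
                return chosen
            if pod["name"] in chosen:
                continue
            if violates(pod) and not allow_violation:
                continue
            chosen.append(pod["name"])
            freed += pod["cpu"]
            remaining[pod["app"]] = remaining.get(pod["app"], 0) - 1
    return chosen if freed >= needed_cpu else []


candidates = [
    {"name": "batch-worker-0", "app": "batch-worker", "cpu": 2, "priority": 1000},
    {"name": "batch-worker-1", "app": "batch-worker", "cpu": 2, "priority": 1000},
    {"name": "batch-worker-2", "app": "batch-worker", "cpu": 2, "priority": 1000},
    {"name": "dev-shell", "app": "dev", "cpu": 2, "priority": 100},
]
healthy = {"batch-worker": 3, "dev": 1}
min_available = {"batch-worker": 2}  # mirrors the batch-pdb above

print(pick_victims(candidates, 4, healthy, min_available))
print(pick_victims(candidates, 6, healthy, min_available))
```

With 4 CPUs needed, the budget holds; with 6 CPUs needed, the second call has to take a second batch worker and drops below minAvailable, which is the best-effort behavior to plan for.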

Priority and Resource Quotas

apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "40"
    requests.memory: 80Gi
    limits.cpu: "80"
    limits.memory: 160Gi
  scopeSelector:
    matchExpressions:
      - operator: In
        scopeName: PriorityClass
        values:
          - production
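
The scopeSelector means this quota only tracks pods whose priority class is listed in values; other pods in the namespace are not counted against it. A minimal sketch of that accounting, for CPU requests only, might look like the following. The function quota_allows and the pod dictionaries are hypothetical and do not mirror the real admission controller.

```python
def quota_allows(new_pod, existing_pods, hard_cpu, scoped_classes):
    """Return True if admitting new_pod keeps scoped CPU requests within hard_cpu."""
    if new_pod.get("priorityClassName") not in scoped_classes:
        return True  # the scoped quota does not track this pod at all
    used = sum(
        p["cpu"] for p in existing_pods
        if p.get("priorityClassName") in scoped_classes
    )
    return used + new_pod["cpu"] <= hard_cpu


existing = [
    {"name": "web-1", "priorityClassName": "production", "cpu": 30},
    {"name": "etl-1", "priorityClassName": "batch", "cpu": 50},
]
scoped = {"production"}

# Fits: 30 + 10 <= 40
print(quota_allows({"priorityClassName": "production", "cpu": 10}, existing, 40, scoped))
# Rejected: 30 + 11 > 40
print(quota_allows({"priorityClassName": "production", "cpu": 11}, existing, 40, scoped))
# Not tracked by this quota, so it is not limited here
print(quota_allows({"priorityClassName": "batch", "cpu": 100}, existing, 40, scoped))
```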

Monitoring Preemption Events

# Watch for preemption events
kubectl get events --field-selector reason=Preempted

# Or watch all scheduling events
kubectl get events --field-selector involvedObject.kind=Pod \
  --sort-by=.lastTimestamp

Expected output:

LAST SEEN   TYPE      REASON      OBJECT                  MESSAGE
2m          Normal    Preempted   pod/batch-worker-5      Preempted by critical-app-1

from kubernetes import client, config

class PreemptionMonitor:
    def __init__(self):
        # Load credentials from the local kubeconfig file
        config.load_kube_config()
        self.v1 = client.CoreV1Api()

    def get_preemption_events(self, namespace: str = None) -> list:
        field_selector = "reason=Preempted"
        if namespace:
            events = self.v1.list_namespaced_event(
                namespace, field_selector=field_selector
            )
        else:
            events = self.v1.list_event_for_all_namespaces(
                field_selector=field_selector
            )

        preemptions = []
        for event in events.items:
            preemptions.append({
                "time": event.last_timestamp,
                "preempted_pod": event.involved_object.name,
                "message": event.message,
                "namespace": event.involved_object.namespace,
            })
        return preemptions

    def report(self):
        events = self.get_preemption_events()
        print(f"Found {len(events)} preemption events:")
        for e in events:
            message = e["message"] or ""  # event messages can be empty
            print(f"  [{e['time']}] {e['preempted_pod']} in "
                  f"{e['namespace']}: {message[:80]}")

Common Mistakes

1. Using Default Priority (0) for Critical System Pods

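Pods with the default priority (0) can be preempted by any pod with a higher priority. Critical system components should use a very high value: Kubernetes ships the built-in system-cluster-critical (2000000000) and system-node-critical (2000001000) classes for this purpose, and user-defined classes can go up to 1000000000.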

2. Setting Priority Too Close Together

Priorities of 1000, 1001, 1002 are hard to read as a hierarchy and leave no room to insert a tier later. Use logarithmic spacing: 100, 1000, 10000, 100000, 1000000. Leave gaps for future tiers.

3. Forgetting globalDefault

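Setting globalDefault: true on a PriorityClass makes it the default for every pod created without a priorityClassName, in every namespace. Only one PriorityClass can be the global default, and pods that already exist keep the priority they were admitted with. Use with caution.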

4. No PodDisruptionBudget for Preemptable Workloads

Without a PDB, a wave of high-priority pods can preempt every batch worker, causing complete job failure. Set minAvailable or maxUnavailable on batch workloads, and remember that the scheduler honors PDBs only on a best-effort basis during preemption.

5. PreemptionPolicy: Never Without Understanding

Setting preemptionPolicy: Never means the pod won't preempt others but can still be preempted by higher-priority pods. The priority value still places the pod ahead of lower-priority pods in the scheduling queue; it just never triggers evictions. This is useful for workloads that should be scheduled early but shouldn't displace running pods.

6. Ignoring Preemption's Impact on Running Workloads

Preemption terminates pods, which may be mid-transaction. Critical preemptable workloads should handle SIGTERM gracefully with database transaction rollback or checkpointing.

7. Priority Without Monitoring

Without monitoring preemption events, you won't know that batch jobs are being preempted. Set up alerts on Preempted events to detect excessive preemption.
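
As a starting point for such an alert, the sketch below counts preemption events per namespace and flags the ones above a threshold. It assumes events shaped like the dictionaries returned by get_preemption_events above; the function name namespaces_over_threshold and the threshold value are hypothetical.

```python
from collections import Counter


def namespaces_over_threshold(events, threshold):
    """Return {namespace: count} for namespaces with more than `threshold` preemptions."""
    counts = Counter(e["namespace"] for e in events)
    return {ns: n for ns, n in counts.items() if n > threshold}


sample_events = [
    {"namespace": "analytics", "preempted_pod": "batch-worker-1"},
    {"namespace": "analytics", "preempted_pod": "batch-worker-2"},
    {"namespace": "analytics", "preempted_pod": "batch-worker-5"},
    {"namespace": "dev", "preempted_pod": "sandbox-0"},
]

for ns, count in namespaces_over_threshold(sample_events, threshold=2).items():
    print(f"ALERT: {count} preemptions in namespace {ns}")
```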

Practice Questions

1. How does Kubernetes preemption work?

When a high-priority pod can't be scheduled, the scheduler finds nodes with lower-priority pods whose resources, when freed, would fit the high-priority pod. It selects victims and deletes them (their containers receive SIGTERM), then schedules the high-priority pod once the resources are freed.
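
When several nodes could host the pod after preemption, the scheduler has to pick one. The sketch below shows one plausible ranking, assumed here for illustration rather than taken from this guide: prefer fewer PDB violations, then a lower highest-priority victim, then a lower priority sum, then fewer victims. The function pick_node and the victim dictionaries are hypothetical.

```python
def pick_node(candidates):
    """Pick the node whose victim set is cheapest to evict.

    candidates: node name -> non-empty list of victim dicts
                with "priority" (int) and "violates_pdb" (bool).
    """
    def cost(item):
        _, victims = item
        return (
            sum(1 for v in victims if v["violates_pdb"]),  # fewest PDB violations
            max(v["priority"] for v in victims),           # least important top victim
            sum(v["priority"] for v in victims),           # lowest total priority
            len(victims),                                  # fewest evictions
        )

    name, _ = min(candidates.items(), key=cost)
    return name


candidates = {
    "node-a": [{"priority": 1000, "violates_pdb": False}],
    "node-b": [
        {"priority": 100, "violates_pdb": False},
        {"priority": 100, "violates_pdb": False},
    ],
    "node-c": [{"priority": 100, "violates_pdb": True}],
}
print(pick_node(candidates))
```

Here node-b wins: it requires two evictions, but both victims are less important than the single victim on node-a, and no budget is violated as it would be on node-c.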

2. What is the difference between preemption and disruption budgets?

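Preemption is an active eviction of pods to make room. Disruption budgets (PDBs) protect workloads from being fully disrupted by limiting how many pods of a service can be down simultaneously. During preemption the scheduler tries to respect PDBs, but only on a best-effort basis: if no other victims are available, it can still preempt pods and violate the budget.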

3. What happens to preempted pods?

Preempted pods receive a SIGTERM and have terminationGracePeriodSeconds to shut down gracefully. They enter the Terminating state and are not rescheduled themselves; if they belong to a Deployment, StatefulSet, or Job, the controller creates replacement pods.

4. Can a pod be preempted by a pod with the same priority?

No. The scheduler only preempts pods with strictly lower priority. Pods with equal priority are not preempted by each other, which keeps pods in the same tier from evicting one another in a loop.

5. Challenge: Design a priority strategy for a multi-tenant SaaS platform.

The platform hosts customer workloads, internal CI/CD, monitoring, batch analytics, and development environments. Each has different criticality. Customer workloads must never be preempted, but batch jobs can be. Design a priority class hierarchy (5+ levels) with preemption policies and PDBs that ensure:

  • Customer pods are never preempted
  • CI/CD pipelines complete within SLA
  • Batch jobs fill all remaining capacity but can be preempted
  • Monitoring tools always run

Mini Project: Priority-Aware Scheduler

class PriorityScheduler:
    def __init__(self, nodes: list, priority_classes: dict):
        self.nodes = nodes
        self.priority_classes = priority_classes
        self.queue = []

    def submit_pod(self, pod: dict):
        name = pod["name"]
        priority_name = pod.get("priorityClassName", "default")
        priority = self.priority_classes.get(priority_name, 0)
        self.queue.append({
            "name": name,
            "cpu": pod["cpu"],
            "mem": pod["mem"],
            "priority": priority
        })
        self.queue.sort(key=lambda p: p["priority"], reverse=True)

    def schedule_all(self) -> list:
        results = []
        while self.queue:
            pod = self.queue.pop(0)
            best_node = None

            for node in self.nodes:
                if node.can_fit(pod["cpu"], pod["mem"]):
                    if not best_node or node.available_cpu() < best_node.available_cpu():
                        best_node = node

            if best_node:
                # Reuses the Node and Pod classes from the preemption simulation
                best_node.schedule(Pod(pod["name"], pod["cpu"],
                                       pod["mem"], pod["priority"]))
                results.append((pod["name"], best_node.name, "scheduled"))
            else:
                results.append((pod["name"], "none", "pending"))
        return results

nodes = [Node("node-a", 4, 8192), Node("node-b", 4, 8192)]
sched = PriorityScheduler(nodes,
    {"critical": 10000, "production": 1000, "batch": 100, "default": 0})

for i in range(6):
    sched.submit_pod({"name": f"batch-{i}", "cpu": 2, "mem": 2048,
                       "priorityClassName": "batch"})
sched.submit_pod({"name": "critical-web", "cpu": 2, "mem": 2048,
                   "priorityClassName": "critical"})

results = sched.schedule_all()
for name, node, status in results:
    print(f"{name:>15} -> {node:>8} ({status})")

Expected output:

   critical-web ->   node-a (scheduled)
        batch-0 ->   node-a (scheduled)
        batch-1 ->   node-b (scheduled)
        batch-2 ->   node-b (scheduled)
        batch-3 ->     none (pending)
        batch-4 ->     none (pending)
        batch-5 ->     none (pending)

FAQ

What is the default priority of a pod without a priorityClassName?

The default priority is 0 if no globalDefault PriorityClass exists. If a PriorityClass with globalDefault: true exists, pods without explicit priorityClassName get that priority value.
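
The resolution rule can be sketched in a few lines. The function resolve_priority and the dictionary layout are hypothetical; the rejection of unknown class names follows the documented behavior that a pod naming a nonexistent PriorityClass is refused at admission.

```python
def resolve_priority(priority_class_name, priority_classes):
    """Return the effective priority for a new pod.

    priority_classes: name -> {"value": int, "globalDefault": bool}
    """
    if priority_class_name:
        if priority_class_name not in priority_classes:
            # Kubernetes rejects pods that reference an unknown PriorityClass
            raise ValueError(f"unknown PriorityClass: {priority_class_name}")
        return priority_classes[priority_class_name]["value"]
    for pc in priority_classes.values():
        if pc.get("globalDefault"):
            return pc["value"]
    return 0  # no explicit class and no global default


classes = {
    "production": {"value": 10000, "globalDefault": False},
    "standard": {"value": 500, "globalDefault": True},
}
print(resolve_priority("production", classes))
print(resolve_priority(None, classes))
print(resolve_priority(None, {"production": classes["production"]}))
```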

Can preemption be disabled entirely?

Yes. Set preemptionPolicy: Never on all PriorityClass definitions, or disable the DefaultPreemption plugin in the kube-scheduler configuration. But without preemption, high-priority pods remain Pending when the cluster is full.

How does priority work with cluster autoscaling?

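The cluster autoscaler (CA) considers pod priority when deciding whether to scale up. A pending pod triggers scale-up unless its priority falls below the CA's expendable cutoff (the --expendable-pods-priority-cutoff flag, -10 by default); pods below that cutoff neither trigger scale-up nor block scale-down. Evicting lower-priority pods to make room for higher-priority ones is the scheduler's job (preemption), not the CA's.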
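
A small sketch of how a priority cutoff could filter which pending pods justify a new node. It assumes the Cluster Autoscaler's documented expendable-pods cutoff with its default of -10; the function pods_triggering_scale_up is hypothetical and ignores everything else the autoscaler checks.

```python
def pods_triggering_scale_up(pending_pods, cutoff=-10):
    """Return names of pending pods whose priority is at or above the cutoff."""
    return [p["name"] for p in pending_pods if p["priority"] >= cutoff]


pending = [
    {"name": "critical-web", "priority": 10000},
    {"name": "default-app", "priority": 0},
    {"name": "best-effort-filler", "priority": -100},
]
print(pods_triggering_scale_up(pending))
```

Pods like best-effort-filler simply use whatever capacity is left over and are the first to lose it.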

What's Next

Kubernetes StatefulSets Guide
Kubernetes Scheduling Guide
Kubernetes HPA & VPA Guide

Congratulations on completing this priority and preemption guide! Here's where to go from here:

  • Practice daily — Define priority classes and observe scheduling behavior
  • Build a project — Set up preemption monitoring and alerts
  • Explore related topics — Pod Disruption Budgets, cluster autoscaler priority, descheduler
  • Join the community — Share your priority strategies and get feedback

Remember: every expert was once a beginner. Keep prioritizing!
