Kubernetes Priority & Preemption — Critical Workloads First
In this tutorial, you'll learn about Kubernetes Priority & Preemption. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.
Kubernetes priority and preemption ensure critical pods are scheduled before lower-priority workloads, potentially preempting (evicting) lower-priority pods to free resources.
What You'll Learn
You'll master priority classes and preemption — defining PriorityClass objects, assigning priorities to pods, understanding the preemption algorithm, handling graceful termination of preempted pods, and production configuration strategies.
Why This Problem Matters
Without priority, all pods compete equally for resources. A burst of batch jobs can prevent critical system pods (DNS, networking, monitoring) from running. Priority classes ensure that when the cluster runs short of capacity, the most important workloads are scheduled first.
Real-World Use
Doda Browser's Kubernetes cluster uses priority classes across four tiers: cluster-critical (1000000000) for DNS and networking, production (10000) for user-facing services, batch (1000) for analytics jobs, and low-priority (100) for development workloads.
Priority and Preemption Flow
flowchart TB
    High[High Priority Pod] --> Schedule{Can schedule?}
    Schedule -->|Yes| Run[Pod runs normally]
    Schedule -->|No resources| Preempt{Can preempt?}
    subgraph PreemptionProcess
        Preempt --> Find[Find nodes with lower priority pods]
        Find --> Select[Select node with best victim]
        Select --> Evict[Send SIGTERM to lower priority pods]
        Evict --> Free[Resources freed]
        Free --> Bind[High priority pod scheduled]
    end
    Preempt -->|No preemption possible| Pending[Pod stays Pending]
    subgraph GracefulShutdown
        Evict --> Terminate[Preempted pods terminationGracePeriodSeconds]
        Terminate --> Cleanup[Cleanup and exit]
    end
PriorityClass Configuration
# priority-classes.yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: cluster-critical
value: 1000000000
globalDefault: false
description: "Critical cluster infrastructure (DNS, networking)"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production
value: 10000
globalDefault: false
description: "Production user-facing services"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch
value: 1000
globalDefault: false
description: "Batch processing and analytics"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 100
globalDefault: false
description: "Development and testing workloads"
kubectl apply -f priority-classes.yaml
kubectl get priorityclass
Expected output:
NAME               VALUE        GLOBAL-DEFAULT   AGE
cluster-critical   1000000000   false            10s
production         10000        false            10s
batch              1000         false            10s
low-priority       100          false            10s
Assigning Priority to Pods
apiVersion: v1
kind: Pod
metadata:
  name: critical-dns
spec:
  priorityClassName: cluster-critical
  containers:
  - name: coredns
    image: coredns/coredns:latest
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  selector:              # required; must match the template labels
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      priorityClassName: production
      containers:
      - name: app
        image: nginx
---
apiVersion: batch/v1
kind: Job
metadata:
  name: data-analysis
spec:
  template:
    spec:
      priorityClassName: batch
      containers:
      - name: job
        image: python:3.12
        command: ["python", "analyze.py"]
      restartPolicy: Never
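How a priorityClassName turns into an integer priority can be sketched in a few lines. This is a simplified model, not the admission controller's code: PriorityClassSketch and resolve_priority are hypothetical names, and the class values are the ones from priority-classes.yaml above. It assumes the documented behaviour that a named class supplies the value, an unknown name gets the pod rejected, and a pod without a name falls back to the globalDefault class or to 0.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PriorityClassSketch:
    """Hypothetical stand-in for a PriorityClass object."""
    name: str
    value: int
    global_default: bool = False


def resolve_priority(priority_class_name: Optional[str],
                     classes: list) -> int:
    """Return the integer priority a pod would be given at admission."""
    if priority_class_name is not None:
        for pc in classes:
            if pc.name == priority_class_name:
                return pc.value
        # An unknown class name means the pod is rejected.
        raise LookupError(f"no PriorityClass named {priority_class_name!r}")
    # No class named: fall back to the globalDefault class, if one exists.
    for pc in classes:
        if pc.global_default:
            return pc.value
    return 0  # no name given and no global default defined


classes = [
    PriorityClassSketch("cluster-critical", 1_000_000_000),
    PriorityClassSketch("production", 10_000),
    PriorityClassSketch("batch", 1_000),
    PriorityClassSketch("low-priority", 100),
]

print(resolve_priority("production", classes))  # 10000
print(resolve_priority(None, classes))          # 0
```

Because none of the four classes sets globalDefault, a pod that omits priorityClassName ends up at priority 0, below every tier.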
Preemption Simulation
class Pod:
    def __init__(self, name: str, cpu: int, mem: int, priority: int):
        self.name = name
        self.cpu_request = cpu
        self.memory_request = mem
        self.priority = priority
        self.node = None


class Node:
    def __init__(self, name: str, cpu: int, memory: int):
        self.name = name
        self.cpu = cpu
        self.memory = memory
        self.cpu_used = 0
        self.memory_used = 0
        self.pods = []

    def available_cpu(self):
        return self.cpu - self.cpu_used

    def available_memory(self):
        return self.memory - self.memory_used

    def can_fit(self, cpu_req: int, mem_req: int) -> bool:
        return (self.available_cpu() >= cpu_req
                and self.available_memory() >= mem_req)

    def schedule(self, pod):
        self.cpu_used += pod.cpu_request
        self.memory_used += pod.memory_request
        self.pods.append(pod)

    def get_preemption_candidates(self, needed_cpu: int,
                                  needed_mem: int,
                                  min_priority: int) -> list:
        # Only pods with strictly lower priority may be evicted
        candidates = [p for p in self.pods
                      if p.priority < min_priority]
        candidates.sort(key=lambda p: p.priority)
        # Start from what is already free on the node
        freed_cpu = self.available_cpu()
        freed_mem = self.available_memory()
        victims = []
        for p in candidates:
            freed_cpu += p.cpu_request
            freed_mem += p.memory_request
            victims.append(p)
            if freed_cpu >= needed_cpu and freed_mem >= needed_mem:
                return victims
        return []  # evicting every candidate would still not be enough


class SchedulerWithPreemption:
    def __init__(self, nodes: list):
        self.nodes = nodes

    def schedule(self, pod: Pod) -> str:
        # First pass: place the pod on the first node with enough room
        for node in self.nodes:
            if node.can_fit(pod.cpu_request, pod.memory_request):
                node.schedule(pod)
                pod.node = node
                return f"Scheduled on {node.name}"
        # Try preemption, starting with the node that has the most free CPU
        for node in sorted(self.nodes,
                           key=lambda n: n.available_cpu(),
                           reverse=True):
            victims = node.get_preemption_candidates(
                pod.cpu_request, pod.memory_request, pod.priority
            )
            if victims:
                for v in victims:
                    print(f"  Preempting {v.name} (pri {v.priority}) "
                          f"from {node.name}")
                    node.cpu_used -= v.cpu_request
                    node.memory_used -= v.memory_request
                    node.pods.remove(v)
                    v.node = None
                node.schedule(pod)
                pod.node = node
                return f"Scheduled on {node.name} (after preemption)"
        return "Pending (no resources available)"


# Each node holds exactly one 4-CPU / 8192-MiB pod
nodes = [
    Node("node-a", 4, 8192),
    Node("node-b", 4, 8192),
]
scheduler = SchedulerWithPreemption(nodes)
pods = [
    Pod("batch-1", 4, 8192, 1000),
    Pod("batch-2", 4, 8192, 1000),
    Pod("batch-3", 4, 8192, 1000),
    Pod("critical", 4, 8192, 10000),
]
for p in pods:
    result = scheduler.schedule(p)
    print(f"{p.name:>12} (pri {p.priority:>5}): {result}")
Expected output:
     batch-1 (pri  1000): Scheduled on node-a
     batch-2 (pri  1000): Scheduled on node-b
     batch-3 (pri  1000): Pending (no resources available)
  Preempting batch-1 (pri 1000) from node-a
    critical (pri 10000): Scheduled on node-a (after preemption)
Non-Preempting Priority
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-non-preempting
value: 50000
preemptionPolicy: Never  # Don't preempt, just get priority in queues
description: "High priority but will not evict running pods"
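The effect of preemptionPolicy on eviction decisions can be reduced to a single predicate. The sketch below is a simplification for illustration; may_preempt is a hypothetical helper, and "PreemptLowerPriority" is the default policy value that applies when the field is omitted.

```python
def may_preempt(preemptor_priority: int, preemption_policy: str,
                victim_priority: int) -> bool:
    """Simplified rule: may the pending pod evict this running pod?"""
    if preemption_policy == "Never":
        return False  # non-preempting pods wait for capacity instead
    # Only strictly lower-priority pods are eligible victims
    return victim_priority < preemptor_priority


# high-non-preempting (50000, Never) cannot evict a batch pod (1000)...
print(may_preempt(50000, "Never", 1000))                          # False
# ...but a preempting pod with a higher priority can still evict it.
print(may_preempt(1_000_000_000, "PreemptLowerPriority", 50000))  # True
```

The second call shows that a non-preempting class offers no protection: pods in it remain eligible victims for anything with a higher priority.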
Priority Inversion Prevention
Priority inversion occurs when a high-priority pod waits for a resource held by a low-priority pod. Kubernetes mitigates this through:
- Preemption: high-priority pods can evict low-priority ones
- Priority-based scheduling queue: pending pods are ordered by priority, so higher-priority pods are tried first
- Pod Disruption Budgets: the scheduler tries to pick victims without violating a PDB, but only on a best-effort basis, so a PDB limits preemption without guaranteeing protection from it
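The priority-ordered queue from the list above can be sketched with a heap. This is an illustrative model rather than the scheduler's implementation: PendingQueue is a hypothetical class, and it assumes pods with equal priority are served in arrival order.

```python
import heapq
import itertools


class PendingQueue:
    """Pending pods ordered by priority, then by arrival order."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()

    def push(self, name: str, priority: int) -> None:
        # heapq is a min-heap, so negate the priority; the arrival
        # counter keeps first-in, first-out order among equal priorities.
        heapq.heappush(self._heap, (-priority, next(self._arrival), name))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


queue = PendingQueue()
queue.push("batch-1", 1000)
queue.push("web-1", 10000)
queue.push("batch-2", 1000)
queue.push("dns", 1_000_000_000)

order = [queue.pop() for _ in range(len(queue))]
print(order)  # ['dns', 'web-1', 'batch-1', 'batch-2']
```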
# Limit preemption with a PDB (honored by the scheduler on a best-effort basis)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: batch-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: batch-worker
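To see what minAvailable: 2 means for victim selection, the arithmetic can be sketched as follows. The helper names are hypothetical and the model is deliberately simplified: it only splits candidate victims into those that fit within the budget and those whose eviction would violate it. The scheduler prefers the first group, but because PDB support during preemption is best effort, it may still evict from the second group when no other option exists.

```python
def allowed_disruptions(healthy: int, min_available: int) -> int:
    """Number of pods that can be evicted without breaking the budget."""
    return max(0, healthy - min_available)


def split_victims(candidates: list, healthy: int, min_available: int):
    """Split candidates into (within_budget, would_violate_budget)."""
    budget = allowed_disruptions(healthy, min_available)
    return candidates[:budget], candidates[budget:]


workers = ["batch-worker-1", "batch-worker-2", "batch-worker-3"]
within, violating = split_victims(workers, healthy=3, min_available=2)
print(within)     # ['batch-worker-1']
print(violating)  # ['batch-worker-2', 'batch-worker-3']
```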
Priority and Resource Quotas
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "40"
    requests.memory: 80Gi
    limits.cpu: "80"
    limits.memory: 160Gi
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values:
      - production
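A quota scoped by PriorityClass counts only pods whose priorityClassName is in the listed values. The sketch below models that accounting under simplifying assumptions: pods are plain dicts, memory is expressed in whole GiB, only the requests limits from the manifest above are checked, and scoped_usage and admits are hypothetical helpers.

```python
def scoped_usage(pods: list, scoped_classes: set) -> dict:
    """Sum the requests of pods that the scoped quota applies to."""
    matching = [p for p in pods
                if p.get("priorityClassName") in scoped_classes]
    return {
        "requests.cpu": sum(p["cpu"] for p in matching),
        "requests.memory": sum(p["memory_gi"] for p in matching),
    }


def admits(new_pod: dict, pods: list, hard: dict,
           scoped_classes: set) -> bool:
    """Would the quota allow this pod to be created?"""
    if new_pod.get("priorityClassName") not in scoped_classes:
        return True  # the quota does not apply to this pod
    usage = scoped_usage(pods, scoped_classes)
    return (usage["requests.cpu"] + new_pod["cpu"] <= hard["requests.cpu"]
            and usage["requests.memory"] + new_pod["memory_gi"]
            <= hard["requests.memory"])


hard = {"requests.cpu": 40, "requests.memory": 80}
running = [
    {"name": "web-1", "priorityClassName": "production",
     "cpu": 30, "memory_gi": 60},
    {"name": "job-1", "priorityClassName": "batch",
     "cpu": 20, "memory_gi": 40},
]
small = {"name": "web-2", "priorityClassName": "production",
         "cpu": 8, "memory_gi": 16}
large = {"name": "web-3", "priorityClassName": "production",
         "cpu": 12, "memory_gi": 16}

print(admits(small, running, hard, {"production"}))  # True
print(admits(large, running, hard, {"production"}))  # False
```

The batch pod job-1 does not count against the quota, which is why the small production pod still fits.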
Monitoring Preemption Events
# Watch for preemption events
kubectl get events --field-selector reason=Preempted
# Or watch all scheduling events
kubectl get events --field-selector involvedObject.kind=Pod \
  --sort-by=.lastTimestamp
Expected output:
LAST SEEN   TYPE     REASON      OBJECT               MESSAGE
2m          Normal   Preempted   pod/batch-worker-5   Preempted by critical-app-1
from typing import Optional

from kubernetes import client, config


class PreemptionMonitor:
    def __init__(self):
        # Reads the local kubeconfig; inside a cluster, use
        # config.load_incluster_config() instead.
        config.load_kube_config()
        self.v1 = client.CoreV1Api()

    def get_preemption_events(self, namespace: Optional[str] = None) -> list:
        field_selector = "reason=Preempted"
        if namespace:
            events = self.v1.list_namespaced_event(
                namespace, field_selector=field_selector
            )
        else:
            events = self.v1.list_event_for_all_namespaces(
                field_selector=field_selector
            )
        preemptions = []
        for event in events.items:
            preemptions.append({
                "time": event.last_timestamp,
                "preempted_pod": event.involved_object.name,
                "message": event.message,
                "namespace": event.involved_object.namespace,
            })
        return preemptions

    def report(self):
        events = self.get_preemption_events()
        print(f"Found {len(events)} preemption events:")
        for e in events:
            message = e["message"] or ""  # message can be unset on an event
            print(f"  [{e['time']}] {e['preempted_pod']} in "
                  f"{e['namespace']}: {message[:80]}...")
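To turn the collected events into an alert signal, one option is to count preemptions per namespace and flag those above a threshold. The sketch below works on dicts shaped like the ones get_preemption_events() returns; namespaces_over_threshold, the sample data, and the threshold value are assumptions for illustration.

```python
from collections import Counter


def namespaces_over_threshold(preemptions: list, threshold: int) -> dict:
    """Return {namespace: count} for namespaces with more than
    `threshold` preemption events."""
    counts = Counter(e["namespace"] for e in preemptions)
    return {ns: n for ns, n in counts.items() if n > threshold}


# Hypothetical sample in the shape produced by get_preemption_events()
sample = [
    {"namespace": "analytics", "preempted_pod": "batch-worker-1"},
    {"namespace": "analytics", "preempted_pod": "batch-worker-2"},
    {"namespace": "analytics", "preempted_pod": "batch-worker-5"},
    {"namespace": "dev", "preempted_pod": "sandbox-1"},
]

print(namespaces_over_threshold(sample, threshold=2))  # {'analytics': 3}
```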
Common Mistakes
1. Using Default Priority (0) for Critical System Pods
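Pods without a priority class get priority 0 (unless a globalDefault class exists) and can be preempted by any pod with a higher priority. Critical system components should have priority 1000000+ to ensure they always run. Kubernetes itself ships two built-in classes for this purpose: system-cluster-critical (2000000000) and system-node-critical (2000001000).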
2. Setting Priority Too Close Together
Priorities of 1000, 1001, 1002 create unclear hierarchy. Use logarithmic spacing: 100, 1000, 10000, 100000, 1000000. Leave gaps for future tiers.
3. Forgetting globalDefault
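Setting globalDefault: true on a PriorityClass makes it the default for every pod created without a priorityClassName, in all namespaces. Only one PriorityClass in the cluster can be the global default, and it applies only to pods created after it is set; existing pods keep their priority. Use with caution.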
4. No PodDisruptionBudget for Preemptable Workloads
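Without a PDB, a single high-priority pod can preempt all batch workers, causing complete job failure. Set minAvailable or maxUnavailable on batch workloads so the scheduler prefers other victims. Keep in mind that preemption honors PDBs only on a best-effort basis, so a PDB reduces the risk without eliminating it.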
5. PreemptionPolicy: Never Without Understanding
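Setting preemptionPolicy: Never means the pod won't preempt others, but it can still be preempted by higher-priority pods. The policy only changes what happens when the pod cannot be scheduled: it stays ahead of lower-priority pods in the scheduling queue and waits for capacity instead of evicting anything. This is useful for workloads that should be scheduled ahead of lower-priority work but shouldn't cause evictions.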
6. Ignoring Preemption's Impact on Running Workloads
Preemption terminates pods, which may be mid-transaction. Critical preemptable workloads should handle SIGTERM gracefully, either by rolling back open database transactions or by checkpointing their progress.
7. Priority Without Monitoring
Without monitoring preemption events, you won't know that batch jobs are being preempted. Set up alerts on Preempted events to detect excessive preemption.
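For mistake 6, a preemptable batch worker can react to SIGTERM by checkpointing before it exits. The following is a minimal sketch, assuming the work can be split into items and resumed from an index; run_batch, handle_sigterm, and the save_checkpoint callback are hypothetical names, and a real worker would persist the checkpoint somewhere durable within terminationGracePeriodSeconds.

```python
import signal
import threading

stop_requested = threading.Event()


def handle_sigterm(signum, frame):
    # Only record the request; the work loop decides when it is safe to stop.
    stop_requested.set()


def run_batch(items, process, save_checkpoint) -> bool:
    """Process items in order; after SIGTERM, checkpoint and return False."""
    for index, item in enumerate(items):
        if stop_requested.is_set():
            save_checkpoint(index)  # index of the first unprocessed item
            return False
        process(item)
    return True


if __name__ == "__main__":
    signal.signal(signal.SIGTERM, handle_sigterm)
    finished = run_batch(range(5), process=print,
                         save_checkpoint=lambda i: print("resume at", i))
    print("finished:", finished)
```

Checking the flag between items, rather than exiting inside the signal handler, keeps each item atomic: an item is either fully processed or not started.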
Practice Questions
1. How does Kubernetes preemption work?
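When a high-priority pod can't be scheduled, the scheduler looks for nodes where evicting lower-priority pods would free enough resources to fit it. It selects victims on one node, deletes them (they receive SIGTERM and their grace period), and records that node in the pending pod's status.nominatedNodeName. The pod is scheduled once the resources are free, usually but not necessarily on the nominated node.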
2. What is the difference between preemption and disruption budgets?
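Preemption is the scheduler actively evicting lower-priority pods to make room for a pending pod. Disruption budgets (PDBs) limit how many pods of a workload can be down simultaneously due to voluntary disruptions. During preemption the scheduler tries to respect PDBs, but only on a best-effort basis: if no other victims are available, it can still evict pods in violation of a PDB.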
3. What happens to preempted pods?
Preempted pods receive a SIGTERM and have terminationGracePeriodSeconds to shut down gracefully. They enter the Terminating state and are then deleted. A bare pod is not recreated; pods owned by a controller (Deployment, StatefulSet, or Job) are replaced by new pods, which go back through the scheduling queue.
4. Can a pod be preempted by a pod with the same priority?
No. The scheduler only preempts pods with strictly lower priority. Pods with equal priority are not preempted by each other. This prevents priority inversion at the same level.
5. Challenge: Design a priority strategy for a multi-tenant SaaS platform.
The platform hosts customer workloads, internal CI/CD, monitoring, batch analytics, and development environments. Each has different criticality. Customer workloads must never be preempted, but batch jobs can be. Design a priority class hierarchy (5+ levels) with preemption policies and PDBs that ensure:
- Customer pods are never preempted
- CI/CD pipelines complete within SLA
- Batch jobs fill all remaining capacity but can be preempted
- Monitoring tools always run
Mini Project: Priority-Aware Scheduler
# Reuses the Node and Pod classes from the preemption simulation above
class PriorityScheduler:
    def __init__(self, nodes: list, priority_classes: dict):
        self.nodes = nodes
        self.priority_classes = priority_classes
        self.queue = []

    def submit_pod(self, pod: dict):
        name = pod["name"]
        priority_name = pod.get("priorityClassName", "default")
        priority = self.priority_classes.get(priority_name, 0)
        self.queue.append({
            "name": name,
            "cpu": pod["cpu"],
            "mem": pod["mem"],
            "priority": priority
        })
        # Stable sort: highest priority first, submission order among equals
        self.queue.sort(key=lambda p: p["priority"], reverse=True)

    def schedule_all(self) -> list:
        results = []
        while self.queue:
            pod = self.queue.pop(0)
            best_node = None
            for node in self.nodes:
                if node.can_fit(pod["cpu"], pod["mem"]):
                    # Best fit: prefer the node with the least free CPU
                    if (best_node is None
                            or node.available_cpu() < best_node.available_cpu()):
                        best_node = node
            if best_node:
                best_node.schedule(Pod(pod["name"], pod["cpu"],
                                       pod["mem"], pod["priority"]))
                results.append((pod["name"], best_node.name, "scheduled"))
            else:
                results.append((pod["name"], "none", "pending"))
        return results


nodes = [Node("node-a", 4, 8192), Node("node-b", 4, 8192)]
sched = PriorityScheduler(nodes,
    {"critical": 10000, "production": 1000, "batch": 100, "default": 0})
for i in range(6):
    sched.submit_pod({"name": f"batch-{i}", "cpu": 2, "mem": 2048,
                      "priorityClassName": "batch"})
sched.submit_pod({"name": "critical-web", "cpu": 2, "mem": 2048,
                  "priorityClassName": "critical"})
results = sched.schedule_all()
for name, node, status in results:
    print(f"{name:>15} -> {node:>8} ({status})")
Expected output:
   critical-web ->   node-a (scheduled)
        batch-0 ->   node-a (scheduled)
        batch-1 ->   node-b (scheduled)
        batch-2 ->   node-b (scheduled)
        batch-3 ->     none (pending)
        batch-4 ->     none (pending)
        batch-5 ->     none (pending)
What's Next
Congratulations on completing this priority and preemption guide! Here's where to go from here:
- Practice daily — Define priority classes and observe scheduling behavior
- Build a project — Set up preemption monitoring and alerts
- Explore related topics — Pod Disruption Budgets, cluster autoscaler priority, descheduler
- Join the community — Share your priority strategies and get feedback
Remember: every expert was once a beginner. Keep prioritizing!
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro