Kubernetes Scheduling Guide — Pod Placement Control
This tutorial explains how the Kubernetes scheduler places pods. It covers the key concepts, practical examples, and best practices for controlling where your workloads run.
Kubernetes scheduling assigns pods to nodes based on resource requirements, constraints, and policies, optimizing cluster utilization while respecting placement rules.
What You'll Learn
You'll master Kubernetes scheduling — the scheduler workflow (filter, score, bind), node selectors, taints and tolerations, node and pod affinity/anti-affinity, topology spread constraints, and custom scheduler patterns.
Why This Problem Matters
The default scheduler places pods using generic filtering and scoring rules, with no knowledge of what your workloads actually need. Without explicit placement control, critical pods may land on under-provisioned nodes, latency-sensitive pods may be placed far apart, and resources may be fragmented across nodes.
Real-World Use
Doda Browser's infrastructure team uses taints and tolerations to isolate GPU nodes for ML inference workloads. Pod affinity keeps web servers co-located with their local Redis cache. Pod anti-affinity prevents two replicas of the same service from sharing a node.
Scheduler Workflow
flowchart TB
Pod[Unscheduled Pod] --> Queue[Scheduling Queue]
Queue --> Filter[Scheduling Filter]
subgraph FilteringPhase
F1[Node Unschedulable?]
F2[Resource Fit?]
F3[Taints Tolerated?]
F4[Node Selector Match?]
F5[Affinity Rules?]
end
Filter --> FilteringPhase
FilteringPhase -->|Yes| Scoring[Scoring Phase]
FilteringPhase -->|No| Skip[Node Skipped]
Scoring --> Score[Score Nodes]
Score --> Bind[Bind to Highest Score]
Bind --> Kubelet[Kubelet Starts Pod]
Node Selector
The simplest scheduling constraint:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  nodeSelector:
    gpu: "true"  # Only nodes with this label
  containers:
    - name: ml-worker
      image: tensorflow/tensorflow:latest-gpu
# Label a node
kubectl label node worker-2 gpu=true
# Check where the pod was scheduled
kubectl get pod gpu-pod -o wide
Expected output:
NAME READY STATUS RESTARTS AGE NODE
gpu-pod 1/1 Running 0 10s worker-2
Taints and Tolerations
Taints repel pods unless the pod has a matching toleration:
apiVersion: v1
kind: Pod
metadata:
  name: critical-app
spec:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "critical"
      effect: "NoSchedule"
  containers:
    - name: app
      image: nginx
# Taint a node for critical workloads only
kubectl taint nodes worker-3 dedicated=critical:NoSchedule
# Apply the pod (it tolerates the taint)
kubectl apply -f critical-app.yaml
# Check tolerations in pod spec
kubectl get pod critical-app -o jsonpath='{.spec.tolerations}' | jq
Expected output:
[
  {
    "key": "dedicated",
    "operator": "Equal",
    "value": "critical",
    "effect": "NoSchedule"
  }
]
Taint Effects
| Effect | Behavior |
|---|---|
| NoSchedule | Don't schedule new pods without toleration |
| PreferNoSchedule | Try not to schedule, but not enforced |
| NoExecute | Evict existing pods without toleration |
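The matching rule between a toleration and a taint can be sketched in a few lines of Python. This is a simplified illustration, not the scheduler's own code: the function names `tolerates` and `untolerated` are made up for this example, and taints and tolerations are represented as plain dicts. It assumes the documented rules that `Equal` is the default operator, that `Exists` ignores the value, and that an empty effect or an empty key (with `Exists`) acts as a wildcard.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration matches a single taint."""
    operator = toleration.get("operator", "Equal")  # Equal is the default
    key = toleration.get("key", "")
    effect = toleration.get("effect", "")

    # An empty effect on the toleration matches every taint effect.
    if effect and effect != taint["effect"]:
        return False
    # An empty key is only valid with Exists and then matches every key.
    if key == "":
        return operator == "Exists"
    if key != taint["key"]:
        return False
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def untolerated(taints: list, tolerations: list) -> list:
    """List the taints that none of the tolerations cover."""
    return [t for t in taints if not any(tolerates(tol, t) for tol in tolerations)]


taint = {"key": "dedicated", "value": "critical", "effect": "NoSchedule"}
equal_tol = {"key": "dedicated", "operator": "Equal",
             "value": "critical", "effect": "NoSchedule"}
exists_tol = {"key": "dedicated", "operator": "Exists"}
wrong_value = {"key": "dedicated", "operator": "Equal",
               "value": "batch", "effect": "NoSchedule"}

print(tolerates(equal_tol, taint))    # True
print(tolerates(exists_tol, taint))   # True
print(tolerates(wrong_value, taint))  # False
print(len(untolerated([taint], [])))  # 1 (a pod with no tolerations is repelled)
```

A pod is schedulable on a node only when `untolerated` returns an empty list for that node's NoSchedule taints.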
Node Affinity
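apiVersion: apps/v1
kind: Deployment
metadata:
  name: zone-aware-app
spec:
  replicas: 6
  selector:  # Required for apps/v1 Deployments
    matchLabels:
      app: zone-aware-app
  template:
    metadata:
      labels:
        app: zone-aware-app
    spec:
      affinity:
        nodeAffinity:
          # Hard rule: only nodes in these zones are eligible
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: topology.kubernetes.io/zone
                    operator: In
                    values:
                      - us-east-1a
                      - us-east-1b
          # Soft rule: prefer this instance type when available
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 50
              preference:
                matchExpressions:
                  - key: instance-type
                    operator: In
                    values:
                      - c5.large
      containers:
        - name: app
          image: nginx  # Placeholder image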
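The manifest above combines a required rule with a weighted preference. As a rough sketch of how such rules evaluate against node labels, the Python below treats the expressions within one term as ANDed, multiple terms as ORed, and sums the weights of matching preferences. The helper names (`matches_expression`, `matches_term`, `passes_required`, `preference_score`) and the three sample nodes are invented for illustration; the real scheduler implements this inside its node affinity plugin.

```python
def matches_expression(labels: dict, expr: dict) -> bool:
    """Evaluate one matchExpressions entry against a node's labels."""
    key, op = expr["key"], expr["operator"]
    values = expr.get("values", [])
    if op == "In":
        return key in labels and labels[key] in values
    if op == "NotIn":
        return key not in labels or labels[key] not in values
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    if op in ("Gt", "Lt"):
        # Gt and Lt compare the label value to a single integer
        if key not in labels:
            return False
        try:
            have, want = int(labels[key]), int(values[0])
        except (ValueError, IndexError):
            return False
        return have > want if op == "Gt" else have < want
    raise ValueError(f"unknown operator: {op}")


def matches_term(labels: dict, term: dict) -> bool:
    # All expressions inside one term must match (AND)
    return all(matches_expression(labels, e) for e in term.get("matchExpressions", []))


def passes_required(labels: dict, terms: list) -> bool:
    # Any one term is enough (OR)
    return any(matches_term(labels, t) for t in terms)


def preference_score(labels: dict, preferred: list) -> int:
    # Sum the weights of every preference the node satisfies
    return sum(p["weight"] for p in preferred if matches_term(labels, p["preference"]))


required = [{"matchExpressions": [
    {"key": "topology.kubernetes.io/zone", "operator": "In",
     "values": ["us-east-1a", "us-east-1b"]},
]}]
preferred = [{"weight": 50, "preference": {"matchExpressions": [
    {"key": "instance-type", "operator": "In", "values": ["c5.large"]},
]}}]

sample_nodes = {
    "node-1": {"topology.kubernetes.io/zone": "us-east-1a", "instance-type": "c5.large"},
    "node-2": {"topology.kubernetes.io/zone": "us-east-1b", "instance-type": "m5.large"},
    "node-3": {"topology.kubernetes.io/zone": "us-west-2a", "instance-type": "c5.large"},
}

for name, labels in sample_nodes.items():
    ok = passes_required(labels, required)
    score = preference_score(labels, preferred) if ok else None
    print(name, ok, score)
# node-1 True 50
# node-2 True 0
# node-3 False None
```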
Pod Affinity and Anti-Affinity
apiVersion: apps/v1
kind: Deployment
metadata:
  name: co-located-services
spec:
  replicas: 3
  selector:  # Required for apps/v1 Deployments
    matchLabels:
      app: co-located-services
  template:
    metadata:
      labels:
        app: co-located-services  # The anti-affinity rule below matches this label
    spec:
      affinity:
        # Pods prefer to be near the cache
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: redis-cache
                topologyKey: "kubernetes.io/hostname"
        # Pods must not share a node with each other
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: co-located-services
              topologyKey: "kubernetes.io/hostname"
      containers:
        - name: app
          image: nginx  # Placeholder image
Scheduler Logic Simulator
class Node:
    def __init__(self, name: str, cpu: int, memory: int, labels: dict):
        self.name = name
        self.cpu = cpu
        self.memory = memory
        self.cpu_used = 0
        self.memory_used = 0
        self.labels = labels
        self.taints = []

    def has_room(self, req_cpu: int, req_mem: int) -> bool:
        return (self.cpu_used + req_cpu <= self.cpu and
                self.memory_used + req_mem <= self.memory)

    def schedule(self, req_cpu: int, req_mem: int):
        self.cpu_used += req_cpu
        self.memory_used += req_mem

    def score(self, req_cpu: int, req_mem: int) -> float:
        # Higher score = more free capacity (the request itself is ignored)
        cpu_score = 1 - (self.cpu_used / self.cpu)
        mem_score = 1 - (self.memory_used / self.memory)
        return (cpu_score + mem_score) / 2


class Scheduler:
    def __init__(self, nodes: list):
        self.nodes = nodes

    def schedule_pod(self, pod: dict) -> str:
        # Filter: drop nodes that fail any hard constraint
        candidates = []
        for node in self.nodes:
            if not self._matches_node_selector(pod, node):
                continue
            if not self._tolerates_taints(pod, node):
                continue
            if not node.has_room(
                pod["cpu_request"], pod["memory_request"]
            ):
                continue
            candidates.append(node)
        if not candidates:
            return "Pending (no suitable node)"
        # Score (least-allocated): pick the node with the most free capacity.
        # On a tie, max() keeps the first candidate.
        best_node = max(
            candidates,
            key=lambda n: n.score(
                pod["cpu_request"], pod["memory_request"]
            )
        )
        # Bind
        best_node.schedule(pod["cpu_request"], pod["memory_request"])
        return best_node.name

    def _matches_node_selector(self, pod: dict, node: Node) -> bool:
        selector = pod.get("node_selector", {})
        return all(node.labels.get(k) == v for k, v in selector.items())

    def _tolerates_taints(self, pod: dict, node: Node) -> bool:
        # Simplified: only exact key/value/effect matches are supported
        tolerations = pod.get("tolerations", [])
        for taint in node.taints:
            tolerated = any(
                t["key"] == taint["key"]
                and t.get("value") == taint.get("value")
                and t.get("effect") == taint.get("effect")
                for t in tolerations
            )
            if not tolerated:
                return False
        return True


nodes = [
    Node("node-a", 4, 8192, {"zone": "us-east-1a", "gpu": "true"}),
    Node("node-b", 8, 16384, {"zone": "us-east-1a"}),
    Node("node-c", 4, 8192, {"zone": "us-east-1b"}),
]
nodes[0].taints = [{"key": "gpu", "value": "true", "effect": "NoSchedule"}]
sched = Scheduler(nodes)
pods = [
    {"cpu_request": 1, "memory_request": 1024, "tolerations": [
        {"key": "gpu", "value": "true", "effect": "NoSchedule"}
    ]},
    {"cpu_request": 2, "memory_request": 2048, "node_selector": {"zone": "us-east-1a"}},
    {"cpu_request": 1, "memory_request": 512},
]
for i, pod in enumerate(pods):
    dest = sched.schedule_pod(pod)
    print(f"Pod {i+1} -> {dest}")
Expected output:
Pod 1 -> node-a
Pod 2 -> node-b
Pod 3 -> node-c
Topology Spread Constraints
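apiVersion: apps/v1
kind: Deployment
metadata:
  name: spread-across-zones
spec:
  replicas: 9
  selector:  # Required for apps/v1 Deployments
    matchLabels:
      app: spread-across-zones
  template:
    metadata:
      labels:
        app: spread-across-zones
    spec:
      topologySpreadConstraints:
        # Hard rule: zones may differ by at most one matching pod
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: spread-across-zones
        # Soft rule: also try to balance across nodes
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:  # Without a selector the constraint counts no pods
            matchLabels:
              app: spread-across-zones
      containers:
        - name: app
          image: nginx  # Placeholder image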
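To see what `maxSkew: 1` means in practice, the sketch below computes which topology domains can accept one more pod. It assumes the documented definition of skew (matching pods in a domain minus the minimum across domains) and ignores details the real scheduler handles, such as which domains are eligible under node affinity or taints. The function names and zone names are illustrative only.

```python
def allowed_domains(counts: dict, max_skew: int) -> list:
    """Domains where adding one pod keeps the skew within max_skew."""
    global_min = min(counts.values())
    return sorted(
        domain for domain, n in counts.items()
        if (n + 1) - global_min <= max_skew
    )


def spread(replicas: int, domains: list, max_skew: int) -> dict:
    """Place replicas one at a time, always into an allowed domain."""
    counts = {d: 0 for d in domains}
    for _ in range(replicas):
        candidates = allowed_domains(counts, max_skew)
        # Deterministic choice for the demo: emptiest domain, then by name
        target = min(candidates, key=lambda d: (counts[d], d))
        counts[target] += 1
    return counts


zones = {"us-east-1a": 3, "us-east-1b": 3, "us-east-1c": 2}
print(allowed_domains(zones, 1))
# ['us-east-1c']

print(spread(9, ["us-east-1a", "us-east-1b", "us-east-1c"], 1))
# {'us-east-1a': 3, 'us-east-1b': 3, 'us-east-1c': 3}
```

With `whenUnsatisfiable: DoNotSchedule`, a pod that has no allowed domain stays Pending. With `ScheduleAnyway`, the skew only lowers a node's score.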
Custom Scheduler
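# custom_scheduler.py
from kubernetes import client, config, watch


class CustomScheduler:
    def __init__(self):
        # Assumes the scheduler runs inside the cluster with RBAC that
        # allows listing pods and nodes and creating bindings
        config.load_incluster_config()
        self.v1 = client.CoreV1Api()

    def schedule(self):
        w = watch.Watch()
        # Watch only pods that have no node assigned yet
        for event in w.stream(
            self.v1.list_pod_for_all_namespaces,
            field_selector="spec.nodeName=",
        ):
            # Once a pod is bound it leaves this selector, which shows up
            # as a non-ADDED event; skip those so we never bind twice
            if event["type"] != "ADDED":
                continue
            pod = event["object"]
            if pod.spec.scheduler_name != "custom-scheduler":
                continue
            node = self._best_node(pod)
            if node:
                self._bind(pod, node)
                print(f"Bound {pod.metadata.name} to {node}")

    def _best_node(self, pod):
        nodes = self.v1.list_node().items
        for node in nodes:
            # First fit: take the first node running fewer than 10 pods
            pods_on_node = self.v1.list_pod_for_all_namespaces(
                field_selector=f"spec.nodeName={node.metadata.name}"
            ).items
            if len(pods_on_node) < 10:
                return node.metadata.name
        return None

    def _bind(self, pod, node):
        binding = client.V1Binding(
            # The binding must carry the name of the pod being bound
            metadata=client.V1ObjectMeta(name=pod.metadata.name),
            target=client.V1ObjectReference(
                kind="Node", api_version="v1", name=node
            ),
        )
        # create_namespaced_binding takes (namespace, body).
        # _preload_content=False skips response deserialization,
        # which some client versions fail on for bindings.
        self.v1.create_namespaced_binding(
            namespace=pod.metadata.namespace,
            body=binding,
            _preload_content=False,
        )


if __name__ == "__main__":
    CustomScheduler().schedule()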
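The scheduler above only acts on pods whose `spec.schedulerName` is `custom-scheduler`; every other pod is left to the default scheduler. As a small sketch, the helper below builds such a pod manifest as a plain dict and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The function name `pod_for_scheduler`, the pod name, and the image are placeholders chosen for this example.

```python
import json


def pod_for_scheduler(name: str, image: str,
                      scheduler_name: str = "custom-scheduler") -> dict:
    """Build a minimal Pod manifest that opts in to a named scheduler."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Pods without this field go to the default scheduler
            "schedulerName": scheduler_name,
            "containers": [{"name": "app", "image": image}],
        },
    }


manifest = pod_for_scheduler("custom-scheduled-pod", "nginx")
print(json.dumps(manifest, indent=2))
```

If no running scheduler matches the name, the pod stays Pending, so a typo in `schedulerName` is worth ruling out first when a pod never gets a node.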
Common Mistakes
1. Using nodeSelector When nodeAffinity Is Needed
nodeSelector only matches exact label equality. nodeAffinity supports the In, NotIn, Exists, DoesNotExist, Gt, and Lt operators, both required and preferred semantics, and multiple conditions.
2. Forgetting NoExecute Evicts Running Pods
Adding a NoExecute taint to a node evicts all pods without the matching toleration. This can cause unexpected downtime. Use NoSchedule first, verify pods are placed correctly, then add NoExecute.
3. Pod Anti-Affinity on Large Clusters
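A required pod anti-affinity rule (requiredDuringSchedulingIgnoredDuringExecution) with topologyKey kubernetes.io/hostname prevents matching pods from sharing a node. With 100 replicas and 10 nodes, only 10 pods can be scheduled; the other 90 remain Pending. Use the preferred form (preferredDuringSchedulingIgnoredDuringExecution) or topologySpreadConstraints instead.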
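The arithmetic behind this mistake can be checked with a tiny simulation. The sketch below assumes a hard anti-affinity rule on the hostname topology key, so each node can hold at most one replica. The function name and node names are made up for illustration.

```python
def schedule_with_required_anti_affinity(replicas: int, node_names: list):
    """Place replicas so that no two of them share a node."""
    occupied = set()
    placed, pending = {}, []
    for i in range(replicas):
        pod = f"replica-{i}"
        free = [n for n in node_names if n not in occupied]
        if free:
            occupied.add(free[0])
            placed[pod] = free[0]
        else:
            # Every node already hosts a replica: the hard rule blocks placement
            pending.append(pod)
    return placed, pending


cluster = [f"node-{i}" for i in range(10)]
placed, pending = schedule_with_required_anti_affinity(100, cluster)
print(len(placed), len(pending))
# 10 90
```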
4. Overusing Hard Constraints
Hard constraints (requiredDuringSchedulingIgnoredDuringExecution) prevent scheduling when no node matches. Use soft constraints (preferredDuringSchedulingIgnoredDuringExecution) with weights for flexibility.
5. Not Testing Scheduling During Node Failures
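When a node fails, its pods may show as Unknown or Terminating. The node controller taints the node with node.kubernetes.io/unreachable or node.kubernetes.io/not-ready (effect NoExecute), and pods are evicted once their tolerationSeconds for that taint expires (300 seconds by default). Only pods managed by a controller such as a ReplicaSet or Deployment are then recreated and rescheduled; bare pods are not. Older clusters controlled this delay with the kube-controller-manager flag --pod-eviction-timeout.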
6. Ignoring Cluster Autoscaler Interactions
If the cluster autoscaler is active, unschedulable pods trigger node provisioning. Test that scheduling constraints don't prevent scale-up (e.g., a required node affinity to a label that new nodes don't have).
7. No Priority Classes for Critical Pods
Critical system pods (DNS, networking) should use priority classes to ensure they're scheduled before less important workloads. Without priority, a burst of low-priority pods can occupy resources needed by system components.
Practice Questions
1. What is the difference between nodeSelector and nodeAffinity?
nodeSelector only supports exact equality (key: value). nodeAffinity supports In, NotIn, Exists, DoesNotExist, Gt, Lt operators, and both required and preferred scheduling semantics.
2. How do taints differ from nodeAffinity?
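Taints repel pods from nodes (a node-centric constraint). Node affinity attracts pods to nodes (a pod-centric constraint). A toleration lets a pod ignore a matching taint, but it does not attract the pod to that node. Both are evaluated during scheduling; NoExecute taints are also enforced against pods that are already running.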
3. What topologyKey values are commonly used for pod affinity?
kubernetes.io/hostname (node-level), topology.kubernetes.io/zone (availability zone-level), topology.kubernetes.io/region (region-level). Custom labels can also be used.
4. How does the scheduler handle multiple priority functions?
The scheduler runs several scoring plugins (called priority functions in older releases), such as least requested, balanced resource allocation, node affinity, and taint toleration. Each one assigns every candidate node a score from 0 to 100, which is multiplied by a configurable weight. The node with the highest weighted total wins.
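The weighted sum described in this answer can be sketched as follows. The plugin names, per-node scores, and weights below are invented for illustration and do not reflect the defaults of any particular Kubernetes release.

```python
def total_scores(node_scores: dict, weights: dict) -> dict:
    """Combine per-plugin scores (0-100) into one weighted total per node."""
    totals = {}
    for plugin, per_node in node_scores.items():
        weight = weights.get(plugin, 1)  # Unlisted plugins count once
        for node, score in per_node.items():
            totals[node] = totals.get(node, 0) + weight * score
    return totals


scores = {
    "least_requested": {"node-a": 80, "node-b": 40},
    "node_affinity": {"node-a": 0, "node-b": 100},
    "taint_toleration": {"node-a": 100, "node-b": 100},
}
weights = {"least_requested": 1, "node_affinity": 2, "taint_toleration": 1}

totals = total_scores(scores, weights)
winner = max(totals, key=totals.get)
print(totals)  # {'node-a': 180, 'node-b': 340}
print(winner)  # node-b
```

Here node-a has more free capacity, but the doubled weight on node affinity moves the pod to node-b.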
5. Challenge: Design scheduling for a multi-tenant SaaS cluster.
Tenant A needs GPU nodes with guaranteed capacity. Tenant B needs spot instances for batch jobs. Tenant C needs zone-spread for high availability. Each has different SLOs. Design a scheduling strategy using taints, tolerations, priority classes, and node affinity.
Mini Project: Priority-Based Scheduler
# Reuses the Node class from the Scheduler Logic Simulator above
class PriorityScheduler:
    def __init__(self, nodes: list):
        self.nodes = nodes
        self.queue = []

    def submit(self, pod: dict):
        self.queue.append(pod)
        # Highest priority first
        self.queue.sort(key=lambda p: p.get("priority", 0), reverse=True)

    def schedule_all(self):
        results = []
        while self.queue:
            pod = self.queue.pop(0)
            best = None
            best_score = -1
            for node in self.nodes:
                if node.has_room(pod["cpu"], pod["mem"]):
                    s = node.score(pod["cpu"], pod["mem"])
                    if s > best_score:
                        best_score = s
                        best = node
            if best:
                best.schedule(pod["cpu"], pod["mem"])
                results.append((pod["name"], best.name, pod["priority"]))
            else:
                results.append((pod["name"], "Pending", pod["priority"]))
        return results


nodes = [Node(f"node-{i}", 4, 8192, {}) for i in range(2)]
sched = PriorityScheduler(nodes)
pods = [
    {"name": "critical", "cpu": 3, "mem": 4096, "priority": 100},
    {"name": "batch", "cpu": 3, "mem": 4096, "priority": 10},
    # 2 CPUs (not 1): with 1 CPU, batch would still fit on node-1
    {"name": "normal", "cpu": 2, "mem": 1024, "priority": 50},
]
for p in pods:
    sched.submit(p)
for name, dest, pri in sched.schedule_all():
    print(f"{name:>10} (pri {pri:>3}) -> {dest}")
Expected output:
  critical (pri 100) -> node-0
    normal (pri  50) -> node-1
     batch (pri  10) -> Pending
What's Next
Congratulations on completing this scheduling guide! Here's where to go from here:
- Practice daily — Add node selectors and taints to your deployments
- Build a project — Deploy a custom scheduler and run pods with it
- Explore related topics — Descheduler, cluster autoscaler, scheduling framework, scoring plugins
- Join the community — Share your scheduling strategies and get feedback
Remember: every expert was once a beginner. Keep scheduling!
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro