Kubernetes Scheduling Guide — Pod Placement Control

DodaTech Updated 2026-06-24 9 min read

This tutorial explains how the Kubernetes scheduler places pods and how you can control that placement. It covers key concepts, practical examples, and best practices.

Kubernetes scheduling assigns pods to nodes based on resource requirements, constraints, and policies, optimizing cluster utilization while respecting placement rules.

What You'll Learn

You'll master Kubernetes scheduling — the scheduler workflow (filter, score, bind), node selectors, taints and tolerations, node and pod affinity/anti-affinity, topology spread constraints, and custom scheduler patterns.

Why This Problem Matters

The default scheduler places pods using generic filters and scores; it knows nothing about your workload's hardware, latency, or isolation needs. Without explicit placement control, critical pods may land on under-provisioned nodes, latency-sensitive pods may be placed far apart, and resources may be fragmented inefficiently.

Real-World Use

Doda Browser's infrastructure team uses taints and tolerations to isolate GPU nodes for ML inference workloads. Pod affinity ensures web servers are co-located with their local Redis cache. Pod anti-affinity prevents two replicas of the same service from sharing a node.

Scheduler Workflow

flowchart TB
  Pod[Unscheduled Pod] --> Queue[Scheduling Queue]
  Queue --> Filter[Scheduling Filter]
  subgraph FilteringPhase
    F1[Node Unschedulable?]
    F2[Resource Fit?]
    F3[Taints Tolerated?]
    F4[Node Selector Match?]
    F5[Affinity Rules?]
  end
  Filter --> FilteringPhase
  FilteringPhase -->|Yes| Scoring[Scoring Phase]
  FilteringPhase -->|No| Skip[Node Skipped]
  Scoring --> Score[Score Nodes]
  Score --> Bind[Bind to Highest Score]
  Bind --> Kubelet[Kubelet Starts Pod]

Node Selector

The simplest scheduling constraint:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  nodeSelector:
    gpu: "true"  # Only nodes with this label
  containers:
    - name: ml-worker
      image: tensorflow/tensorflow:latest-gpu

# Label a node
kubectl label node worker-2 gpu=true

# Check where the pod was scheduled
kubectl get pod gpu-pod -o wide

Expected output (columns such as IP omitted):

NAME       READY   STATUS    RESTARTS   AGE   NODE
gpu-pod    1/1     Running   0          10s   worker-2

Taints and Tolerations

Taints repel pods unless the pod has a matching toleration:

apiVersion: v1
kind: Pod
metadata:
  name: critical-app
spec:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "critical"
      effect: "NoSchedule"
  containers:
    - name: app
      image: nginx

# Taint a node for critical workloads only
kubectl taint nodes worker-3 dedicated=critical:NoSchedule

# Apply the pod (it tolerates the taint)
kubectl apply -f critical-app.yaml

# Check tolerations in pod spec
kubectl get pod critical-app -o jsonpath='{.spec.tolerations}' | jq

Expected output (tolerations that Kubernetes adds by default are omitted):

[
  {
    "key": "dedicated",
    "operator": "Equal",
    "value": "critical",
    "effect": "NoSchedule"
  }
]

Taint Effects

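Effect             Behavior
NoSchedule         New pods without a matching toleration are not scheduled; running pods stay
PreferNoSchedule   The scheduler tries to avoid the node, but this is not enforced
NoExecute          New pods are not scheduled, and running pods without a matching toleration are evicted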
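
The matching rules behind these effects can be sketched in a few lines of Python. This is a simplified model, not the scheduler's own code: the function names `tolerates` and `blocks_scheduling` are hypothetical, and taints and tolerations are plain dictionaries that mirror the YAML fields.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if one toleration matches one taint."""
    # A toleration with no effect matches every taint effect.
    effect = toleration.get("effect")
    if effect and effect != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key")
    if operator == "Exists":
        # Exists with no key tolerates every taint; otherwise keys must match.
        return key is None or key == taint["key"]
    # Equal: key and value must both match.
    return key == taint["key"] and toleration.get("value") == taint.get("value")


def blocks_scheduling(taints: list, tolerations: list) -> bool:
    """True if any hard taint (NoSchedule or NoExecute) is left untolerated."""
    for taint in taints:
        if taint["effect"] == "PreferNoSchedule":
            continue  # soft preference: lowers the score, never filters
        if not any(tolerates(t, taint) for t in tolerations):
            return True
    return False


taint = {"key": "dedicated", "value": "critical", "effect": "NoSchedule"}
print(blocks_scheduling([taint], []))  # True
print(blocks_scheduling([taint], [{"key": "dedicated", "operator": "Exists"}]))  # False
```

The sketch covers scheduling only. Eviction of running pods under NoExecute is handled separately and is not modelled here.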

Node Affinity

apiVersion: apps/v1
kind: Deployment
metadata:
  name: zone-aware-app
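spec:
  replicas: 6
  selector:
    matchLabels:
      app: zone-aware-app
  template:
    metadata:
      labels:
        app: zone-aware-app
    spec:
      affinity:
        nodeAffinity:
          # Hard rule: only nodes in these zones are candidates
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: topology.kubernetes.io/zone
                    operator: In
                    values:
                      - us-east-1a
                      - us-east-1b
          # Soft rule: matching nodes score higher
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 50
              preference:
                matchExpressions:
                  - key: instance-type  # custom label; the well-known key is node.kubernetes.io/instance-type
                    operator: In
                    values:
                      - c5.large
      containers:
        - name: app
          image: nginx  # placeholder image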
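
To make the difference between the required and preferred rules concrete, here is a minimal Python sketch. It assumes a single nodeSelectorTerm and models only the In operator; `matches_in`, `rank_nodes`, and the node data are hypothetical names made up for illustration.

```python
def matches_in(labels: dict, expressions: list) -> bool:
    """All expressions must hold; only the In operator is modelled here."""
    return all(labels.get(e["key"]) in e["values"] for e in expressions)


def rank_nodes(nodes: dict, required: list, preferred: list) -> list:
    """Filter by the required expressions, then sort by summed preference weights."""
    ranked = []
    for name, labels in nodes.items():
        if not matches_in(labels, required):
            continue  # hard rule: the node is not a candidate at all
        bonus = sum(p["weight"] for p in preferred
                    if matches_in(labels, p["expressions"]))
        ranked.append((name, bonus))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Hypothetical nodes and labels for illustration
nodes = {
    "n1": {"topology.kubernetes.io/zone": "us-east-1a", "instance-type": "m5.large"},
    "n2": {"topology.kubernetes.io/zone": "us-east-1b", "instance-type": "c5.large"},
    "n3": {"topology.kubernetes.io/zone": "us-east-1c", "instance-type": "c5.large"},
}
required = [{"key": "topology.kubernetes.io/zone",
             "values": ["us-east-1a", "us-east-1b"]}]
preferred = [{"weight": 50,
              "expressions": [{"key": "instance-type", "values": ["c5.large"]}]}]

print(rank_nodes(nodes, required, preferred))  # [('n2', 50), ('n1', 0)]
```

n3 has the preferred instance type but is dropped because it fails the required zone rule, while n1 stays a candidate with no bonus.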

Pod Affinity and Anti-Affinity

apiVersion: apps/v1
kind: Deployment
metadata:
  name: co-located-services
spec:
  replicas: 3
  selector:
    matchLabels:
      app: co-located-services
  template:
    metadata:
      labels:
        app: co-located-services  # matched by the anti-affinity rule below
    spec:
      affinity:
        # Pods prefer to be near the cache
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: redis-cache
                topologyKey: "kubernetes.io/hostname"
        # Pods must not share a node with each other
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: co-located-services
              topologyKey: "kubernetes.io/hostname"
      containers:
        - name: app
          image: nginx  # placeholder image
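
The topologyKey decides what "sharing" means: nodes with the same value for that label form one domain. The sketch below is a simplified illustration of a required anti-affinity check, assuming pods are described only by their labels; the function name and cluster data are hypothetical.

```python
def violates_anti_affinity(candidate: str, node_labels: dict,
                           pods_by_node: dict, selector: dict,
                           topology_key: str) -> bool:
    """True if a pod matching `selector` already runs in the candidate's domain."""
    domain = node_labels[candidate].get(topology_key)
    for node, pods in pods_by_node.items():
        if node_labels[node].get(topology_key) != domain:
            continue  # different topology domain, so it cannot conflict
        for pod_labels in pods:
            if all(pod_labels.get(k) == v for k, v in selector.items()):
                return True
    return False


# Hypothetical two-zone cluster
node_labels = {
    "n1": {"kubernetes.io/hostname": "n1", "topology.kubernetes.io/zone": "a"},
    "n2": {"kubernetes.io/hostname": "n2", "topology.kubernetes.io/zone": "a"},
    "n3": {"kubernetes.io/hostname": "n3", "topology.kubernetes.io/zone": "b"},
}
pods_by_node = {"n1": [{"app": "co-located-services"}], "n2": [], "n3": []}
selector = {"app": "co-located-services"}

# Per-host anti-affinity: n2 is a different host, so it is allowed.
print(violates_anti_affinity("n2", node_labels, pods_by_node, selector,
                             "kubernetes.io/hostname"))       # False
# Per-zone anti-affinity: n2 shares zone "a" with n1, so it is rejected.
print(violates_anti_affinity("n2", node_labels, pods_by_node, selector,
                             "topology.kubernetes.io/zone"))  # True
```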

Scheduler Logic Simulator

# Toy model of the filter-then-score loop; not the real kube-scheduler

class Node:
    def __init__(self, name: str, cpu: int, memory: int, labels: dict):
        self.name = name
        self.cpu = cpu
        self.memory = memory
        self.cpu_used = 0
        self.memory_used = 0
        self.labels = labels
        self.taints = []

    def has_room(self, req_cpu: int, req_mem: int) -> bool:
        return (self.cpu_used + req_cpu <= self.cpu and
                self.memory_used + req_mem <= self.memory)

    def schedule(self, req_cpu: int, req_mem: int):
        self.cpu_used += req_cpu
        self.memory_used += req_mem

    def score(self, req_cpu: int, req_mem: int) -> float:
        cpu_score = 1 - (self.cpu_used / self.cpu)
        mem_score = 1 - (self.memory_used / self.memory)
        return (cpu_score + mem_score) / 2

class Scheduler:
    def __init__(self, nodes: list):
        self.nodes = nodes

    def schedule_pod(self, pod: dict) -> str:
        # Filter
        candidates = []
        for node in self.nodes:
            if not self._matches_node_selector(pod, node):
                continue
            if not self._tolerates_taints(pod, node):
                continue
            if not node.has_room(
                pod["cpu_request"], pod["memory_request"]
            ):
                continue
            candidates.append(node)

        if not candidates:
            return "Pending (no suitable node)"

        # Score: pick the node with the most free capacity (least-allocated)
        best_node = max(
            candidates,
            key=lambda n: n.score(
                pod["cpu_request"], pod["memory_request"]
            )
        )
        best_node.schedule(pod["cpu_request"], pod["memory_request"])
        return best_node.name

    def _matches_node_selector(self, pod: dict, node: Node) -> bool:
        selector = pod.get("node_selector", {})
        return all(node.labels.get(k) == v for k, v in selector.items())

    def _tolerates_taints(self, pod: dict, node: Node) -> bool:
        tolerations = pod.get("tolerations", [])
        for taint in node.taints:
            tolerated = any(
                t["key"] == taint["key"]
                and t.get("value") == taint.get("value")
                and t.get("effect") == taint.get("effect")
                for t in tolerations
            )
            if not tolerated:
                return False
        return True

nodes = [
    Node("node-a", 4, 8192, {"zone": "us-east-1a", "gpu": "true"}),
    Node("node-b", 8, 16384, {"zone": "us-east-1a"}),
    Node("node-c", 4, 8192, {"zone": "us-east-1b"}),
]
nodes[0].taints = [{"key": "gpu", "value": "true", "effect": "NoSchedule"}]

sched = Scheduler(nodes)

pods = [
    {"cpu_request": 1, "memory_request": 1024, "tolerations": [
        {"key": "gpu", "value": "true", "effect": "NoSchedule"}
    ]},
    {"cpu_request": 2, "memory_request": 2048, "node_selector": {"zone": "us-east-1a"}},
    {"cpu_request": 1, "memory_request": 512},
]

for i, pod in enumerate(pods):
    dest = sched.schedule_pod(pod)
    print(f"Pod {i+1} -> {dest}")

Expected output:

Pod 1 -> node-a
Pod 2 -> node-b
Pod 3 -> node-c

Topology Spread Constraints

apiVersion: apps/v1
kind: Deployment
metadata:
  name: spread-across-zones
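spec:
  replicas: 9
  selector:
    matchLabels:
      app: spread-across-zones
  template:
    metadata:
      labels:
        app: spread-across-zones
    spec:
      topologySpreadConstraints:
        # Hard: zones may differ by at most one matching pod
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: spread-across-zones
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: spread-across-zones
      containers:
        - name: app
          image: nginx  # placeholder image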
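
The effect of maxSkew can be checked with simple arithmetic: a domain is acceptable if, after adding the new pod, its matching-pod count exceeds the smallest count in any domain by no more than maxSkew. A minimal sketch, assuming the per-zone counts are already known (the function name and numbers are hypothetical):

```python
def allowed_domains(counts: dict, max_skew: int) -> list:
    """Domains where adding one pod keeps the skew within max_skew."""
    global_min = min(counts.values())
    return [d for d, n in counts.items() if (n + 1) - global_min <= max_skew]


# Hypothetical matching-pod counts per zone
counts = {"us-east-1a": 3, "us-east-1b": 3, "us-east-1c": 2}

print(allowed_domains(counts, max_skew=1))  # ['us-east-1c']
print(allowed_domains(counts, max_skew=2))  # ['us-east-1a', 'us-east-1b', 'us-east-1c']
```

With DoNotSchedule, zones outside this list are filtered out. With ScheduleAnyway, they stay eligible but score lower.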

Custom Scheduler

# custom_scheduler.py
import kubernetes
from kubernetes import client, config

class CustomScheduler:
    def __init__(self):
        config.load_incluster_config()
        self.v1 = client.CoreV1Api()

    def schedule(self):
        w = kubernetes.watch.Watch()
        for event in w.stream(
            self.v1.list_pod_for_all_namespaces,
            field_selector="spec.nodeName=",
        ):
            pod = event["object"]
            if pod.spec.scheduler_name != "custom-scheduler":
                continue

            node = self._best_node(pod)
            if node:
                self._bind(pod, node)
                print(f"Bound {pod.metadata.name} to {node}")

    def _best_node(self, pod):
        nodes = self.v1.list_node().items
        for node in nodes:
            # First-fit: pick the first node running fewer than 10 pods
            pods_on_node = self.v1.list_pod_for_all_namespaces(
                field_selector=f"spec.nodeName={node.metadata.name}"
            ).items
            if len(pods_on_node) < 10:
                return node.metadata.name
        return None

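    def _bind(self, pod, node):
        # The Binding names the pod in metadata and the node in target
        binding = client.V1Binding(
            metadata=client.V1ObjectMeta(name=pod.metadata.name),
            target=client.V1ObjectReference(
                api_version="v1", kind="Node", name=node
            ),
        )
        # create_namespaced_binding takes (namespace, body). Skipping response
        # parsing avoids a deserialization error seen in some client versions.
        self.v1.create_namespaced_binding(
            pod.metadata.namespace, binding, _preload_content=False
        )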

Common Mistakes

1. Using nodeSelector When nodeAffinity Is Needed

nodeSelector only matches exact label equality. nodeAffinity supports the In, NotIn, Exists, DoesNotExist, Gt, and Lt operators, required and preferred semantics, and multiple conditions.

2. Forgetting NoExecute Evicts Running Pods

Adding a NoExecute taint to a node evicts all pods without the matching toleration. This can cause unexpected downtime. Use NoSchedule first, verify pods are placed correctly, then add NoExecute.

3. Pod Anti-Affinity on Large Clusters

Required pod anti-affinity with topologyKey kubernetes.io/hostname allows at most one matching pod per node. With 100 replicas and 10 nodes, only 10 pods can be scheduled and the other 90 remain Pending. Use preferredDuringSchedulingIgnoredDuringExecution or topologySpreadConstraints instead.
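
The arithmetic is easy to reproduce. This is a toy sketch that assumes every node is otherwise eligible and has room; the function and node names are hypothetical.

```python
def place_with_host_anti_affinity(replicas: int, node_names: list) -> dict:
    """Place one replica per node; replicas beyond the node count stay Pending."""
    free = list(node_names)
    placement = {}
    for i in range(replicas):
        placement[f"replica-{i}"] = free.pop(0) if free else "Pending"
    return placement


result = place_with_host_anti_affinity(100, [f"node-{i}" for i in range(10)])
pending = sum(1 for dest in result.values() if dest == "Pending")
print(pending)  # 90
```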

4. Overusing Hard Constraints

Hard constraints (requiredDuringSchedulingIgnoredDuringExecution) prevent scheduling when no node matches. Use soft constraints (preferredDuringSchedulingIgnoredDuringExecution) with weights for flexibility.

5. Not Testing Scheduling During Node Failures

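When a node fails, its pods are not rescheduled immediately. The node controller taints the node with node.kubernetes.io/unreachable (or node.kubernetes.io/not-ready), and pods are evicted once their toleration for that taint expires, which is 300 seconds by default. Pods owned by a ReplicaSet or Deployment are then replaced by new pods that the scheduler places elsewhere; bare pods are not recreated. Test this path so you know how long failover actually takes.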

6. Ignoring Cluster Autoscaler Interactions

If the cluster autoscaler is active, unschedulable pods trigger node provisioning. Test that scheduling constraints don't prevent scale-up (e.g., a required node affinity to a label that new nodes don't have).

7. No Priority Classes for Critical Pods

Critical system pods (DNS, networking) should use priority classes to ensure they're scheduled before less important workloads. Without priority, a burst of low-priority pods can occupy resources needed by system components.

Practice Questions

1. What is the difference between nodeSelector and nodeAffinity?

nodeSelector only supports exact equality (key: value). nodeAffinity supports In, NotIn, Exists, DoesNotExist, Gt, Lt operators, and both required and preferred scheduling semantics.
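
As an illustration of what those operators mean, here is a small Python sketch that evaluates one matchExpressions entry against a node's labels. It is a simplified model (the function name `match_expression` and the sample labels are hypothetical), with Gt and Lt treated as integer comparisons against a single value.

```python
def match_expression(labels: dict, expr: dict) -> bool:
    """Evaluate one node-affinity match expression against node labels."""
    key, op = expr["key"], expr["operator"]
    values = expr.get("values", [])
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    if op == "In":
        return labels.get(key) in values
    if op == "NotIn":
        # A node without the label also satisfies NotIn.
        return labels.get(key) not in values
    if op in ("Gt", "Lt"):
        # Compare the label value as an integer against a single bound.
        if key not in labels:
            return False
        try:
            actual, bound = int(labels[key]), int(values[0])
        except (ValueError, IndexError):
            return False
        return actual > bound if op == "Gt" else actual < bound
    raise ValueError(f"unknown operator: {op}")


# Hypothetical node labels
labels = {"gpu": "true", "cpu-count": "16"}

print(match_expression(labels, {"key": "gpu", "operator": "In", "values": ["true"]}))     # True
print(match_expression(labels, {"key": "cpu-count", "operator": "Gt", "values": ["8"]}))  # True
print(match_expression(labels, {"key": "spot", "operator": "DoesNotExist"}))              # True
```

A nodeSelector entry such as gpu: "true" is equivalent to the first expression, an In with exactly one value.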

2. How do taints differ from nodeAffinity?

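Taints repel pods from nodes (a node-centric constraint), and a toleration on a pod lets it ignore a matching taint. Node affinity attracts pods to nodes (a pod-centric constraint). A toleration only permits placement on a tainted node; it does not steer the pod there, so dedicated nodes usually need both a taint and node affinity. Both are evaluated during scheduling.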

3. What topologyKey values are commonly used for pod affinity?

kubernetes.io/hostname (node-level), topology.kubernetes.io/zone (availability zone-level), and topology.kubernetes.io/region (region-level). Custom labels can also be used.

4. How does the scheduler handle multiple priority functions?

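Priority functions are called scoring plugins in current releases. The scheduler runs a default set of them (least requested, balanced resource allocation, node affinity, taint toleration). Each plugin gives every candidate node a score normalized to 0-100, the scores are multiplied by configurable per-plugin weights, and the node with the highest weighted sum wins. Ties are broken by picking one of the top nodes at random.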
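
The weighted sum can be sketched as follows. The plugin names, scores, and weights are made up for illustration and do not reflect the scheduler's actual defaults.

```python
def total_scores(plugin_scores: dict, weights: dict) -> dict:
    """Sum weight * score per node across all scoring functions."""
    totals = {}
    for plugin, per_node in plugin_scores.items():
        for node, score in per_node.items():
            totals[node] = totals.get(node, 0) + weights.get(plugin, 1) * score
    return totals


# Hypothetical normalized scores (0-100) from three scoring functions
plugin_scores = {
    "least_requested": {"node-a": 80, "node-b": 40},
    "balanced_allocation": {"node-a": 60, "node-b": 90},
    "node_affinity": {"node-a": 0, "node-b": 100},
}
weights = {"least_requested": 1, "balanced_allocation": 1, "node_affinity": 2}

totals = total_scores(plugin_scores, weights)
print(totals)                       # {'node-a': 140, 'node-b': 330}
print(max(totals, key=totals.get))  # node-b
```

Doubling the node affinity weight is what lets node-b win here even though node-a has more free resources.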

5. Challenge: Design scheduling for a multi-tenant SaaS cluster.

Tenant A needs GPU nodes with guaranteed capacity. Tenant B needs spot instances for batch jobs. Tenant C needs zone-spread for high availability. Each has different SLOs. Design a scheduling strategy using taints, tolerations, priority classes, and node affinity.

Mini Project: Priority-Based Scheduler

class PriorityScheduler:
    def __init__(self, nodes: list):
        self.nodes = nodes
        self.queue = []

    def submit(self, pod: dict):
        self.queue.append(pod)
        self.queue.sort(key=lambda p: p.get("priority", 0), reverse=True)

    def schedule_all(self):
        results = []
        while self.queue:
            pod = self.queue.pop(0)
            best = None
            best_score = -1
            for node in self.nodes:
                if node.has_room(pod["cpu"], pod["mem"]):
                    s = node.score(pod["cpu"], pod["mem"])
                    if s > best_score:
                        best_score = s
                        best = node
            if best:
                best.schedule(pod["cpu"], pod["mem"])
                results.append((pod["name"], best.name, pod["priority"]))
            else:
                results.append((pod["name"], "Pending", pod["priority"]))
        return results

nodes = [Node(f"node-{i}", 4, 8192, {}) for i in range(2)]
sched = PriorityScheduler(nodes)

pods = [
    {"name": "critical", "cpu": 3, "mem": 4096, "priority": 100},
    {"name": "batch", "cpu": 3, "mem": 4096, "priority": 10},
    {"name": "normal", "cpu": 1, "mem": 1024, "priority": 50},
]

for p in pods:
    sched.submit(p)

for name, dest, pri in sched.schedule_all():
    print(f"{name:>10} (pri {pri:>3}) -> {dest}")

Expected output:

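  critical (pri 100) -> node-0
    normal (pri  50) -> node-1
     batch (pri  10) -> node-1

Here batch still fits on node-1 next to normal (1 + 3 CPUs out of 4). Raise normal's CPU request to 2 and batch stays Pending, because neither node has 3 CPUs free.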

FAQ

Can I run multiple schedulers in the same cluster?

Yes. Each pod can specify a schedulerName in its spec. The default scheduler handles pods with no schedulerName. You can deploy custom schedulers as Deployments that watch for pods with specific scheduler names.

What happens if the default scheduler fails?

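In kubeadm-based clusters the scheduler usually runs as a static pod on each control plane node in the kube-system namespace; managed services run it for you. If no scheduler instance is running, new pods stay Pending while already-running pods are unaffected. For high availability, run more than one instance with leader election enabled so a standby takes over when the active one fails.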

How does the scheduler handle resource fragmentation?

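With its default configuration, the scheduler scores nodes higher when they have more free resources (the LeastAllocated strategy) and when CPU and memory usage stay proportionally balanced. This spreads load but does not pack nodes tightly. For bin-packing, configure the MostAllocated scoring strategy or write a custom scoring plugin.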
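
The difference between spreading and bin-packing comes down to the direction of the score. The sketch below compares the two on a single resource; the function names and node data are hypothetical, and the real plugins combine several resources.

```python
def least_allocated(used: float, capacity: float) -> float:
    """Higher score for emptier nodes (spreads load)."""
    return (capacity - used) / capacity * 100


def most_allocated(used: float, capacity: float) -> float:
    """Higher score for fuller nodes (bin-packing)."""
    return used / capacity * 100


def pick(nodes: dict, request: int, strategy) -> str:
    """Choose the fitting node with the best score after adding the request."""
    fitting = {n: (u, c) for n, (u, c) in nodes.items() if u + request <= c}
    return max(fitting,
               key=lambda n: strategy(fitting[n][0] + request, fitting[n][1]))


# Hypothetical CPU usage as (used, capacity) in cores
nodes = {"node-a": (6, 8), "node-b": (1, 8)}

print(pick(nodes, 2, least_allocated))  # node-b
print(pick(nodes, 2, most_allocated))   # node-a
```

Bin-packing fills node-a completely and leaves node-b free for a large pod, while spreading keeps headroom on both nodes.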

What's Next

Kubernetes Service Mesh Guide
Kubernetes Priority & Preemption
Kubernetes Pod Lifecycle Guide

Congratulations on completing this scheduling guide! Here's where to go from here:

  • Practice daily — Add node selectors and taints to your deployments
  • Build a project — Deploy a custom scheduler and run pods with it
  • Explore related topics — Descheduler, cluster autoscaler, scheduling framework, scoring plugins
  • Join the community — Share your scheduling strategies and get feedback

Remember: every expert was once a beginner. Keep scheduling!
