Kubernetes StatefulSets Guide — Stateful Application Management
This tutorial covers Kubernetes StatefulSets: the key concepts, practical examples, and best practices you need to run stateful workloads.
Kubernetes StatefulSets manage stateful applications by providing stable network identities, ordered deployment and scaling, and persistent storage that follows each pod through rescheduling.
What You'll Learn
You'll master StatefulSets — stable pod identities with ordinal indexing and hostnames, PersistentVolumeClaim templates for per-pod storage, headless Services, ordered rolling updates, and graceful scaling for stateful workloads.
Why This Problem Matters
Deployments treat pods as interchangeable. Databases, message queues, and distributed systems need stable identities: each pod must be uniquely identifiable and keep its storage across rescheduling. StatefulSets provide these guarantees for stateful applications.
Real-World Use
DodaZIP's metadata database cluster runs on StatefulSets with three replicas. Each pod has a dedicated PVC that persists across rescheduling. The headless Service gives each pod a stable DNS name (for example, postgres-0.postgres.dodatech.svc.cluster.local) that the replication configuration can rely on.
StatefulSet Architecture
flowchart TB
    subgraph StatefulSet
        SS[StatefulSet: postgres]
        SS --> Pod0[postgres-0]
        SS --> Pod1[postgres-1]
        SS --> Pod2[postgres-2]
    end
    subgraph HeadlessService
        SVC["Service: postgres<br/>clusterIP: None"]
    end
    subgraph Storage
        PVC0[PVC data-postgres-0 ➔ 100Gi]
        PVC1[PVC data-postgres-1 ➔ 100Gi]
        PVC2[PVC data-postgres-2 ➔ 100Gi]
    end
    subgraph DNS
        DNS0[postgres-0.postgres.default.svc.cluster.local]
        DNS1[postgres-1.postgres.default.svc.cluster.local]
        DNS2[postgres-2.postgres.default.svc.cluster.local]
    end
    Pod0 --- PVC0
    Pod0 --- DNS0
    Pod1 --- PVC1
    Pod1 --- DNS1
    Pod2 --- PVC2
    Pod2 --- DNS2
    Pod0 --- SVC
    Pod1 --- SVC
    Pod2 --- SVC
Basic StatefulSet
# statefulset-postgres.yaml
apiVersion: v1
kind: Service
metadata:
  name: postgres
  labels:
    app: postgres
spec:
  clusterIP: None  # Headless service
  ports:
    - port: 5432
      name: postgres
  selector:
    app: postgres
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres  # Must match the headless Service above
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:16
          env:
            - name: POSTGRES_PASSWORD
              value: secret  # Demo only; reference a Secret in real clusters
            # Use a subdirectory so initdb does not fail on a non-empty
            # mount root (for example, one containing lost+found)
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
          ports:
            - containerPort: 5432
              name: postgres
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd  # Must exist in your cluster
        resources:
          requests:
            storage: 100Gi
kubectl apply -f statefulset-postgres.yaml
kubectl get sts
kubectl get pods -w
Expected output:
NAME       READY   AGE
postgres   3/3     2m

NAME         READY   STATUS    RESTARTS   AGE
postgres-0   1/1     Running   0          2m
postgres-1   1/1     Running   0          1m
postgres-2   1/1     Running   0          30s
Notice the ordered creation: postgres-0 starts first, then postgres-1 after it's Ready, then postgres-2.
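The ordering rule can be sketched in a few lines of Python. This is only a toy model, not the controller's implementation: `ordered_ready_rollout` and its `becomes_ready` callback are hypothetical names, with the callback standing in for a pod's readiness probe.

```python
from typing import Callable, List


def ordered_ready_rollout(name: str, replicas: int,
                          becomes_ready: Callable[[str], bool]) -> List[str]:
    """Return the pods that get created, in creation order.

    `becomes_ready` is a hypothetical stand-in for the readiness probe.
    """
    created = []
    for ordinal in range(replicas):
        pod = f"{name}-{ordinal}"
        created.append(pod)
        if not becomes_ready(pod):
            # The next ordinal is not created until this pod is Ready.
            break
    return created


# Every pod becomes Ready, so all ordinals are created, lowest first.
print(ordered_ready_rollout("postgres", 3, lambda pod: True))

# postgres-1 never becomes Ready, so postgres-2 is never created.
print(ordered_ready_rollout("postgres", 3, lambda pod: pod != "postgres-1"))
```

The second call illustrates why a failing readiness probe on a low ordinal stalls the whole rollout under the default policy.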
Stable DNS Names
# Query DNS from another pod
kubectl run dns-test --image=busybox:1.36 --rm -it --restart=Never -- \
nslookup postgres-0.postgres.default.svc.cluster.local
Expected output:
Server: 10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
Name: postgres-0.postgres.default.svc.cluster.local
Address 1: 10.244.1.5
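Because the names are predictable, a pod can work out its own ordinal and the addresses of its peers without querying the API server. The helpers below are a minimal sketch of that idea; the function names are hypothetical, and the `cluster.local` domain is an assumption that holds only for clusters using the default cluster domain.

```python
import re
from typing import List


def ordinal_from_hostname(hostname: str) -> int:
    """Extract the trailing ordinal from a StatefulSet pod hostname."""
    match = re.fullmatch(r"(.+)-(\d+)", hostname)
    if match is None:
        raise ValueError(f"not a StatefulSet pod hostname: {hostname!r}")
    return int(match.group(2))


def peer_dns_names(statefulset: str, service: str, namespace: str,
                   replicas: int,
                   cluster_domain: str = "cluster.local") -> List[str]:
    """Build the per-pod DNS names served by the headless Service."""
    return [
        f"{statefulset}-{i}.{service}.{namespace}.svc.{cluster_domain}"
        for i in range(replicas)
    ]


print(ordinal_from_hostname("postgres-2"))
for dns_name in peer_dns_names("postgres", "postgres", "default", 3):
    print(dns_name)
```

A startup script might use the ordinal to decide, for instance, that ordinal 0 initializes as the primary while higher ordinals configure themselves as replicas.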
Pod Identity Simulation
from typing import Optional


class StatefulSetPod:
    """A pod whose name, hostname, and DNS entry derive from its ordinal."""

    def __init__(self, ordinal: int, statefulset: str, service_name: str):
        self.ordinal = ordinal
        self.name = f"{statefulset}-{ordinal}"
        self.hostname = self.name
        # The DNS subdomain is the governing headless Service
        # (spec.serviceName), which may differ from the StatefulSet name.
        self.subdomain = service_name
        self.ready = False

    def dns_name(self, namespace: str = "default") -> str:
        return (f"{self.name}.{self.subdomain}."
                f"{namespace}.svc.cluster.local")

    def __repr__(self):
        return (f"Pod({self.name}, "
                f"dns={self.dns_name()}, "
                f"ready={self.ready})")


class StatefulSet:
    def __init__(self, name: str, replicas: int,
                 service_name: Optional[str] = None):
        self.name = name
        self.replicas = replicas
        self.service_name = service_name or name
        self.pods = [
            StatefulSetPod(i, name, self.service_name)
            for i in range(replicas)
        ]
        self.volumes = {}

    def scale(self, new_replicas: int):
        if new_replicas > self.replicas:
            # Scale up: new pods take the next free ordinals.
            for i in range(self.replicas, new_replicas):
                self.pods.append(
                    StatefulSetPod(i, self.name, self.service_name)
                )
        elif new_replicas < self.replicas:
            # Scale down: the highest ordinals are removed.
            self.pods = self.pods[:new_replicas]
            print(f"Scaling down to {new_replicas}: "
                  f"pods {new_replicas}-{self.replicas - 1} "
                  f"terminated")
        self.replicas = new_replicas

    def rolling_update(self, new_image: str):
        # Updates proceed from the highest ordinal to the lowest.
        for pod in reversed(self.pods):
            print(f"Updating {pod.name} to {new_image}...")
            pod.ready = False
            # Simulate update
            pod.ready = True
            print(f" {pod.name} updated")

    def get_pod_by_ordinal(self, ordinal: int) -> StatefulSetPod:
        return self.pods[ordinal]


sts = StatefulSet("postgres", 3)
for pod in sts.pods:
    pod.ready = True
    print(f" {pod.name}: {pod.dns_name()}")

print("\nScaling from 3 to 5...")
sts.scale(5)
for pod in sts.pods:
    print(f" {pod.name}: hostname={pod.hostname}")

print("\nRolling update from postgres:16 to postgres:17...")
sts.rolling_update("postgres:17")
Expected output:
 postgres-0: postgres-0.postgres.default.svc.cluster.local
 postgres-1: postgres-1.postgres.default.svc.cluster.local
 postgres-2: postgres-2.postgres.default.svc.cluster.local

Scaling from 3 to 5...
 postgres-0: hostname=postgres-0
 postgres-1: hostname=postgres-1
 postgres-2: hostname=postgres-2
 postgres-3: hostname=postgres-3
 postgres-4: hostname=postgres-4

Rolling update from postgres:16 to postgres:17...
Updating postgres-2 to postgres:17...
 postgres-2 updated
Updating postgres-1 to postgres:17...
 postgres-1 updated
Updating postgres-0 to postgres:17...
 postgres-0 updated
Ordered Pod Management
# Ordered pod management policies
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zookeeper
spec:
  podManagementPolicy: OrderedReady  # Default: create/delete one at a time
  # Alternative: Parallel (start all pods simultaneously)
  # podManagementPolicy: Parallel
  serviceName: zookeeper
  replicas: 3
  selector:  # Required; must match the template labels
    matchLabels:
      app: zookeeper
  template:
    metadata:
      labels:
        app: zookeeper
    spec:
      containers:
        - name: zookeeper
          image: zookeeper:3.9
Parallel Pod Management
For workloads where startup ordering doesn't matter:
spec:
  podManagementPolicy: Parallel
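Use Parallel for workloads that can handle all pods starting or terminating at the same time, such as peer-to-peer clusters whose members join independently. The policy affects only scaling operations; rolling updates still proceed one pod at a time.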
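To get a feel for the trade-off, here is a rough back-of-the-envelope model of startup time under each policy. It is a simplification that ignores scheduling and image-pull overhead; `startup_seconds` is a hypothetical helper and the readiness times are made-up numbers.

```python
from typing import Sequence


def startup_seconds(ready_times: Sequence[float], policy: str) -> float:
    """Estimate the time until every pod is Ready.

    ready_times[i] is how long pod i takes to become Ready once created.
    """
    if policy == "OrderedReady":
        # Each pod waits for its predecessor, so the times add up.
        return sum(ready_times)
    if policy == "Parallel":
        # All pods start together, so the slowest one dominates.
        return max(ready_times, default=0)
    raise ValueError(f"unknown podManagementPolicy: {policy!r}")


ready_times = [30, 45, 20]  # Made-up per-pod readiness times in seconds
print(startup_seconds(ready_times, "OrderedReady"))  # 95
print(startup_seconds(ready_times, "Parallel"))      # 45
```

The gap grows with the replica count, which is why Parallel is attractive for large clusters whose members do not need a strict bootstrap order.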
Update Strategy
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  updateStrategy:
    type: RollingUpdate  # Default
    rollingUpdate:
      # How many pods can be down during update. Honored only where the
      # MaxUnavailableStatefulSet feature gate is enabled.
      maxUnavailable: 1
      partition: 0  # Only update ordinals >= partition
  # Alternative: OnDelete (manual pod deletion triggers update)
  # updateStrategy:
  #   type: OnDelete
Canary Updates with Partition
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 2  # Only update pods with ordinal >= 2
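With partition: 2, only pods with ordinal 2 or higher (pod-2, pod-3, etc.) are updated. Pod-0 and pod-1 stay on the old version, so the updated pods act as a canary that you can test in production before lowering the partition to roll out to the rest.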
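The partition rule is simple enough to express as a one-line range. The sketch below uses a hypothetical helper, `pods_to_update`, to list which pods a rolling update would touch and in what order.

```python
from typing import List


def pods_to_update(name: str, replicas: int, partition: int = 0) -> List[str]:
    """Pods a RollingUpdate touches, highest ordinal first."""
    if partition < 0:
        raise ValueError("partition must be non-negative")
    # Only ordinals >= partition are updated, in reverse ordinal order.
    return [f"{name}-{i}" for i in range(replicas - 1, partition - 1, -1)]


print(pods_to_update("postgres", 4, partition=2))  # canary: ordinals 3 and 2
print(pods_to_update("postgres", 4, partition=0))  # full rollout
print(pods_to_update("postgres", 4, partition=4))  # nothing is updated
```

Lowering the partition step by step (for example 3, then 2, then 0) turns one rolling update into a staged rollout.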
Persistent Storage Per Pod
class VolumeClaimTemplate:
    def __init__(self, name: str, size: str, storage_class: str):
        self.name = name
        self.size = size
        self.storage_class = storage_class

    def claim_name(self, pod_name: str) -> str:
        # PVC names follow <template-name>-<pod-name>, e.g. data-postgres-0.
        return f"{self.name}-{pod_name}"


class StatefulSetPVCManager:
    def __init__(self):
        self.pvcs = {}

    def create_pvc(self, sts_name: str, ordinal: int,
                   template: VolumeClaimTemplate):
        pod_name = f"{sts_name}-{ordinal}"
        claim_name = template.claim_name(pod_name)
        # An existing claim is reused, which is how data survives rescheduling.
        if claim_name not in self.pvcs:
            self.pvcs[claim_name] = {
                "size": template.size,
                "storage_class": template.storage_class,
                "bound_to": pod_name,
                "status": "Bound"
            }
            print(f"Created PVC {claim_name} ({template.size}) "
                  f"for {pod_name}")

    def delete_pod_pvcs(self, sts_name: str, ordinal: int):
        pod_name = f"{sts_name}-{ordinal}"
        to_delete = [
            name for name, pvc in self.pvcs.items()
            if pvc["bound_to"] == pod_name
        ]
        for name in to_delete:
            del self.pvcs[name]
            print(f"Deleted PVC {name}")

    def verify_storage_retention(self, sts_name: str, ordinal: int):
        pod_name = f"{sts_name}-{ordinal}"
        # Assumes the claim template is named "data", as in this example.
        claim_name = f"data-{pod_name}"
        return claim_name in self.pvcs


manager = StatefulSetPVCManager()
template = VolumeClaimTemplate("data", "100Gi", "fast-ssd")
for i in range(3):
    manager.create_pvc("postgres", i, template)

print(f"\nPVC for postgres-0 exists: "
      f"{manager.verify_storage_retention('postgres', 0)}")
manager.delete_pod_pvcs("postgres", 0)
print(f"PVC for postgres-0 after delete: "
      f"{manager.verify_storage_retention('postgres', 0)}")
Expected output:
Created PVC data-postgres-0 (100Gi) for postgres-0
Created PVC data-postgres-1 (100Gi) for postgres-1
Created PVC data-postgres-2 (100Gi) for postgres-2
PVC for postgres-0 exists: True
Deleted PVC data-postgres-0
PVC for postgres-0 after delete: False
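The simulation above deletes a claim explicitly. By default Kubernetes never does this for you: scaling down removes pods from the highest ordinal, but their claims stay behind. The sketch below lists both; `scale_down_plan` is a hypothetical helper that assumes the default retention behavior and the `<template>-<pod>` claim naming used in this guide.

```python
from typing import Dict, List


def scale_down_plan(name: str, current: int, target: int,
                    claim_templates: List[str]) -> Dict[str, List[str]]:
    """List pods removed by a scale-down and the PVCs left behind."""
    if not 0 <= target <= current:
        raise ValueError("target must be between 0 and the current replicas")
    # Pods are terminated from the highest ordinal downward.
    removed = [f"{name}-{i}" for i in range(current - 1, target - 1, -1)]
    # With default retention, each removed pod's claims remain in the cluster.
    retained = [f"{tpl}-{pod}" for pod in removed for tpl in claim_templates]
    return {"terminated_in_order": removed, "retained_pvcs": retained}


plan = scale_down_plan("postgres", current=5, target=3,
                       claim_templates=["data"])
print(plan["terminated_in_order"])  # ['postgres-4', 'postgres-3']
print(plan["retained_pvcs"])        # ['data-postgres-4', 'data-postgres-3']
```

If the StatefulSet is later scaled back up to 5, the new postgres-3 and postgres-4 pods reattach to these retained claims, which is convenient for recovery but surprising if you expected empty volumes.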
Common Mistakes
1. Using Deployment for Stateful Workloads
Deployments don't guarantee stable pod identities or per-pod storage. A recreated pod gets a new random name, and every replica shares whatever PVC the pod template references instead of getting its own. Use a StatefulSet for databases, queues, and other services whose replicas each own state.
2. Not Using Headless Service
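Without a headless Service (clusterIP: None) that matches spec.serviceName, the pods don't get per-pod DNS records, so peers cannot reach postgres-0 or postgres-1 by a stable name. A StatefulSet relies on its headless Service for stable network identities.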
3. Scaling Down Without Draining
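Scaling a StatefulSet from 5 to 3 deletes pod 4, then pod 3. If these are database nodes, writes that have not yet replicated to the remaining members can be lost or stranded on the retained volumes. Decommission those members and wait for replication to catch up before scaling down. A PodDisruptionBudget protects against evictions such as node drains, but it does not block a scale-down.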
4. Forgetting PVC Retention
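When you delete a StatefulSet, its PVCs remain by default (they're not owned by the StatefulSet). This prevents data loss but also means storage costs continue. Manage the PVC lifecycle explicitly, either by deleting claims yourself or, on clusters that support it, by setting spec.persistentVolumeClaimRetentionPolicy.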
5. Using ReadWriteOnce for Shared Access
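A ReadWriteOnce volume can be mounted read-write by only one node at a time. If pods on several nodes need to share a volume, use ReadWriteMany storage such as NFS or EFS. In a StatefulSet, volumeClaimTemplates give each pod its own RWO volume rather than a shared one.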
6. Ordered Pod Deletion Without Dependencies
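A StatefulSet removes pods in reverse ordinal order (3, 2, 1, 0) when it scales down. Deleting the StatefulSet object itself gives no ordering guarantee, so scale it to 0 first if members must shut down in sequence. If pods depend on each other (e.g., replicas that follow pod-0), handle a clean handoff in preStop hooks.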
7. No StatefulSet-Specific Monitoring
StatefulSet failures often involve storage (PVC pending, volume attachment timeout). Monitor PVC status, volume attachment errors, and pod eviction events separately from stateless deployments.
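As a starting point for the storage checks, the sketch below scans a PVC listing for claims that are not Bound. The sample is hypothetical data shaped like the output of `kubectl get pvc -o json`, trimmed to the two fields the function reads; `unbound_pvcs` is an illustrative helper, not part of any Kubernetes client.

```python
import json
from typing import List, Tuple

# Hypothetical sample, shaped like `kubectl get pvc -o json` output.
SAMPLE = """
{
  "items": [
    {"metadata": {"name": "data-postgres-0"}, "status": {"phase": "Bound"}},
    {"metadata": {"name": "data-postgres-1"}, "status": {"phase": "Pending"}},
    {"metadata": {"name": "data-postgres-2"}, "status": {"phase": "Bound"}}
  ]
}
"""


def unbound_pvcs(pvc_list_json: str) -> List[Tuple[str, str]]:
    """Return (name, phase) for every claim whose phase is not Bound."""
    items = json.loads(pvc_list_json).get("items", [])
    problems = []
    for item in items:
        phase = item.get("status", {}).get("phase", "Unknown")
        if phase != "Bound":
            problems.append((item["metadata"]["name"], phase))
    return problems


print(unbound_pvcs(SAMPLE))  # [('data-postgres-1', 'Pending')]
```

A claim stuck in Pending usually explains why the matching pod is also Pending, so alerting on this list tends to surface StatefulSet problems earlier than pod-level alerts alone.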
Practice Questions
1. How does a StatefulSet differ from a Deployment?
StatefulSet provides stable pod identity (pod-name-index), ordered deployment/scaling, and per-pod persistent storage. Deployment provides identical, interchangeable pods with no guaranteed identity or storage persistence.
2. What is the purpose of the headless service in a StatefulSet?
The headless service (clusterIP: None) enables DNS-based pod discovery. Each pod gets a DNS A record like pod-name.service-name.namespace.svc.cluster.local, resolving directly to the pod's IP.
3. How does rolling update work in StatefulSet?
Pods are updated in reverse ordinal order (largest to smallest). Each pod is terminated and recreated with the new spec, and must be Running and Ready before the next one is updated. With partition, only pods whose ordinal is greater than or equal to the partition value are updated, which enables canary deployments.
4. What happens to PVCs when a StatefulSet is scaled down?
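PVCs are not deleted when the StatefulSet is scaled down; by default they remain in the cluster to preserve data. To remove them, delete the PVCs manually or, on clusters that support it, set spec.persistentVolumeClaimRetentionPolicy.whenScaled to Delete. Note that kubectl delete sts --cascade=orphan only leaves the pods running; it does not clean up PVCs.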
5. Challenge: Design a StatefulSet-based Cassandra cluster.
Cassandra needs each node to have a unique identity (for gossip protocol), persistent storage, and ordered bootstrap (first node seeds the cluster). Design the StatefulSet configuration with: headless service, volumeClaimTemplates, ordered pod management, initial readiness check that waits for the seed node, and update strategy with partition for rolling upgrades.
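One possible starting point for the challenge is sketched below: a Python function that builds the StatefulSet manifest as a dict and prints it as JSON, which kubectl accepts as well as YAML. Treat every concrete value as an assumption: the image tag, the `CASSANDRA_SEEDS` variable (assumed to be honored by the image), the `default` namespace, the probe timings, and the storage class are placeholders, and the headless Service named `cassandra` would have to be created separately.

```python
import json


def cassandra_statefulset(replicas: int = 3, partition: int = 0) -> dict:
    """Build a StatefulSet manifest as a dict (placeholder names and sizes)."""
    name = "cassandra"
    labels = {"app": name}
    # Seed address: the stable DNS name of ordinal 0 (namespace assumed).
    seed = f"{name}-0.{name}.default.svc.cluster.local"
    container = {
        "name": name,
        "image": "cassandra:4.1",  # Placeholder tag
        "env": [{"name": "CASSANDRA_SEEDS", "value": seed}],
        "ports": [{"containerPort": 9042, "name": "cql"}],
        # A pod counts as Ready once the CQL port accepts connections.
        "readinessProbe": {
            "tcpSocket": {"port": 9042},
            "initialDelaySeconds": 30,
            "periodSeconds": 10,
        },
        "volumeMounts": [{"name": "data", "mountPath": "/var/lib/cassandra"}],
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": name},
        "spec": {
            "serviceName": name,  # Headless Service, created separately
            "replicas": replicas,
            # OrderedReady keeps cassandra-1 from starting before the seed
            # pod cassandra-0 has passed its readiness probe.
            "podManagementPolicy": "OrderedReady",
            "updateStrategy": {
                "type": "RollingUpdate",
                "rollingUpdate": {"partition": partition},
            },
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
            "volumeClaimTemplates": [{
                "metadata": {"name": "data"},
                "spec": {
                    "accessModes": ["ReadWriteOnce"],
                    "storageClassName": "fast-ssd",  # Placeholder class
                    "resources": {"requests": {"storage": "100Gi"}},
                },
            }],
        },
    }


print(json.dumps(cassandra_statefulset(partition=2), indent=2))
```

Here the "wait for the seed" requirement is met by combining OrderedReady with a readiness probe rather than by a custom init script; a production setup would likely need a stricter probe and tuned resource requests.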
Mini Project: StatefulSet Cluster Manager
class ClusterNode:
    def __init__(self, ordinal: int, cluster_size: int, seed: bool = False):
        self.ordinal = ordinal
        self.name = f"node-{ordinal}"
        self.cluster_size = cluster_size
        self.seed = seed
        self.data = {}
        self.ready = False

    def join_cluster(self, seed_node):
        print(f" {self.name} joining via {seed_node.name}")
        self.ready = True

    def __repr__(self):
        return (f"{self.name} (seed={self.seed}, "
                f"ready={self.ready})")


class StatefulCluster:
    def __init__(self, name: str, replicas: int):
        self.name = name
        self.nodes = []
        self.seed_node = None
        self.scale_to(replicas)

    def scale_to(self, n: int):
        if n > len(self.nodes):
            # Scale up: ordinal 0 becomes the seed node.
            for i in range(len(self.nodes), n):
                is_seed = (i == 0)
                node = ClusterNode(i, n, seed=is_seed)
                self.nodes.append(node)
                if is_seed:
                    self.seed_node = node
        elif n < len(self.nodes):
            # Scale down: drop the highest ordinals.
            self.nodes = self.nodes[:n]
        print(f"\nCluster scaled to {n} nodes:")
        self.bootstrap()

    def bootstrap(self):
        # Nodes come up in ordinal order; non-seed nodes join via the seed.
        for node in self.nodes:
            if node.seed:
                node.ready = True
                print(f" {node.name} bootstrapped (seed)")
            elif self.seed_node:
                node.join_cluster(self.seed_node)


cluster = StatefulCluster("cassandra", 3)
print("\nNodes:")
for n in cluster.nodes:
    print(f" {n}")
Expected output:
Cluster scaled to 3 nodes:
 node-0 bootstrapped (seed)
 node-1 joining via node-0
 node-2 joining via node-0

Nodes:
 node-0 (seed=True, ready=True)
 node-1 (seed=False, ready=True)
 node-2 (seed=False, ready=True)
What's Next
Congratulations on completing this StatefulSets guide! Here's where to go from here:
- Practice daily — Deploy a StatefulSet with a database
- Build a project — Set up a Cassandra or PostgreSQL cluster on StatefulSets
- Explore related topics — Operator pattern for databases, volume snapshots, backup/restore with Velero
- Join the community — Share your StatefulSet configurations and get feedback
Remember: every expert was once a beginner. Keep stateful!
Built by the developers of DodaTech