Kubernetes Troubleshooting: Debugging Pods, Nodes & Networking
This tutorial walks through Kubernetes troubleshooting for pods, nodes, networking, and the control plane, with key concepts, practical commands, and best practices you can apply during an incident.
Kubernetes troubleshooting requires systematic investigation of pod states, node health, network connectivity, and control plane components to identify and resolve cluster issues.
What You'll Learn
This tutorial covers debugging pods in CrashLoopBackOff, ImagePullBackOff, and Pending states, investigating node failures, diagnosing DNS and service connectivity problems, troubleshooting network policies, and checking etcd and API server health.
Why It Matters
Production outages in Kubernetes require rapid diagnosis. Engineers who master troubleshooting techniques can cut mean time to resolution substantially, in some cases from hours to minutes.
Real-World Use
SRE teams that run Kubernetes at scale rely on systematic debugging approaches like the ones in this tutorial, and many incidents can be resolved by reading container logs and pod states alone.
Debugging Pods
CrashLoopBackOff
When a pod enters CrashLoopBackOff, the container starts and crashes repeatedly.
# Check pod status
kubectl get pods
# View logs from the last attempt
kubectl logs my-pod --previous
# Check pod events
kubectl describe pod my-pod
# Stream live logs
kubectl logs my-pod -f
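Common causes include missing environment variables, failing liveness or startup probes (a failing readiness probe only removes the pod from service endpoints and does not restart the container), out-of-memory kills, and configuration file errors.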
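To narrow down which cause applies, the exit code of the last terminated container is often the quickest hint. The sketch below is a hypothetical helper (`explain_exit_code` is not part of kubectl) that maps a few conventional exit codes to likely causes. It assumes the usual shell convention that codes above 128 mean 128 plus a signal number, and it is a heuristic, not a definitive diagnosis.

```shell
# Hypothetical helper: translate a container exit code into a likely cause.
# Codes above 128 conventionally mean "killed by signal (code - 128)".
explain_exit_code() {
  code="$1"
  case "$code" in
    0)   echo "exit 0: container exited cleanly (check restartPolicy and the command)" ;;
    1)   echo "exit 1: application error (check logs with --previous)" ;;
    126) echo "exit 126: command found but not executable" ;;
    127) echo "exit 127: command not found (check image and entrypoint)" ;;
    137) echo "exit 137: SIGKILL, often an out-of-memory kill" ;;
    143) echo "exit 143: SIGTERM, container was asked to stop" ;;
    *)
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "exit $code: killed by signal $((code - 128))"
      else
        echo "exit $code: application-specific error (check logs)"
      fi
      ;;
  esac
}

# Example (needs a cluster; my-pod is the placeholder pod name used above):
#   code=$(kubectl get pod my-pod -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')
#   explain_exit_code "$code"
```

The jsonpath in the comment reads the first container only; pods with several containers would need the index adjusted.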
ImagePullBackOff
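ImagePullBackOff means the kubelet cannot pull the container image, typically because of a wrong image name or tag, missing registry credentials, or an unreachable registry.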
# Check the error message
kubectl describe pod my-pod | grep -A 5 "Failed to pull image"
# Verify image name and tag
kubectl get pod my-pod -o jsonpath='{.spec.containers[0].image}'
# Check image pull policy
kubectl get pod my-pod -o yaml | grep imagePullPolicy
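When verifying the image name and tag, it helps to see which tag the reference actually resolves to: a reference without a tag defaults to `latest`, which also makes the default pull policy `Always`. The following is a minimal sketch of a hypothetical helper, `image_tag`, that extracts the tag from an image reference. The image names in the comments are illustrative only.

```shell
# Hypothetical helper: print the tag of an image reference.
# Handles registries with a port (localhost:5000/app) and digest references.
image_tag() {
  ref="$1"
  case "$ref" in
    *@*) echo "digest"; return ;;      # pinned by digest, no tag involved
  esac
  last="${ref##*/}"                     # drop registry and repository path
  case "$last" in
    *:*) echo "${last##*:}" ;;          # explicit tag after the last colon
    *)   echo "latest (implicit)" ;;    # no tag given, defaults to latest
  esac
}

# Example (needs a cluster):
#   image_tag "$(kubectl get pod my-pod -o jsonpath='{.spec.containers[0].image}')"
```

An implicit `latest` is worth flagging because the image behind that tag can change between pulls.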
Pending Pods
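Pods stuck in Pending usually have not been scheduled onto a node, most often because of insufficient resources, exhausted quotas, node taints, or unsatisfied node selectors.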
# Check scheduling events
kubectl describe pod my-pod | grep -A 10 Events
# Check node resource usage (requires metrics-server to be installed)
kubectl top nodes
# Check resource quotas
kubectl describe quota -n production
# Check taints on nodes
kubectl describe nodes | grep Taints
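The scheduler's FailedScheduling event message usually says which of these checks to follow up on. As a rough sketch, the hypothetical helper below (`classify_scheduling_message`) sorts a message into a category by keyword. The exact wording of scheduler messages varies between Kubernetes versions, so the patterns are assumptions to adjust for your cluster.

```shell
# Hypothetical helper: map a FailedScheduling message to a rough category.
# If several reasons appear in one message, only the first match is reported.
classify_scheduling_message() {
  case "$1" in
    *Insufficient*)            echo "resources" ;;
    *taint*)                   echo "taints" ;;
    *affinity*|*selector*)     echo "affinity-or-selector" ;;
    *PersistentVolumeClaim*)   echo "storage" ;;
    *)                         echo "unknown" ;;
  esac
}

# Example (needs a cluster):
#   kubectl get events --field-selector reason=FailedScheduling \
#     -o jsonpath='{range .items[*]}{.message}{"\n"}{end}' |
#     while IFS= read -r msg; do classify_scheduling_message "$msg"; done
```

A "resources" result points at `kubectl top nodes` and quotas, while "taints" points at the taint listing above.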
Debugging Nodes
Node NotReady
# List node status
kubectl get nodes
# Describe the unhealthy node
kubectl describe node worker-3
# Check kubelet logs on the node
journalctl -u kubelet -n 100 --no-pager
# Check node conditions and their current status
kubectl get node worker-3 -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
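On a healthy node only the `Ready` condition is `True`; the pressure conditions and `NetworkUnavailable` should be `False`. The sketch below assumes the conditions are printed one per line as `Type=Status` (for example with a jsonpath `range` expression) and uses a hypothetical helper, `unhealthy_conditions`, to print only the ones that look wrong.

```shell
# Hypothetical helper: read "Type=Status" lines on stdin and print the
# conditions that deviate from a healthy node.
unhealthy_conditions() {
  while IFS='=' read -r type status; do
    [ -n "$type" ] || continue
    if [ "$type" = "Ready" ]; then
      # Ready must be True; False or Unknown means the node is NotReady.
      [ "$status" = "True" ] || echo "$type=$status"
    else
      # MemoryPressure, DiskPressure, PIDPressure, NetworkUnavailable must be False.
      [ "$status" = "False" ] || echo "$type=$status"
    fi
  done
}

# Example (needs a cluster; worker-3 is the placeholder node name used above):
#   kubectl get node worker-3 \
#     -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' |
#     unhealthy_conditions
```

Empty output means every reported condition has its healthy value.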
Node Resource Pressure
# Check memory, disk, and PID pressure conditions
kubectl describe node worker-3 | grep -i pressure
# View disk usage on the node
df -h
# Check container runtime disk usage
du -sh /var/lib/containerd/
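To spot a filesystem that is close to triggering disk pressure, the `df` output can be filtered by usage. This is a minimal sketch with a hypothetical helper, `full_filesystems`. The default limit of 85 percent is an arbitrary assumption chosen to warn before the kubelet's default hard eviction threshold of less than 10 percent available on the node filesystem.

```shell
# Hypothetical helper: read `df -P` output on stdin and print mount points
# whose usage is at or above the given percentage (default 85).
# Sketch only: mount points containing spaces are not handled.
full_filesystems() {
  limit="${1:-85}"
  awk -v limit="$limit" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)              # "91%" -> "91"
      if (use + 0 >= limit + 0) print $6, $5
    }'
}

# Example (run on the node):
#   df -P | full_filesystems 85
```

The `-P` flag keeps each filesystem on one line, which makes the column positions predictable.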
Debugging Networking
DNS Resolution
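# Test DNS from within the cluster (busybox:1.28 is pinned because nslookup in newer busybox images is known to misbehave)
kubectl run dns-test --image=busybox:1.28 --restart=Never --rm -it -- nslookup kubernetes.default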
# Check CoreDNS pods
kubectl -n kube-system get pods -l k8s-app=kube-dns
# View CoreDNS logs
kubectl -n kube-system logs deployment/coredns
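Short names such as `kubernetes.default` only resolve through the search domains in the pod's `/etc/resolv.conf`. Testing the fully qualified name as well helps separate a search-path problem from a CoreDNS problem. The sketch below builds that name with a hypothetical helper, `service_fqdn`, assuming the default cluster domain `cluster.local`, which may be different in your cluster.

```shell
# Hypothetical helper: build the fully qualified DNS name of a Service.
# Arguments: service name, namespace (default "default"),
# cluster domain (default "cluster.local", an assumption).
service_fqdn() {
  svc="$1"
  ns="${2:-default}"
  domain="${3:-cluster.local}"
  echo "$svc.$ns.svc.$domain"
}

# Example (needs a cluster):
#   kubectl run dns-test --image=busybox:1.28 --restart=Never --rm -it -- \
#     nslookup "$(service_fqdn kubernetes)"
```

If the fully qualified name resolves but the short name does not, the search domains or `ndots` setting in the pod's resolver configuration are the next thing to inspect.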
Service Connectivity
# Check service endpoints
kubectl get endpoints my-service
# Test DNS resolution and HTTP connectivity to the service
kubectl run test --image=busybox --rm -it -- wget -O- http://my-service:8080
# Test with a debug pod
kubectl run debug --image=nicolaka/netshoot -it --rm -- curl my-service:8080
Network Policy Blocking
# List network policies
kubectl get networkpolicies -A
# Probe the service port from a test pod to see whether traffic is allowed
kubectl run test --image=nicolaka/netshoot -it --rm -- nmap -p 8080 my-service
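The port state that nmap reports gives a hint about what is blocking traffic. As a rough sketch, the hypothetical helper `explain_port_state` below reads nmap's normal output and explains the state of one TCP port. The interpretation is a heuristic: how a blocked connection appears depends on the CNI plugin, so treat "filtered" as a reason to review the policies, not as proof.

```shell
# Hypothetical helper: read nmap output on stdin and explain the state
# reported for one TCP port (lines look like "8080/tcp filtered http-proxy").
explain_port_state() {
  port="$1"
  state=$(awk -v p="$port/tcp" '$1 == p { print $2; exit }')
  case "$state" in
    open)     echo "open: traffic is allowed and something is listening" ;;
    closed)   echo "closed: traffic arrives but the connection is refused" ;;
    filtered) echo "filtered: packets are dropped, possibly by a NetworkPolicy" ;;
    *)        echo "unknown: port $port not found in the scan output" ;;
  esac
}

# Example (needs a cluster):
#   kubectl run test --image=nicolaka/netshoot -it --rm -- \
#     nmap -p 8080 my-service | explain_port_state 8080
```

A "closed" result often points at a Service without ready endpoints, which `kubectl get endpoints` can confirm.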
Debugging the Control Plane
etcd Health
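# Note: etcd-master is the static pod name on a node called "master"; the endpoint and certificate paths below are kubeadm defaults and may differ in your cluster
# Check etcd endpoints
kubectl -n kube-system exec etcd-master -- etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key endpoint health
# Check etcd member list
kubectl -n kube-system exec etcd-master -- etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key member list
# Check etcd leader (see the IS LEADER column)
kubectl -n kube-system exec etcd-master -- etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key endpoint status -w table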
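The member list matters because etcd only accepts writes while a majority of members is healthy. The arithmetic can be sketched with a hypothetical helper, `etcd_quorum`, which takes the number of members shown by `etcdctl member list` and prints the quorum size and how many member failures the cluster can survive.

```shell
# Hypothetical helper: compute quorum and failure tolerance for an etcd
# cluster with the given number of members.
etcd_quorum() {
  members="$1"
  quorum=$(( members / 2 + 1 ))        # strict majority (integer division)
  tolerated=$(( members - quorum ))    # members that may fail without losing quorum
  echo "members=$members quorum=$quorum tolerated_failures=$tolerated"
}

# Example:
#   etcd_quorum 3
```

Note that going from three to four members raises the quorum without raising the failure tolerance, which is why odd member counts are the usual choice.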
API Server
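# Check API server health (/healthz still works but is deprecated in favor of /livez and /readyz)
kubectl get --raw='/readyz?verbose'
# Check component statuses (deprecated since Kubernetes v1.19 and may report stale data)
kubectl get componentstatuses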
# View API server logs and filter for errors (audit logs are written separately, if auditing is enabled)
kubectl -n kube-system logs kube-apiserver-master | grep -i error
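The API server's `/readyz?verbose` endpoint reports each internal check on its own line, prefixed with `[+]` when it passes and `[-]` when it fails. The sketch below uses a hypothetical helper, `failed_checks`, to pull out only the failing check names. It assumes the plain-text body reaches stdout unchanged, which may not hold when kubectl wraps a failed response in an error message.

```shell
# Hypothetical helper: read verbose health output on stdin and print the
# names of checks that did not pass (lines starting with "[-]").
failed_checks() {
  sed -n 's/^\[-\]\([^ ]*\).*/\1/p'
}

# Example (needs a cluster):
#   kubectl get --raw='/readyz?verbose' 2>&1 | failed_checks
```

Empty output means no check was reported as failing in the text that was read.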
General Debugging Commands
# Get all events sorted by time
kubectl get events --sort-by=.lastTimestamp
# Watch all events in a namespace
kubectl get events -n production -w
# Dump cluster state and pod logs to a directory for offline analysis
kubectl cluster-info dump --output-directory=./cluster-dump
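On a busy cluster the event stream is long, so summarizing warnings by reason can show where to look first. This is a minimal sketch with a hypothetical helper, `warning_counts`, that assumes events are printed as two columns, type and reason, for example through kubectl's `custom-columns` output.

```shell
# Hypothetical helper: read "TYPE REASON" pairs on stdin and print how often
# each Warning reason occurs, most frequent first.
warning_counts() {
  awk '$1 == "Warning" { n[$2]++ } END { for (r in n) print n[r], r }' | sort -rn
}

# Example (needs a cluster):
#   kubectl get events -A --no-headers \
#     -o custom-columns=TYPE:.type,REASON:.reason | warning_counts
```

Reasons such as BackOff or FailedScheduling then lead back to the pod sections earlier in this tutorial.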
Practice Questions
How do you view logs from a crashed container? Use kubectl logs pod-name --previous to see logs from the previous container instance.
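What causes CrashLoopBackOff? The container starts and crashes repeatedly, and the kubelet increases the backoff delay after each failure. Typical triggers are application errors, bad configuration, failing liveness probes, and out-of-memory kills.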
How do you check if DNS resolution is working in the cluster? Run an nslookup from a test pod or check CoreDNS pod logs.
What command checks if a service has healthy endpoints? kubectl get endpoints service-name shows the list of pod IPs backing the service.
How do you diagnose a node that is NotReady? Check kubelet logs, node conditions, disk pressure, and control plane connectivity from the node.