Docker Restart Policies & Health Checks — Keep Containers Running
In this tutorial, you'll learn about Docker Restart Policies & Health Checks. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.
Docker restart policies and health checks define how containers recover from failures and how Docker determines whether a container is actually serving traffic, not just running.
What You'll Learn
You'll master Docker's restart policies (no, always, on-failure, unless-stopped), the HEALTHCHECK instruction, start period for graceful startup, and automated recovery patterns for production.
Why This Problem Matters
Containers crash. Memory leaks kill processes. Network partitions drop connections. Without restart policies, a crashed container stays dead until someone manually restarts it. Health checks ensure your load balancer only sends traffic to containers that respond correctly.
Real-World Use
Durga Antivirus Pro uses Docker restart policies with health checks for its scanning worker containers. When a scan worker hits a memory limit and crashes, the on-failure:5 policy restarts it automatically. Health checks ensure the worker only receives jobs after it finishes loading virus signatures.
Restart Policy Behavior
flowchart TB
Start[Container Starts] --> Run[Running]
Run -->|Process exits| Policy{Restart policy}
Policy -->|no| No[Do nothing]
Policy -->|on-failure| Check{Exit Code}
Check -->|0: Clean exit| No
Check -->|Non-zero: Error| Count{Restart count < max?}
Count -->|yes| Delay[Wait delay: 100ms → 200ms → 400ms]
Delay --> Start
Count -->|no| Stop[Stop permanently]
Policy -->|always / unless-stopped| Manual{Manually stopped?}
Manual -->|no| Delay
Manual -->|yes, always| DaemonRestart[Restart when the daemon restarts]
Manual -->|yes, unless-stopped| ManualStop[Do not restart]
Restart Policies in Action
# No restart (default)
docker run --restart no alpine echo "one shot"
# Always restart (service containers)
docker run -d --name web --restart always nginx
# Restart on failure (up to 5 times)
docker run -d --name worker --restart on-failure:5 myapp
# Unless manually stopped
docker run -d --name db --restart unless-stopped postgres
import time
import docker

client = docker.from_env()

# Create a container that exits with an error on every start
container = client.containers.run(
    "alpine",
    "sh -c 'exit 1'",
    restart_policy={"Name": "on-failure", "MaximumRetryCount": 3},
    detach=True,
)

# Give Docker time to exhaust the retries, then check the restart count
time.sleep(5)
container.reload()
print(f"Status: {container.status}")
print(f"Restart count: {container.attrs['RestartCount']}")
container.remove(force=True)
Expected output:
Status: exited
Restart count: 3
Exit code 1 triggers a restart. The container restarts up to 3 times, then stays stopped, so the final status is exited rather than running.
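The decision Docker makes after each exit can be sketched in plain Python. This is an illustrative model, not Docker's implementation: the function names are hypothetical, treating max_retries=0 as "no limit" is an assumption of the sketch, and the one-minute cap on the delay reflects the documented backoff behavior as I understand it.

```python
def backoff_delays(restarts: int, base_ms: int = 100, cap_ms: int = 60_000) -> list[int]:
    """Delay before each restart: doubles from base_ms and never exceeds cap_ms."""
    return [min(base_ms * 2 ** n, cap_ms) for n in range(restarts)]


def should_restart(policy: str, exit_code: int, restart_count: int = 0,
                   max_retries: int = 0, manually_stopped: bool = False) -> bool:
    """Decide whether a container is restarted after its main process exits."""
    if manually_stopped or policy == "no":
        return False
    if policy in ("always", "unless-stopped"):
        return True  # the exit code is irrelevant for these two policies
    if policy == "on-failure":
        if exit_code == 0:
            return False  # clean exit, nothing to recover from
        # Assumption of this sketch: max_retries == 0 means "no limit"
        return max_retries == 0 or restart_count < max_retries
    raise ValueError(f"unknown restart policy: {policy}")


print(backoff_delays(5))
print(should_restart("on-failure", exit_code=1, restart_count=3, max_retries=3))
```

The first line printed is [100, 200, 400, 800, 1600] and the second is False: after three restarts, an on-failure:3 container is left stopped.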
Health Check Basics
# Single health check example
FROM node:20-alpine
HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
CMD wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1
EXPOSE 3000
CMD ["node", "server.js"]
# Check health status
docker ps --format "table {{.Names}}\t{{.Status}}"
# Inspect health check results
docker inspect --format='{{json .State.Health}}' web | jq
Expected output:
NAMES STATUS
web Up 2 minutes (healthy)
{
  "Status": "healthy",
  "FailingStreak": 0,
  "Log": [
    {
      "Start": "2026-06-24T10:00:05Z",
      "End": "2026-06-24T10:00:05Z",
      "ExitCode": 0,
      "Output": "200 OK"
    }
  ]
}
Custom Health Check Script
import http.server
import json
import threading
import time


class HealthHandler(http.server.BaseHTTPRequestHandler):
    # Flipped to True once the simulated startup work has finished
    healthy = False

    def do_GET(self):
        if self.path == "/health":
            if HealthHandler.healthy:
                self.send_response(200)
                self.send_header("Content-Type", "application/json")
                self.end_headers()
                self.wfile.write(json.dumps({"status": "ok"}).encode())
            else:
                self.send_response(503)
                self.end_headers()
        else:
            self.send_response(200)
            self.end_headers()


def simulate_startup():
    # Pretend the application needs 5 seconds to initialize
    time.sleep(5)
    HealthHandler.healthy = True


threading.Thread(target=simulate_startup, daemon=True).start()
server = http.server.HTTPServer(("0.0.0.0", 8000), HealthHandler)
server.serve_forever()
FROM python:3.12-slim
COPY healthcheck.py /app/healthcheck.py
HEALTHCHECK --interval=5s --timeout=2s --start-period=8s --retries=2 \
CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
CMD ["python", "/app/healthcheck.py"]
During the start period (8s), the health check runs but failures don't count toward retries. This allows the application to initialize before being considered unhealthy.
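The interaction between the start period and the retry counter is easier to see in a small simulation. The following is a simplified model rather than Docker's own code: simulate_health is a hypothetical helper, and it assumes checks run at t = interval, 2 × interval, and so on.

```python
def simulate_health(results, interval, start_period, retries):
    """Return (time, status) after each check.

    results: list of booleans, one per check (True means the command exited 0).
    Assumes checks run at t = interval, 2 * interval, ... seconds.
    """
    status = "starting"
    failing_streak = 0
    started = False  # becomes True after the first successful check
    history = []
    for n, ok in enumerate(results, start=1):
        t = n * interval
        if ok:
            started = True
            failing_streak = 0
            status = "healthy"
        elif started or t > start_period:
            # Failures count once the app has reported healthy
            # or the start period has expired.
            failing_streak += 1
            if failing_streak >= retries:
                status = "unhealthy"
        history.append((t, status))
    return history


for t, status in simulate_health([False, False, False, True],
                                 interval=5, start_period=8, retries=2):
    print(f"t={t:>2}s  {status}")
```

With the settings from the Dockerfile above, the failure at 5s falls inside the start period and is ignored, the failures at 10s and 15s reach the retry limit and mark the container unhealthy, and the success at 20s returns it to healthy.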
Application-Level Health Endpoint
from flask import Flask, jsonify
import redis
import psycopg2

app = Flask(__name__)


@app.route("/health")
def health():
    health_status = {"status": "healthy", "checks": {}}

    # Check Redis
    try:
        r = redis.Redis(host="redis", port=6379, socket_connect_timeout=2)
        r.ping()
        health_status["checks"]["redis"] = "ok"
    except Exception as e:
        health_status["checks"]["redis"] = f"unhealthy: {str(e)}"
        health_status["status"] = "unhealthy"

    # Check PostgreSQL
    try:
        conn = psycopg2.connect(
            host="db", dbname="myapp",
            user="user", password="pass",
            connect_timeout=2
        )
        cur = conn.cursor()
        cur.execute("SELECT 1")
        cur.close()
        conn.close()
        health_status["checks"]["database"] = "ok"
    except Exception as e:
        health_status["checks"]["database"] = f"unhealthy: {str(e)}"
        health_status["status"] = "unhealthy"

    status_code = 200 if health_status["status"] == "healthy" else 503
    return jsonify(health_status), status_code


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
curl http://localhost:5000/health | python3 -m json.tool
Expected output:
{
"status": "healthy",
"checks": {
"redis": "ok",
"database": "ok"
}
}
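The endpoint repeats the same try/except shape for every dependency. One possible way to factor that out, shown here without Flask so it runs on its own, is a helper that takes named check functions. run_checks, redis_ok, and database_down are hypothetical names used only for illustration; the two check functions stand in for r.ping() and the SELECT 1 query.

```python
import json


def run_checks(checks):
    """Run each named check and return (body, http_status)."""
    body = {"status": "healthy", "checks": {}}
    for name, check in checks.items():
        try:
            check()  # a check signals failure by raising
            body["checks"][name] = "ok"
        except Exception as exc:
            body["checks"][name] = f"unhealthy: {exc}"
            body["status"] = "unhealthy"
    return body, (200 if body["status"] == "healthy" else 503)


def redis_ok():
    """Hypothetical stand-in for r.ping() succeeding."""


def database_down():
    """Hypothetical stand-in for a failing SELECT 1."""
    raise ConnectionError("connection refused")


body, code = run_checks({"redis": redis_ok, "database": database_down})
print(code, json.dumps(body))
```

A single failing dependency turns the whole response into a 503, which matches the behavior of the Flask endpoint.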
Restart Policy with Compose
# docker-compose.yml
services:
  web:
    image: nginx
    restart: always
    healthcheck:
      test: ["CMD", "wget", "--spider", "http://localhost"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 10s
  worker:
    image: myapp/worker
    restart: on-failure:5
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:9090/ready || exit 1"]
      interval: 15s
      start_period: 30s
  db:
    image: postgres:16
    restart: unless-stopped
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5
Automated Recovery with Docker Events
import docker
import json

client = docker.from_env()


def handle_event(event):
    # Health events carry the new state in the action itself,
    # e.g. "health_status: unhealthy", so match on the prefix.
    action = event.get("Action", "")
    if event.get("Type") == "container" and action.startswith("health_status"):
        container_id = event["Actor"]["ID"]
        container = client.containers.get(container_id)
        health = container.attrs.get("State", {}).get("Health", {})
        if health.get("Status") == "unhealthy":
            print(f"Container {container.name} is unhealthy. Restarting...")
            container.restart()
            # Log the incident
            log_entry = {
                "container": container.name,
                "action": "restarted",
                "reason": "health_check_failure",
                "failing_streak": health.get("FailingStreak")
            }
            with open("/var/log/container-recovery.json", "a") as f:
                f.write(json.dumps(log_entry) + "\n")


# Monitor events
for event in client.events(decode=True):
    handle_event(event)
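A recovery loop like this can restart the same container repeatedly if it never becomes healthy. The sketch below separates the two decisions so they can be tested without a Docker daemon: recognizing an unhealthy event, and rate-limiting restarts per container. Both names are hypothetical, and the "health_status: unhealthy" action format is an assumption about what the events API reports.

```python
import time


def unhealthy_container_id(event):
    """Return the container ID if the event reports an unhealthy container, else None."""
    if event.get("Type") != "container":
        return None
    action = event.get("Action", "")
    # Assumed format: "health_status: unhealthy" or "health_status: healthy"
    if action.startswith("health_status") and action.endswith("unhealthy"):
        return event.get("Actor", {}).get("ID")
    return None


class RestartLimiter:
    """Allow at most one restart per container within `cooldown` seconds."""

    def __init__(self, cooldown=60.0, clock=time.monotonic):
        self.cooldown = cooldown
        self.clock = clock
        self.last_restart = {}

    def allow(self, container_id):
        now = self.clock()
        last = self.last_restart.get(container_id)
        if last is not None and now - last < self.cooldown:
            return False  # restarted too recently, skip this event
        self.last_restart[container_id] = now
        return True


limiter = RestartLimiter(cooldown=60)
event = {"Type": "container", "Action": "health_status: unhealthy", "Actor": {"ID": "abc123"}}
cid = unhealthy_container_id(event)
print(cid, limiter.allow(cid), limiter.allow(cid))
```

This prints abc123 True False: the first unhealthy event is allowed to trigger a restart, and a second one inside the cooldown window is skipped.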
Common Mistakes
1. Using --restart always for One-Shot Jobs
A batch job started with --restart always restarts forever, even when it exits with code 0. Use --restart no for jobs, or --restart on-failure if a retry on error is desired.
2. Ignoring Start Period
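Without --start-period, failures count from the first check. A container that takes 30 seconds to initialize fails 6 consecutive checks at a 5s interval, so it is marked unhealthy well before it is ready. Docker on its own only reports that status, but Swarm or a recovery script that restarts unhealthy containers will then kill it mid-startup, over and over.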
3. Health Check Commands Without Timeouts
A health check command that hangs (e.g., curl without --connect-timeout or --max-time) holds the check open until Docker's --timeout expires (30s by default), which delays failure detection. Set a short --timeout and a timeout on the command itself.
4. No Health Check on Database Containers
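Without a health check, dependent services cannot tell when the database is ready to accept connections. Using pg_isready or mysqladmin ping as the health check lets them wait for it (in Compose, depends_on with condition: service_healthy), preventing the "app starts before DB" race condition.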
5. Overlapping Restart and Swarm/Orchestrator Policies
If Docker Swarm or Kubernetes manages restart policies, don't also set Docker restart policies. The orchestrator's restart handling conflicts with Docker's.
6. Health Check Port vs Application Port
A health check that probes a different port than the application port gives false positives. The web server may be running on port 80 but the application on port 3000 may be unresponsive.
7. Too Frequent Health Checks
--interval=1s on 100 containers creates 100 health checks per second on the host. The default 30s interval is a reasonable baseline per container. For higher-frequency checks, use external monitoring.
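The timeout advice in mistake 3 can also be applied inside the probe itself. The following is a minimal sketch of a probe that could back a HEALTHCHECK in a Python image. probe is a hypothetical helper, and the URL and port in the demo call are placeholders.

```python
import http.client
import urllib.error
import urllib.request


def probe(url, timeout=2.0):
    """Return 0 if url answers with a 2xx status within `timeout` seconds, else 1."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 0 if 200 <= response.status < 300 else 1
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Covers refused connections, timeouts, HTTP errors such as 503,
        # broken responses, and malformed URLs.
        return 1


# Inside a container the result would be passed to sys.exit() so that
# Docker sees exit code 0 (healthy) or 1 (unhealthy).
print(probe("http://127.0.0.1:9/health", timeout=0.5))
```

The demo call prints 1 when nothing is listening on that port. Because the timeout is enforced by the probe, a hung endpoint fails the check quickly instead of waiting for Docker's own --timeout.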
Practice Questions
1. What is the difference between always and unless-stopped?
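Both restart the container when it exits, and both start it again when the Docker daemon restarts. The difference shows after a manual stop: with unless-stopped, a container you stopped with docker stop stays stopped, even across a daemon restart. With always, a manually stopped container starts again the next time the daemon restarts.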
2. How does the start period affect health checks?
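During the start period, health check failures don't count toward retries. The container is assumed to be starting up. If a check succeeds during the start period, the container is considered started and later failures count immediately. After the start period expires, every failure counts.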
3. What exit codes trigger on-failure restart?
Any non-zero exit code. Exit code 0 (clean exit) does not trigger restart. Exit codes 1-255 trigger the restart policy.
4. How do you view health check logs?
docker inspect --format='{{json .State.Health}}' <container> shows the current status, the failing streak, and a log of the most recent checks (Docker keeps the last five), including output, exit codes, and timestamps.
5. Challenge: Design a graceful shutdown pattern for health checks.
When draining traffic from a container (e.g., during rolling update), the health endpoint should return 503 so the load balancer removes the container from rotation. After a cooldown period, the container receives SIGTERM. Implement this with a health check endpoint that becomes unhealthy on signal.
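The answer to question 1 comes down to what happens when the daemon itself restarts. A small truth table makes the difference concrete. The function name is hypothetical, and on-failure is deliberately left out because this sketch only covers the policies compared in that question.

```python
def starts_after_daemon_restart(policy: str, manually_stopped: bool) -> bool:
    """Whether the container is started again when the Docker daemon restarts."""
    if policy == "always":
        return True  # a manual stop is forgotten once the daemon restarts
    if policy == "unless-stopped":
        return not manually_stopped  # a manual stop is remembered
    if policy == "no":
        return False
    raise ValueError(f"policy not covered by this sketch: {policy}")


for policy in ("always", "unless-stopped", "no"):
    for stopped in (False, True):
        result = starts_after_daemon_restart(policy, stopped)
        print(f"{policy:<15} manually_stopped={stopped!s:<5} -> {result}")
```

The only row where the two policies disagree is manually_stopped=True: always starts the container again, unless-stopped leaves it stopped.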
Mini Project: Health Check Tester
import http.server
import json
import signal
import threading
import time


class GracefulShutdownHandler(http.server.BaseHTTPRequestHandler):
    shutting_down = False
    uptime = 0

    def do_GET(self):
        if self.path == "/health":
            if GracefulShutdownHandler.shutting_down:
                self.send_response(503)
                self.end_headers()
                self.wfile.write(b"draining")
            else:
                self.send_response(200)
                self.send_header("Content-Type", "application/json")
                self.end_headers()
                self.wfile.write(json.dumps({
                    "status": "healthy",
                    "uptime_seconds": GracefulShutdownHandler.uptime
                }).encode())
        else:
            self.send_response(404)
            self.end_headers()


def handle_sigterm(signum, frame):
    print("Received SIGTERM, entering drain mode...", flush=True)
    GracefulShutdownHandler.shutting_down = True
    # Keep answering /health with 503 for 5 seconds, then stop the server.
    # shutdown() must be called from another thread, so a timer calls it.
    threading.Timer(5, server.shutdown).start()


def uptime_counter():
    while True:
        time.sleep(1)
        GracefulShutdownHandler.uptime += 1


server = http.server.HTTPServer(("0.0.0.0", 8000), GracefulShutdownHandler)
signal.signal(signal.SIGTERM, handle_sigterm)
threading.Thread(target=uptime_counter, daemon=True).start()
print("Server running on :8000", flush=True)
server.serve_forever()
FROM python:3.12-slim
COPY graceful.py /app/graceful.py
HEALTHCHECK --interval=5s --timeout=2s --retries=2 --start-period=3s \
CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=2)" || exit 1
STOPSIGNAL SIGTERM
CMD ["python", "/app/graceful.py"]
docker build -t graceful-app:latest .
docker run -d --restart on-failure -p 8000:8000 --name graceful graceful-app:latest
sleep 5
docker stop graceful
# Check logs - should show drain mode before exit
docker logs graceful
Expected output:
Server running on :8000
Received SIGTERM, entering drain mode...
What's Next
Congratulations on completing this restart policies guide! Here's where to go from here:
- Practice daily — Add health checks to every container you run
- Build a project — Create a self-healing application with Docker restart + health checks
- Explore related topics — Graceful shutdown patterns, readiness probes in Kubernetes, container lifecycle hooks
- Join the community — Share your health check designs and get feedback
Remember: every expert was once a beginner. Keep recovering!
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro