Docker Restart Policies & Health Checks — Keep Containers Running

DodaTech Updated 2026-06-24 8 min read

In this tutorial, you'll learn about Docker Restart Policies & Health Checks. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.

Docker restart policies and health checks define how containers recover from failures and how Docker determines whether a container is actually serving traffic, not just running.

What You'll Learn

You'll master Docker's restart policies (no, always, on-failure, unless-stopped), the HEALTHCHECK instruction, start period for graceful startup, and automated recovery patterns for production.

Why This Problem Matters

Containers crash. Memory leaks kill processes. Network partitions drop connections. Without restart policies, a crashed container stays dead until someone manually restarts it. Health checks ensure your load balancer only sends traffic to containers that respond correctly.

Real-World Use

Durga Antivirus Pro uses Docker restart policies with health checks for its scanning worker containers. When a scan worker hits a memory limit and crashes, the on-failure:5 policy restarts it automatically. Health checks ensure the worker only receives jobs after it finishes loading virus signatures.

Restart Policy Behavior

flowchart TB
  Start[Container Starts] --> Run[Running]
  Run -->|Process exits| Check{Exit Code}
  Check -->|0: Clean exit| No[no: Do nothing]
  Check -->|Non-zero: Error| OnFail{on-failure?}
  OnFail -->|yes| Count{Restart count < max?}
  Count -->|yes| Delay["Wait delay<br/>100ms → 200ms → 400ms"]
  Delay --> Start
  Count -->|no| Stop[Stop permanently]
  OnFail -->|no| Always{always / unless-stopped?}
  Always -->|always| Start
  Always -->|unless-stopped & manually stopped| ManualStop[Do not restart]
  Always -->|unless-stopped & daemon restart| Start
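The doubling delay shown in the flowchart (100ms → 200ms → 400ms) can be sketched as a small generator. This is an illustration only: the function name restart_delays and the 60-second cap (cap_ms) are assumptions made for this sketch, not values taken from this guide.

```python
from itertools import islice


def restart_delays(initial_ms=100, cap_ms=60_000):
    """Yield successive restart delays in milliseconds, doubling each time.

    cap_ms is an assumed upper bound used only in this sketch.
    """
    delay = initial_ms
    while True:
        yield delay
        delay = min(delay * 2, cap_ms)


first_five = list(islice(restart_delays(), 5))
print(first_five)  # [100, 200, 400, 800, 1600]
```

The growing delay keeps a container that fails immediately from being restarted in a tight loop.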

Restart Policies in Action

# No restart (default)
docker run --restart no alpine echo "one shot"

# Always restart (service containers)
docker run -d --name web --restart always nginx

# Restart on failure (up to 5 times)
docker run -d --name worker --restart on-failure:5 myapp

# Unless manually stopped
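# The official postgres image exits at startup unless a password is provided
docker run -d --name db --restart unless-stopped -e POSTGRES_PASSWORD=example postgres

# Python (Docker SDK): watch on-failure restarts happen
import time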
import docker

client = docker.from_env()

# Create a container that exits with error
container = client.containers.run(
    "alpine",
    "sh -c 'exit 1'",
    restart_policy={"Name": "on-failure", "MaximumRetryCount": 3},
    detach=True
)

# Wait and check restart count
time.sleep(3)
container.reload()
print(f"Status: {container.status}")
print(f"Restart count: {container.attrs['RestartCount']}")
container.remove(force=True)

Expected output:

Status: exited
Restart count: 3

Exit code 1 triggers the restart policy. Docker restarts the container up to 3 times, and once the retries are used up the container stays in the exited state.
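The decision Docker makes when a container exits can be approximated with a short function. This is a simplified model, not Docker's implementation: the name should_restart and its parameters are hypothetical, a max_retries of 0 is treated as "no limit", and a manually stopped container is assumed to stay stopped until the daemon restarts or the container is started again by hand.

```python
def should_restart(policy, exit_code, restart_count=0, max_retries=0,
                   manually_stopped=False):
    """Return True if a container that just exited would be restarted.

    Simplified sketch of the four restart policies; hypothetical helper.
    """
    if manually_stopped:
        # A manual stop overrides every policy until the daemon restarts
        return False
    if policy == "no":
        return False
    if policy in ("always", "unless-stopped"):
        # Both restart regardless of the exit code
        return True
    if policy == "on-failure":
        if exit_code == 0:
            return False
        # max_retries == 0 means "retry without a limit" in this sketch
        return max_retries == 0 or restart_count < max_retries
    raise ValueError(f"unknown restart policy: {policy}")


print(should_restart("on-failure", exit_code=1, restart_count=2, max_retries=3))  # True
print(should_restart("on-failure", exit_code=1, restart_count=3, max_retries=3))  # False
print(should_restart("always", exit_code=0))  # True
```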

Health Check Basics

# Single health check example
FROM node:20-alpine

HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
  CMD wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1

EXPOSE 3000
CMD ["node", "server.js"]

# Check health status
docker ps --format "table {{.Names}}\t{{.Status}}"

# Inspect health check results
docker inspect --format='{{json .State.Health}}' web | jq

Expected output:

NAMES               STATUS
web                 Up 2 minutes (healthy)

{
  "Status": "healthy",
  "FailingStreak": 0,
  "Log": [
    {
      "Start": "2026-06-24T10:00:05Z",
      "End": "2026-06-24T10:00:05Z",
      "ExitCode": 0,
      "Output": "200 OK"
    }
  ]
}

Custom Health Check Script

import http.server
import json
import threading
import time

class HealthHandler(http.server.BaseHTTPRequestHandler):
    # Flipped to True once startup work has finished
    healthy = False

    def do_GET(self):
        if self.path == "/health":
            if HealthHandler.healthy:
                self.send_response(200)
                self.send_header("Content-Type", "application/json")
                self.end_headers()
                self.wfile.write(json.dumps({"status": "ok"}).encode())
            else:
                # Still starting: answer 503 so the health check fails
                self.send_response(503)
                self.end_headers()
        else:
            self.send_response(200)
            self.end_headers()

def simulate_startup():
    # Pretend initialization takes 5 seconds
    time.sleep(5)
    HealthHandler.healthy = True

threading.Thread(target=simulate_startup, daemon=True).start()
server = http.server.HTTPServer(("0.0.0.0", 8000), HealthHandler)
server.serve_forever()

# Dockerfile
FROM python:3.12-slim

COPY healthcheck.py /app/healthcheck.py

HEALTHCHECK --interval=5s --timeout=2s --start-period=8s --retries=2 \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1

CMD ["python", "/app/healthcheck.py"]

During the start period (8s), the health check runs but failures don't count toward retries. This allows the application to initialize before being considered unhealthy.
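How the start period and retries interact can be traced with a small simulation. This is a sketch under stated assumptions, not Docker's code: the function health_status and its input format (a list of elapsed-seconds and pass/fail pairs) are invented for illustration, and the model assumes a passing check marks the container healthy immediately, which also ends the grace window.

```python
def health_status(checks, start_period, retries):
    """Return the final health status after a series of check results.

    checks: list of (seconds_since_start, passed) tuples in time order.
    Hypothetical helper that models start period and retries only.
    """
    status = "starting"
    failing_streak = 0
    for elapsed, passed in checks:
        if passed:
            status = "healthy"
            failing_streak = 0
            continue
        # Failures are ignored only while still starting and inside the window
        if status == "starting" and elapsed < start_period:
            continue
        failing_streak += 1
        if failing_streak >= retries:
            status = "unhealthy"
    return status


# Same two failed checks, with and without an 8s start period
print(health_status([(5, False), (10, False)], start_period=0, retries=2))  # unhealthy
print(health_status([(5, False), (10, False)], start_period=8, retries=2))  # starting
print(health_status([(5, False), (10, True)], start_period=8, retries=2))   # healthy
```

With the 8-second start period, the failure at 5 seconds is not counted, so the container gets one more interval to come up before it can be marked unhealthy.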

Application-Level Health Endpoint

from flask import Flask, jsonify
import redis
import psycopg2

app = Flask(__name__)

@app.route("/health")
def health():
    health_status = {"status": "healthy", "checks": {}}

    # Check Redis
    try:
        r = redis.Redis(host="redis", port=6379, socket_connect_timeout=2)
        r.ping()
        health_status["checks"]["redis"] = "ok"
    except Exception as e:
        health_status["checks"]["redis"] = f"unhealthy: {str(e)}"
        health_status["status"] = "unhealthy"

    # Check PostgreSQL
    try:
        conn = psycopg2.connect(
            host="db", dbname="myapp",
            user="user", password="pass",
            connect_timeout=2
        )
        cur = conn.cursor()
        cur.execute("SELECT 1")
        cur.close()
        conn.close()
        health_status["checks"]["database"] = "ok"
    except Exception as e:
        health_status["checks"]["database"] = f"unhealthy: {str(e)}"
        health_status["status"] = "unhealthy"

    status_code = 200 if health_status["status"] == "healthy" else 503
    return jsonify(health_status), status_code

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)

# Query the health endpoint (-s hides curl's progress meter)
curl -s http://localhost:5000/health | python3 -m json.tool

Expected output:

{
  "status": "healthy",
  "checks": {
    "redis": "ok",
    "database": "ok"
  }
}

Restart Policy with Compose

# docker-compose.yml
services:
  web:
    image: nginx:alpine  # Alpine variant ships BusyBox wget, which the health check below needs
    restart: always
    healthcheck:
      test: ["CMD", "wget", "--spider", "http://localhost"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 10s

  worker:
    image: myapp/worker
    restart: on-failure:5
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:9090/ready || exit 1"]
      interval: 15s
      start_period: 30s

  db:
    image: postgres:16
    restart: unless-stopped
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5

Automated Recovery with Docker Events

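import docker
import json

client = docker.from_env()

def handle_event(event):
    # Health events carry an Action such as "health_status: unhealthy",
    # so match on the prefix instead of the exact string.
    if event.get("Type") == "container" and event.get("Action", "").startswith("health_status"):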
        container_id = event["Actor"]["ID"]
        container = client.containers.get(container_id)
        health = container.attrs.get("State", {}).get("Health", {})

        if health.get("Status") == "unhealthy":
            print(f"Container {container.name} is unhealthy. Restarting...")
            container.restart()

            # Log the incident
            log_entry = {
                "container": container.name,
                "action": "restarted",
                "reason": "health_check_failure",
                "failing_streak": health.get("FailingStreak")
            }
            with open("/var/log/container-recovery.json", "a") as f:
                f.write(json.dumps(log_entry) + "\n")

# Monitor events
for event in client.events(decode=True):
    handle_event(event)

Common Mistakes

1. Using --restart always for One-Shot Jobs

A batch job started with --restart always restarts forever, even after it exits with code 0. Use --restart no for jobs, or --restart on-failure if a retry on error is desired.

2. Ignoring Start Period

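Without --start-period, a container that takes 30 seconds to initialize fails 6 consecutive checks (at 5s intervals) and is marked unhealthy as soon as the failures reach --retries. Docker alone does not restart it, but an orchestrator or recovery script that reacts to the unhealthy status will, which can produce a restart loop.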

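3. Health Check Commands Without Timeouts

A health check command that hangs (e.g., curl without --max-time) keeps the check occupied until Docker's own --timeout (30s by default) expires, which delays detection of a real failure. Always set a short --timeout and a timeout on the health check command itself.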
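A probe with an explicit timeout might look like the following sketch, written with the Python standard library. The function name probe, the 2-second default, and the choice to treat any 2xx or 3xx response as healthy are assumptions for illustration.

```python
import urllib.error
import urllib.request


def probe(url, timeout=2.0):
    """Return 0 if url answers with a 2xx/3xx status within timeout seconds, else 1.

    Hypothetical helper; the return value is meant to be used as an exit code.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 0 if 200 <= response.status < 400 else 1
    except (urllib.error.URLError, OSError, ValueError):
        # Covers HTTP errors, refused connections, timeouts, and malformed URLs
        return 1
```

A health check script would pass the result to sys.exit, for example sys.exit(probe("http://localhost:8000/health")), so that a slow or hung endpoint fails the check after 2 seconds instead of waiting for Docker's timeout.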

4. No Health Check on Database Containers

Add pg_isready or mysqladmin ping as the health check on database containers. Dependent services can then wait until the database is ready (in Compose, depends_on with condition: service_healthy), which prevents the "app starts before DB" race condition.
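When the dependent service cannot rely on the database's health status, a different technique is to wait on the application side. The sketch below polls a TCP port until it accepts connections; the function name wait_for_port and its defaults are hypothetical, and an open port only shows that something is listening, not that the database is ready to answer queries.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds.

    Gives up and returns False after roughly timeout seconds.
    Hypothetical helper for illustration.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not accepting connections yet: pause, then try again
            time.sleep(interval)
    return False
```

An application could call wait_for_port("db", 5432) before opening its first database connection and exit with a non-zero code if it returns False, which lets an on-failure restart policy retry the startup.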

5. Overlapping Restart and Swarm/Orchestrator Policies

If Docker Swarm or Kubernetes manages restart policies, don't also set Docker restart policies. The orchestrator's restart handling conflicts with Docker's.

6. Health Check Port vs Application Port

A health check that probes a different port than the application port gives false positives. The web server may be running on port 80 but the application on port 3000 may be unresponsive.

7. Too Frequent Health Checks

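--interval=1s on 100 containers creates 100 health checks per second. A 30s interval per container (the Docker default) is a sensible baseline and brings the same 100 containers down to about 3 checks per second. For higher frequency, use external monitoring.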

Practice Questions

1. What is the difference between always and unless-stopped?

Both restart the container whenever it exits, and neither restarts it right after a manual docker stop. The difference shows up when the Docker daemon restarts: a manually stopped container with always is started again, while one with unless-stopped stays stopped.

2. How does the start period affect health checks?

During the start period, health check failures don't count toward retries. The container is assumed to be starting up. After the start period expires, every failure counts.

3. What exit codes trigger on-failure restart?

Any non-zero exit code. Exit code 0 (clean exit) does not trigger restart. Exit codes 1-255 trigger the restart policy.

4. How do you view health check logs?

docker inspect --format='{{json .State.Health}}' <container> shows the current status, the failing streak, and a log of the most recent checks (Docker keeps only the last five), including output, exit codes, and timestamps of each check.
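The JSON printed by that command can be condensed with a few lines of Python. This is a minimal sketch: summarize_health is a hypothetical helper, and the sample string is hand-written for illustration using the same fields as the output shown earlier (Status, FailingStreak, Log, ExitCode).

```python
import json


def summarize_health(inspect_json):
    """Condense the State.Health JSON from docker inspect into a small dict."""
    health = json.loads(inspect_json)
    log = health.get("Log") or []
    failed = [entry for entry in log if entry.get("ExitCode") != 0]
    return {
        "status": health.get("Status"),
        "failing_streak": health.get("FailingStreak", 0),
        "recorded_checks": len(log),
        "failed_checks": len(failed),
    }


# Hand-written sample data, not captured from a real container
sample = """
{
  "Status": "unhealthy",
  "FailingStreak": 2,
  "Log": [
    {"ExitCode": 0, "Output": "ok"},
    {"ExitCode": 1, "Output": "connection refused"},
    {"ExitCode": 1, "Output": "connection refused"}
  ]
}
"""
summary = summarize_health(sample)
print(summary["status"], summary["failed_checks"])  # unhealthy 2
```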

5. Challenge: Design a graceful shutdown pattern for health checks.

When draining traffic from a container (e.g., during rolling update), the health endpoint should return 503 so the load balancer removes the container from rotation. After a cooldown period, the container receives SIGTERM. Implement this with a health check endpoint that becomes unhealthy on signal.

Mini Project: Health Check Tester

import http.server
import json
import signal
import sys
import threading
import time

class GracefulShutdownHandler(http.server.BaseHTTPRequestHandler):
    shutting_down = False
    uptime = 0

    def do_GET(self):
        if self.path == "/health":
            if GracefulShutdownHandler.shutting_down:
                # Draining: fail the health check so traffic is routed elsewhere
                self.send_response(503)
                self.end_headers()
                self.wfile.write(b"draining")
            else:
                self.send_response(200)
                self.send_header("Content-Type", "application/json")
                self.end_headers()
                self.wfile.write(json.dumps({
                    "status": "healthy",
                    "uptime_seconds": GracefulShutdownHandler.uptime
                }).encode())
        else:
            self.send_response(404)
            self.end_headers()

def handle_sigterm(signum, frame):
    # Runs in the main thread; the server thread keeps answering /health with 503
    print("Received SIGTERM, entering drain mode...", flush=True)
    GracefulShutdownHandler.shutting_down = True
    time.sleep(5)
    sys.exit(0)

def uptime_counter():
    while True:
        time.sleep(1)
        GracefulShutdownHandler.uptime += 1

signal.signal(signal.SIGTERM, handle_sigterm)
server = http.server.ThreadingHTTPServer(("0.0.0.0", 8000), GracefulShutdownHandler)

# Serve in a background thread so the drain delay does not block requests
threading.Thread(target=server.serve_forever, daemon=True).start()
threading.Thread(target=uptime_counter, daemon=True).start()
print("Server running on :8000", flush=True)

# Keep the main thread alive; signal handlers run here
while True:
    time.sleep(1)
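
# Dockerfile
FROM python:3.12-slim
COPY graceful.py /app/graceful.py
# python:3.12-slim does not ship curl, so probe with the standard library
HEALTHCHECK --interval=5s --timeout=2s --retries=2 --start-period=3s \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=2)" || exit 1
STOPSIGNAL SIGTERM
CMD ["python", "/app/graceful.py"]

# Build, run, then stop the container
docker build -t graceful-app:latest .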
docker run -d --restart on-failure -p 8000:8000 --name graceful graceful-app:latest
sleep 5
docker stop graceful
# Check logs - should show drain mode before exit
docker logs graceful

Expected output:

Server running on :8000
Received SIGTERM, entering drain mode...

FAQ

What happens if a health check command itself fails with an error?

The health check is considered failed. If the failure count reaches --retries, the container status changes to "unhealthy." The container is not automatically killed — it's up to the orchestrator (Swarm, Kubernetes) or your monitoring to act on unhealthy status.

Can I use Docker restart policies with docker-compose?

Yes. The compose restart key supports the same values: no, always, on-failure, unless-stopped. The healthcheck block also supports the same parameters as the Dockerfile HEALTHCHECK instruction.

How do restart policies interact with docker swarm services?

Swarm services have their own restart settings (deploy.restart_policy, with condition none, on-failure, or any). Do not set Docker restart policies on individual containers in a Swarm service, because Swarm manages the container lifecycle.

What's Next

Docker Resource Limits
Docker Compose Health Checks
Docker Networking Deep Dive

Congratulations on completing this restart policies guide! Here's where to go from here:

  • Practice daily — Add health checks to every container you run
  • Build a project — Create a self-healing application with Docker restart + health checks
  • Explore related topics — Graceful shutdown patterns, readiness probes in Kubernetes, container lifecycle hooks
  • Join the community — Share your health check designs and get feedback

Remember: every expert was once a beginner. Keep recovering!
