Docker Container Troubleshooting -- Health Checks, Resource Limits & Networking

DodaTech Updated 2026-06-23 7 min read

Containers that fail health checks, get OOM-killed, or cannot reach each other across networks are everyday problems in containerized environments. This guide shows how to diagnose and fix them with Docker's built-in inspection tools and a few best practices.

What You'll Learn

Why It Matters

A container that passes its health check in development but fails it in production makes your application unreliable. Knowing how to inspect container state, resource usage, and network topology is essential for anyone running Docker in production.

Real-World Use

When your container orchestration platform repeatedly restarts a service due to failed health probes, your microservices cannot reach each other across a custom Docker network, or a memory leak causes OOM kills every 12 hours, these techniques pinpoint the root cause.

Common Docker Container Issues Table

Issue	Symptom	Cause	Fix
Health check failing	Container marked "unhealthy"	Application not responding on expected port	Check HEALTHCHECK interval and endpoint
OOM killed	Exit code 137, "Killed" in logs	Container exceeds memory limit	Increase memory or fix memory leak
Cross-container networking	Cannot reach service by container name	Not on the same Docker network	Attach both containers to the same network
Orphaned containers	Disk space filling up	Stopped containers not cleaned up	Use `docker container prune` regularly
Layer cache bloat	Image builds are slow	Frequently changing files copied early, invalidating the cache for every later layer	Optimize Dockerfile layer ordering
Timezone/locale mismatch	Wrong timezone inside container	Container uses UTC by default	Set TZ environment variable

Step-by-Step Fixes

Fix 1: Debug Failing Health Checks

# Inspect current health status
docker inspect --format '{{json .State.Health}}' my-container

# View the most recent health check log entry
docker inspect my-container | jq '.[].State.Health.Log[-1]'

# Test the health endpoint manually
docker exec my-container curl -f http://localhost:8080/health

# Override health check for debugging
docker run --health-cmd="curl -f http://localhost:8080/health || exit 1" \
           --health-interval=5s \
           --health-retries=3 \
           my-image

# View container logs around health check failures (merge stderr so grep sees it too)
docker logs --since 5m my-container 2>&1 | grep -i health

Expected output:

{
  "Status": "unhealthy",
  "FailingStreak": 5,
  "Log": [
    {
      "Start": "2026-06-23T10:00:00Z",
      "Output": "Connection refused",
      "ExitCode": 1
    }
  ]
}
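When a deployment script has to wait for a container to settle, the status field shown above can be polled in a loop. The following is a minimal sketch, not an official Docker facility: the function name `wait_for_healthy` and its default timeout and poll interval are assumptions chosen for illustration. If the container has no HEALTHCHECK, the status stays empty and the function simply times out.

```shell
# Poll a container's health status until it is healthy, unhealthy, or a timeout expires.
# Usage: wait_for_healthy <container> [timeout_seconds] [poll_interval_seconds]
# (wait_for_healthy is a hypothetical helper name, not a Docker command.)
wait_for_healthy() {
  container=$1
  timeout=${2:-60}
  poll=${3:-2}
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    # Prints starting, healthy, or unhealthy when a HEALTHCHECK is defined.
    status=$(docker inspect --format '{{.State.Health.Status}}' "$container" 2>/dev/null)
    case "$status" in
      healthy)   echo "healthy after ${elapsed}s"; return 0 ;;
      unhealthy) echo "unhealthy after ${elapsed}s"; return 1 ;;
    esac
    sleep "$poll"
    elapsed=$((elapsed + poll))
  done
  echo "timed out after ${timeout}s (last status: ${status:-unknown})"
  return 2
}
```

A call such as `wait_for_healthy my-container 120 5` returns 0, 1, or 2, so a script can branch on the result before routing traffic to the container.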

Fix 2: Investigate OOM Kills

# Check if the container was OOM-killed
docker inspect my-container --format '{{.State.OOMKilled}}'

# Check the exit code (137 = 128 + 9, i.e. SIGKILL; an OOM kill is one possible cause)
docker inspect my-container --format '{{.State.ExitCode}}'

# View a one-time snapshot of current resource usage
docker stats --no-stream my-container

# Set memory limits
docker run -m 512m --memory-reservation 256m my-image

# Update limits on a running container
docker update --memory 1g --memory-swap 1g my-container

Expected output:

true
137

CONTAINER ID   NAME           CPU %     MEM USAGE / LIMIT    MEM %     NET I/O
abc123def456   my-container   45.2%     478.2MiB / 512MiB    93.4%     1.2GB / 3.4GB
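Exit code 137 alone only says the process received SIGKILL, so it helps to read it together with the OOMKilled flag. The sketch below combines the two values; `describe_exit` is a hypothetical helper name, and the wording of each message is illustrative rather than anything Docker prints.

```shell
# Classify a container exit from the two values docker inspect reports.
# Usage: describe_exit <exit_code> <oom_killed: true|false>
# Example call against a real container (my-container as in the commands above):
#   describe_exit "$(docker inspect --format '{{.State.ExitCode}}' my-container)" \
#                 "$(docker inspect --format '{{.State.OOMKilled}}' my-container)"
describe_exit() {
  code=$1
  oom=$2
  if [ "$oom" = "true" ]; then
    echo "OOM-killed: the container hit its memory limit"
  elif [ "$code" -gt 128 ]; then
    # Exit codes above 128 mean the process was terminated by signal (code - 128).
    sig=$((code - 128))
    case "$sig" in
      9)  echo "SIGKILL without the OOM flag: possibly docker kill or a docker stop timeout" ;;
      11) echo "SIGSEGV: the process crashed with a segmentation fault" ;;
      15) echo "SIGTERM: the process was asked to stop" ;;
      *)  echo "terminated by signal $sig" ;;
    esac
  elif [ "$code" -eq 0 ]; then
    echo "clean exit: the main process finished"
  else
    echo "application error: exit code $code, check docker logs"
  fi
}
```

Checking the flag first matters because an OOM kill and a manual `docker kill` both surface as 137; only the flag tells them apart.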

Fix 3: Fix Container Networking

# List all Docker networks
docker network ls

# Inspect which networks a container is on
docker inspect my-container --format '{{json .NetworkSettings.Networks}}' | jq .

# Connect two containers to the same network
docker network create my-network
docker network connect my-network container-a
docker network connect my-network container-b

# Test communication using container names (-c 3 stops after three packets)
docker exec container-a ping -c 3 container-b

# Check DNS resolution
docker exec container-a nslookup container-b

Expected output:

PING container-b (172.18.0.3) 56(84) bytes of data.
64 bytes from container-b.my-network (172.18.0.3): icmp_seq=1 ttl=64 time=0.123 ms
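Before connecting anything, it can be useful to confirm whether two containers already share a network. The sketch below compares the network names reported by `docker inspect`; the helper names `networks_of` and `shared_networks` are assumptions for illustration, and container-a and container-b are the placeholder names used above.

```shell
# List the networks a container is attached to, one name per line.
networks_of() {
  docker inspect \
    --format '{{range $name, $cfg := .NetworkSettings.Networks}}{{println $name}}{{end}}' "$1"
}

# Print every network that both containers are attached to.
# Returns 0 if at least one is shared, 1 if none, 2 if inspect fails.
shared_networks() {
  first=$(networks_of "$1") || return 2
  second=$(networks_of "$2") || return 2
  found=1
  for net in $first; do
    for other in $second; do
      if [ "$net" = "$other" ]; then
        echo "$net"
        found=0
      fi
    done
  done
  return "$found"
}
```

For example, `shared_networks container-a container-b || docker network connect my-network container-b` attaches the second container only when no common network exists. Keep in mind that sharing only the default bridge network does not give name-based resolution; that requires a user-defined network.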

Fix 4: Clean Up Orphaned Resources

# List all stopped containers
docker ps -a --filter "status=exited"

# Remove all stopped containers
docker container prune -f

# Remove unused images
docker image prune -a -f

# Remove all unused resources (containers, images, networks, build cache); --volumes also deletes unused volumes and their data
docker system prune -a --volumes -f

# View disk usage by Docker objects
docker system df

Expected output:

Total reclaimed space: 2.1GB

TYPE                TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images              5         2         3.2GB     1.8GB (56%)
Containers          12        3         512MB     480MB (93%)
Local Volumes       8         4         1.1GB     680MB (61%)
Build Cache         0         0         0B        0B
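For scheduled cleanup, a gentler variant than a full system prune is to remove only objects older than a cutoff and to leave volumes untouched. The following is a sketch under those assumptions; `prune_older_than` is a hypothetical helper name and the one-week default is an arbitrary choice.

```shell
# Remove stopped containers and unused images older than a cutoff; volumes are left alone.
# Usage: prune_older_than [hours]
# (prune_older_than is a hypothetical helper; the 168-hour default is an assumption.)
prune_older_than() {
  hours=${1:-168}
  # The until filter keeps anything created within the last $hours hours.
  docker container prune --force --filter "until=${hours}h" &&
  docker image prune --all --force --filter "until=${hours}h"
}
```

Calling `prune_older_than 24` from a cron job keeps the last day of containers and images available for debugging while still reclaiming space.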

Fix 5: Optimize Docker Layer Caching

# Bad: copying all source first invalidates the npm install layer on every code change
FROM node:18
WORKDIR /app
COPY . .
RUN npm install
CMD ["node", "index.js"]

# Good: install dependencies before copying source
FROM node:18
WORKDIR /app
COPY package*.json ./
RUN npm install
COPY . .
CMD ["node", "index.js"]

Expected output (classic builder; BuildKit marks cached steps with CACHED instead):

Step 4/6 : RUN npm install
 ---> Using cache
 ---> abc123def456

Step 5/6 : COPY . .
 ---> 789def012abc  # Cache miss: this step and every step after it rebuild
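The ordering rule can be checked mechanically. The sketch below is a deliberately narrow lint, not a general Dockerfile parser: it only recognizes a whole-context copy written as `COPY . <dest>` followed by `npm install` or `npm ci` in the same build stage. The function name `check_copy_order` is an assumption for illustration.

```shell
# Warn when a whole-context COPY precedes the dependency install step.
# Usage: check_copy_order <path-to-Dockerfile>
# Returns 0 when the order looks cache-friendly, 1 when a problem is reported.
check_copy_order() {
  awk '
    $1 == "FROM" { copied_all = 0 }                 # each build stage starts fresh
    $1 == "COPY" && $2 == "." { copied_all = 1 }    # whole build context copied
    $1 == "RUN" && /npm (install|ci)/ && copied_all {
      printf "line %d: dependency install runs after COPY . (any source change invalidates it)\n", NR
      found = 1
    }
    END { exit found + 0 }
  ' "$1"
}
```

Running `check_copy_order Dockerfile` in CI gives an early hint when someone reorders the instructions and silently loses the cached dependency layer.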

Docker Container Troubleshooting Flowchart

flowchart TD
    A[Container Issue] --> B{Running?}
    B -->|No| C[Check exit code]
    C -->|137| D[OOM killed -- increase memory]
    C -->|139| E[SIGSEGV -- segfault in app]
    C -->|Other| F[Application exit or error -- check logs]
    B -->|Yes| G{Health check passing?}
    G -->|No| H[Inspect health status]
    H --> I[Test endpoint manually with exec]
    G -->|Yes| J{Network reachable?}
    J -->|No| K[Check network connections]
    K --> L[Connect to shared network]
    J -->|Yes| M{Timezone and locale correct?}
    M -->|No| N[Check timezone or locale]
    M -->|Yes| P
    N --> O[Set TZ environment variable]
    D --> P[Container Healthy]
    F --> P
    I --> P
    L --> P
    O --> P

Prevention Tips

Define HEALTHCHECK in every Dockerfile with appropriate interval, timeout, and retries
Set memory limits on every container with --memory and --memory-reservation to prevent noisy neighbors
Put related containers on a custom Docker network instead of using --link
Schedule docker system prune -f via cron to reclaim disk space weekly
Order Dockerfile commands from least-changing to most-changing to maximize layer cache hits
Use Docker Compose with health check dependencies using depends_on with condition: service_healthy

Practice Questions

What exit code does a container get when it is killed by the OOM killer and how do you check it? Answer: Exit code 137 (128 + 9 = SIGKILL). Check with docker inspect --format '{{.State.OOMKilled}}' <container> and docker inspect --format '{{.State.ExitCode}}' <container>.
How do you debug DNS resolution issues between containers on the same Docker host? Answer: Ensure both containers are on the same user-defined bridge network. Use docker exec <container> ping <other-container> to test name resolution. User-defined networks provide automatic DNS resolution using container names.
What is the difference between docker system prune and docker container prune? Answer: docker container prune removes only stopped containers. docker system prune removes stopped containers, unused networks, dangling images, and build cache. Adding --volumes also removes unused volumes.

Challenge: Write a Docker health check that pings both the application endpoint and a database dependency, returning unhealthy if either is unreachable. Answer:

HEALTHCHECK --interval=15s --timeout=5s --retries=3 \
  CMD curl -f http://localhost:8080/health && \
      nc -zv db-host 5432 || exit 1
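The same logic is easier to read and extend as a small script that the HEALTHCHECK instruction calls. This is a sketch under a few assumptions: the image contains `curl` and an `nc` build that supports `-z` and `-w`, and `check_health` is a hypothetical function name. The endpoint and `db-host` are the placeholders from the answer above.

```shell
# Return 0 only when both the application endpoint and the database port respond.
# (check_health is a hypothetical name; localhost:8080 and db-host:5432 come from the answer above.)
check_health() {
  # -f fails on HTTP errors (status >= 400); -sS hides progress but keeps error messages.
  curl -fsS --max-time 3 "http://localhost:8080/health" >/dev/null || return 1
  # -z probes the port without sending data; -w bounds the wait in seconds.
  nc -z -w 3 "db-host" 5432 || return 1
  return 0
}
```

Saved in a script that ends with a call to `check_health`, this can be wired in with a HEALTHCHECK CMD pointing at the script. Docker treats exit code 0 as healthy and 1 as unhealthy, which is why every failure path returns exactly 1.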

Quick Reference

Issue	Check Command	Fix
Health check failing	`docker inspect --format '{{.State.Health}}' <id>`	Fix endpoint or interval
OOM killed	`docker inspect --format '{{.State.OOMKilled}}' <id>`	`docker update --memory 1g --memory-swap 1g <id>`
Network isolation	`docker inspect <id> --format '{{.NetworkSettings.Networks}}'`	`docker network connect`
Disk full from Docker	`docker system df`	`docker system prune -a --volumes`
Slow image builds	Check Dockerfile layer order	Copy package.json before source

FAQ

Why does my container pass health checks locally but fail in production?

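The most common causes are different probe settings, different resource limits (a memory- or CPU-constrained container may not respond before the timeout), and network policies that block the health endpoint. Note that Kubernetes ignores the Dockerfile HEALTHCHECK instruction and relies on the liveness, readiness, and startup probes defined in the pod spec, whose default timeout of 1 second is far stricter than Docker's 30-second default. Compare your HEALTHCHECK interval, timeout, and retries with the orchestrator's probe settings, and test locally with the production values.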
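To compare settings across environments, it helps to turn interval, timeout, and retries into a single number: roughly how long a broken container can go undetected. The sketch below uses a simple upper bound of retries × (interval + timeout) for Docker's HEALTHCHECK, ignoring any start period; the function name `max_detection_seconds` is an assumption, and other orchestrators schedule probes differently, so treat the result as an estimate only.

```shell
# Rough upper bound, in seconds, on how long Docker needs to mark a container unhealthy.
# Usage: max_detection_seconds <interval_s> <timeout_s> <retries>
# The configured values can be read with:
#   docker inspect --format '{{json .Config.Healthcheck}}' my-container
# (durations in that JSON are reported in nanoseconds)
max_detection_seconds() {
  interval=$1
  timeout=$2
  retries=$3
  # Each failed attempt can take up to one interval of waiting plus one timeout.
  echo $(( retries * (interval + timeout) ))
}
```

With Docker's defaults (30s interval, 30s timeout, 3 retries) the bound is 180 seconds, while `--interval=15s --timeout=5s --retries=3` brings it down to 60.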

How do you handle containers that need to run with the host timezone instead of UTC?

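Set the TZ environment variable when running the container: docker run -e TZ=America/New_York my-image. This only takes effect if the image includes the timezone database (the tzdata package on most distributions). On Linux hosts, you can instead bind-mount the host's /etc/localtime read-only: docker run -v /etc/localtime:/etc/localtime:ro my-image.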
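A quick way to confirm the setting took effect is to compare the UTC offset on the host with the one inside the container. This is a minimal sketch that assumes the image ships a `date` command; `tz_offset_matches_host` is a hypothetical helper name.

```shell
# Compare the host's UTC offset with the offset reported inside a running container.
# Usage: tz_offset_matches_host <container>
# Returns 0 on a match, 1 on a mismatch, 2 if docker exec fails.
tz_offset_matches_host() {
  host_offset=$(date +%z)
  container_offset=$(docker exec "$1" date +%z) || return 2
  echo "host=$host_offset container=$container_offset"
  [ "$host_offset" = "$container_offset" ]
}
```

Matching offsets do not prove the zone names are identical, since different zones can share an offset, but a mismatch is a reliable sign that TZ or the bind mount is not being applied.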

What causes the "/bin/sh: executable file not found" error even when the binary exists in the image?

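This typically happens when the binary is compiled for a different architecture (e.g., an ARM binary on an x86 host), lacks execute permission, or depends on a dynamic loader or shared libraries that are missing from the image (a common case is a glibc-linked binary in a musl-based Alpine image). Inspect the binary with docker run --rm my-image file /path/to/binary and docker run --rm my-image ldd /path/to/binary. Both tools must be present in the image, and an image with an ENTRYPOINT may need it overridden with --entrypoint.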
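The architecture case can be ruled out without entering the container by comparing the platform recorded in the image with the platform of the Docker daemon. The sketch below does that comparison; `image_matches_host_arch` is a hypothetical helper name, and a mismatch is not always fatal because some hosts run foreign architectures through emulation.

```shell
# Compare an image's OS/architecture with the Docker daemon's.
# Usage: image_matches_host_arch <image>
# Returns 0 on a match, 1 on a mismatch, 2 if a docker command fails.
image_matches_host_arch() {
  image_platform=$(docker image inspect --format '{{.Os}}/{{.Architecture}}' "$1") || return 2
  host_platform=$(docker version --format '{{.Server.Os}}/{{.Server.Arch}}') || return 2
  echo "image: $image_platform, daemon: $host_platform"
  [ "$image_platform" = "$host_platform" ]
}
```

If `image_matches_host_arch my-image` reports something like linux/arm64 against linux/amd64, rebuilding or pulling the image for the daemon's platform is usually the first thing to try.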

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
