Docker Container Troubleshooting -- Health Checks, Resource Limits & Networking
Docker containers that fail health checks, get OOM-killed, or cannot communicate across networks are everyday challenges in containerized environments -- this guide shows you how to diagnose and fix these issues using Docker's built-in inspection tools and best practices.
What You'll Learn
Why It Matters
A container that passes its health check in development but fails in production makes the service unreliable, because the orchestrator may restart it or take it out of rotation. Knowing how to inspect container state, resource usage, and network topology is essential for anyone running Docker in production.
Real-World Use
When your container orchestration platform repeatedly restarts a service due to failed health probes, your microservices cannot reach each other across a custom Docker network, or a memory leak causes OOM kills every 12 hours, these techniques pinpoint the root cause.
Common Docker Container Issues Table
| Issue | Symptom | Cause | Fix |
|---|---|---|---|
| Health check failing | Container marked "unhealthy" | Application not responding on expected port | Check HEALTHCHECK interval and endpoint |
| OOM killed | Exit code 137, "Killed" in logs | Container exceeds memory limit | Increase memory or fix memory leak |
| Cross-container networking | Cannot reach service by container name | Not on the same Docker network | Attach both containers to the same network |
| Orphaned containers | Disk space filling up | Stopped containers not cleaned up | Use docker container prune regularly |
| Layer cache bloat | Image builds are slow | Frequently changing files copied early invalidate the cache for every later layer | Optimize Dockerfile layer ordering |
| Timezone/locale mismatch | Wrong timezone inside container | Container uses UTC by default | Set TZ environment variable |
Step-by-Step Fixes
Fix 1: Debug Failing Health Checks
# Inspect current health status
docker inspect --format '{{json .State.Health}}' my-container
# View the most recent health check log entry
docker inspect my-container | jq '.[].State.Health.Log[-1]'
# Test the health endpoint manually
docker exec my-container curl -f http://localhost:8080/health
# Override health check for debugging
docker run --health-cmd="curl -f http://localhost:8080/health || exit 1" \
--health-interval=5s \
--health-retries=3 \
my-image
# View container logs around health check failures
docker logs --since 5m my-container 2>&1 | grep -i health
Expected output:
{
"Status": "unhealthy",
"FailingStreak": 5,
"Log": [
{
"Start": "2026-06-23T10:00:00Z",
"Output": "Connection refused",
"ExitCode": 1
}
]
}
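When a deploy script has to wait for the health status shown above to settle, a small polling loop can help. The following is a minimal sketch, not a definitive implementation: the function names and the STATUS_CMD override are hypothetical and exist only so the loop can be exercised without a running Docker daemon.

```shell
#!/bin/sh
# Print the health status Docker reports: starting, healthy, or unhealthy.
docker_health_status() {
  docker inspect --format '{{.State.Health.Status}}' "$1"
}

# Poll until the container is healthy, is unhealthy, or the attempts run out.
# STATUS_CMD is a hypothetical override so the loop can be tried without Docker.
wait_for_healthy() {
  container=$1
  attempts=${2:-30}
  delay=${3:-2}
  attempt=1
  status=unknown
  while [ "$attempt" -le "$attempts" ]; do
    status=$("${STATUS_CMD:-docker_health_status}" "$container" 2>/dev/null)
    case $status in
      healthy)   echo "healthy after $attempt check(s)"; return 0 ;;
      unhealthy) echo "unhealthy after $attempt check(s)"; return 1 ;;
    esac
    sleep "$delay"
    attempt=$((attempt + 1))
  done
  echo "timed out, last status: ${status:-unknown}"
  return 2
}
```

A typical call would be wait_for_healthy my-container 30 2, which checks up to 30 times with a 2-second pause between checks.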
Fix 2: Investigate OOM Kills
# Check if the container was OOM-killed
docker inspect my-container --format '{{.State.OOMKilled}}'
# Check the exit code (137 = 128 + 9, i.e. SIGKILL; an OOM kill is one common cause)
docker inspect my-container --format '{{.State.ExitCode}}'
# View a one-time snapshot of current resource usage
docker stats --no-stream my-container
# Set memory limits
docker run -m 512m --memory-reservation 256m my-image
# Update limits on a running container
docker update --memory 1g --memory-swap 1g my-container
Expected output:
true
137
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O
abc123def456 my-app 45.2% 478.2MiB / 512MiB 93.4% 1.2GB / 3.4GB
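To catch containers like the one above before they hit their limit, the stats output can be filtered by memory percentage. This is a sketch under the assumption that the input is the two-column format produced by docker stats --no-stream --format '{{.Name}} {{.MemPerc}}'; the function name flag_high_memory and the default threshold of 90 are illustrative choices.

```shell
#!/bin/sh
# Read "name percent" lines on stdin and print the containers whose memory
# usage is at or above the threshold (default: 90 percent of the limit).
flag_high_memory() {
  awk -v limit="${1:-90}" '
    {
      pct = $2
      sub(/%/, "", pct)              # strip the trailing percent sign
      if (pct + 0 >= limit + 0)
        print $1 " is at " $2 " of its memory limit"
    }'
}
```

Usage: docker stats --no-stream --format '{{.Name}} {{.MemPerc}}' | flag_high_memory 90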
Fix 3: Fix Container Networking
# List all Docker networks
docker network ls
# Inspect which networks a container is on
docker inspect my-container --format '{{json .NetworkSettings.Networks}}' | jq
# Connect two containers to the same network
docker network create my-network
docker network connect my-network container-a
docker network connect my-network container-b
# Test communication using container names
docker exec container-a ping -c 3 container-b
# Check DNS resolution
docker exec container-a nslookup container-b
Expected output:
PING container-b (172.18.0.3) 56(84) bytes of data.
64 bytes from container-b.my-network (172.18.0.3): icmp_seq=1 ttl=64 time=0.123 ms
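Before connecting anything, it can be useful to confirm whether two containers already share a network. The sketch below lists the networks common to both. The function names and the NETWORKS_CMD override are hypothetical, added only so the comparison logic can be tried without Docker.

```shell
#!/bin/sh
# Print the names of the networks a container is attached to, one per line.
container_networks() {
  docker inspect \
    --format '{{range $name, $conf := .NetworkSettings.Networks}}{{println $name}}{{end}}' \
    "$1"
}

# Print every network that both containers are attached to.
# NETWORKS_CMD is a hypothetical override used for testing without Docker.
shared_networks() {
  nets_b=$("${NETWORKS_CMD:-container_networks}" "$2")
  "${NETWORKS_CMD:-container_networks}" "$1" | while IFS= read -r net; do
    [ -n "$net" ] || continue
    if printf '%s\n' "$nets_b" | grep -Fxq -- "$net"; then
      printf '%s\n' "$net"
    fi
  done
}
```

An empty result from shared_networks container-a container-b means the two containers cannot resolve each other by name until they are attached to a common user-defined network.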
Fix 4: Clean Up Orphaned Resources
# List all stopped containers
docker ps -a --filter "status=exited"
# Remove all stopped containers
docker container prune -f
# Remove unused images
docker image prune -a -f
# Remove all unused resources (containers, images, networks, build cache); --volumes also deletes unused volumes and their data permanently
docker system prune -a --volumes -f
# View disk usage by Docker objects
docker system df
Expected output:
Total reclaimed space: 2.1GB
TYPE TOTAL ACTIVE SIZE RECLAIMABLE
Images 5 2 3.2GB 1.8GB (56%)
Containers 12 3 512MB 480MB (93%)
Local Volumes 8 4 1.1GB 680MB (61%)
Build Cache 0 0 0B 0B
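Pruning everything at once can remove images or containers that are still wanted. A gentler approach, sketched below, prunes only objects older than a given age using the until filter, and supports a dry run that prints the commands instead of executing them. The helper names run and prune_older_than and the DRY_RUN variable are assumptions made for this example.

```shell
#!/bin/sh
# Run a command, or only print it when DRY_RUN=1 (hypothetical switch).
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Remove stopped containers and unused images older than the given age.
# The age uses Docker's duration syntax, for example 24h or 168h (one week).
prune_older_than() {
  age=${1:-168h}
  run docker container prune -f --filter "until=$age"
  run docker image prune -a -f --filter "until=$age"
}
```

Running DRY_RUN=1 first shows exactly which commands a scheduled job would execute. Volumes are deliberately left out so that no data is deleted.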
Fix 5: Optimize Docker Layer Caching
# Bad: copying all source before npm install invalidates the dependency layer on every code change
FROM node:18
WORKDIR /app
COPY . .
RUN npm install
CMD ["node", "index.js"]
# Good: install dependencies before copying source
FROM node:18
WORKDIR /app
COPY package*.json ./
RUN npm install
COPY . .
CMD ["node", "index.js"]
Expected output:
Step 4/6 : RUN npm install
---> Using cache
---> abc123def456
Step 5/6 : COPY . .
---> 789def012abc # Only this layer rebuilds
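The ordering mistake shown in the "Bad" example can be detected mechanically. The following is a rough sketch of such a check, assuming a Node.js project and a simple Dockerfile: it only recognizes a plain COPY . instruction followed by npm install or npm ci in the same build stage, and it does not handle COPY flags such as --chown. The function name check_copy_order is hypothetical.

```shell
#!/bin/sh
# Warn when "COPY ." appears before the npm install step in the same stage,
# which forces dependencies to be reinstalled on every source change.
check_copy_order() {
  awk '
    toupper($1) == "FROM" { copied_all = 0 }            # new build stage
    toupper($1) == "COPY" && $2 == "." { copied_all = 1 }
    toupper($1) == "RUN" && /npm (install|ci)/ {
      if (copied_all) {
        print "line " NR ": npm install runs after COPY ., so the cache is lost on every source change"
        bad = 1
      }
    }
    END { exit bad + 0 }
  ' "$1"
}
```

check_copy_order Dockerfile returns a non-zero status when it finds the pattern, so it could be wired into a pre-commit hook or CI step.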
Docker Container Troubleshooting Flowchart
flowchart TD
A[Container Issue] --> B{Running?}
B -->|No| C[Check exit code]
C -->|137| D[SIGKILL, often OOM -- check OOMKilled, then raise memory limit]
C -->|139| E[SIGSEGV -- segfault in app]
C -->|Other non-zero| F[Application error -- check logs]
B -->|Yes| G{Health check passing?}
G -->|No| H[Inspect health status]
H --> I[Test endpoint manually with exec]
G -->|Yes| J{Network reachable?}
J -->|No| K[Check network connections]
K --> L[Connect to shared network]
J -->|Yes| M{Time and locale correct?}
M -->|No| N[Check timezone or locale]
N --> O[Set TZ environment variable]
M -->|Yes| P
D --> P[Container Healthy]
F --> P
I --> P
L --> P
O --> P
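The exit-code branch of the flowchart relies on the convention that a code above 128 means the process was terminated by signal number (code - 128). A minimal sketch of that decoding follows; the function name explain_exit_code is hypothetical, and only the signals mentioned in this guide plus SIGTERM are named explicitly.

```shell
#!/bin/sh
# Translate a container exit code into a short diagnosis.
explain_exit_code() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "exit 0: process finished normally"
  elif [ "$code" -gt 128 ]; then
    signal=$((code - 128))          # codes above 128 encode a signal number
    case $signal in
      9)  echo "exit $code: SIGKILL (signal 9), check .State.OOMKilled" ;;
      11) echo "exit $code: SIGSEGV (signal 11), segfault in the application" ;;
      15) echo "exit $code: SIGTERM (signal 15), stopped on request" ;;
      *)  echo "exit $code: terminated by signal $signal" ;;
    esac
  else
    echo "exit $code: application error, check the logs"
  fi
}
```

Usage: explain_exit_code "$(docker inspect --format '{{.State.ExitCode}}' my-container)"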
Prevention Tips
- Define HEALTHCHECK in every Dockerfile with appropriate interval, timeout, and retries
- Set memory limits on every container with --memory and --memory-reservation to prevent noisy neighbors
- Put related containers on a custom Docker network instead of using --link
- Schedule docker system prune -f via cron to reclaim disk space weekly
- Order Dockerfile commands from least-changing to most-changing to maximize layer cache hits
- Use Docker Compose with health check dependencies using depends_on with condition: service_healthy
Practice Questions
What exit code does a container get when it is killed by the OOM killer and how do you check it?
Answer: Exit code 137 (128 + 9 = SIGKILL). Check with docker inspect --format '{{.State.OOMKilled}}' <container> and docker inspect --format '{{.State.ExitCode}}' <container>.
How do you debug DNS resolution issues between containers on the same Docker host?
Answer: Ensure both containers are on the same user-defined bridge network. Use docker exec <container> ping <other-container> to test name resolution. User-defined networks provide automatic DNS resolution using container names.
What is the difference between docker system prune and docker container prune?
Answer: docker container prune removes only stopped containers. docker system prune removes stopped containers, unused networks, dangling images, and build cache. Adding --volumes also removes unused volumes.
Challenge: Write a Docker health check that pings both the application endpoint and a database dependency, returning unhealthy if either is unreachable.
Answer:
HEALTHCHECK --interval=15s --timeout=5s --retries=3 \
  CMD curl -f http://localhost:8080/health && \
      nc -zv db-host 5432 || exit 1
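Once a health check involves more than one dependency, it can be easier to maintain as a small script that reports which check failed. The sketch below assumes it would be copied into the image under a hypothetical name such as healthcheck.sh and referenced from HEALTHCHECK CMD; the function name run_checks is also an illustrative choice.

```shell
#!/bin/sh
# Run each check command in order; stop at the first failure.
# Exit status 0 marks the container healthy, 1 marks it unhealthy.
run_checks() {
  for check in "$@"; do
    if ! sh -c "$check" >/dev/null 2>&1; then
      echo "failed: $check"
      return 1
    fi
  done
  echo "ok"
}

# Example invocation inside the container (endpoint and host are placeholders):
# run_checks "curl -fsS http://localhost:8080/health" "nc -z db-host 5432"
```

Because the failing command is printed, it appears in the Output field of the health log returned by docker inspect, which makes the cause easier to spot.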
Quick Reference
| Issue | Check Command | Fix |
|---|---|---|
| Health check failing | docker inspect --format '{{.State.Health}}' <id> | Fix endpoint or interval |
| OOM killed | docker inspect --format '{{.State.OOMKilled}}' <id> | docker update --memory 1g <id> |
| Network isolation | docker inspect <id> --format '{{.NetworkSettings.Networks}}' | docker network connect |
| Disk full from Docker | docker system df | docker system prune -a --volumes |
| Slow image builds | Check Dockerfile layer order | Copy package.json before source |
FAQ
Why does my container pass health checks locally but fail in production?
The most common causes are different probe settings, different resource limits (a memory- or CPU-constrained container may not respond before the timeout), or network policies that block the health endpoint. Note that Kubernetes ignores the Dockerfile HEALTHCHECK and runs its own liveness and readiness probes, so keep the two configurations aligned: use the same endpoint and comparable interval, timeout, and retry values.
How do you handle containers that need to run with the host timezone instead of UTC?
Set the TZ environment variable when running the container: docker run -e TZ=America/New_York my-image. This only takes effect if the image contains the time zone database (the tzdata package on most distributions). On Linux hosts, you can also bind-mount /etc/localtime read-only: docker run -v /etc/localtime:/etc/localtime:ro my-image.
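A quick way to confirm that a TZ value is actually being honored is to print the UTC offset it resolves to. The helper below is a small sketch with a hypothetical name, tz_offset. On many systems an unknown zone, or a missing time zone database, silently resolves to +0000, so an unexpected +0000 is a hint that tzdata is missing (keeping in mind that some zones legitimately sit at +0000).

```shell
#!/bin/sh
# Print the UTC offset (for example -0500) that a TZ value resolves to.
tz_offset() {
  TZ=$1 date +%z
}
```

Inside a container, the same check can be run with docker exec my-container sh -c 'TZ=America/New_York date +%z'.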
What causes the "/bin/sh: executable file not found" error even when the binary exists in the image?
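This typically happens when the binary is compiled for a different architecture (e.g., ARM binary on x86 host), the binary does not have execute permissions, or the binary depends on shared libraries that are not present in the image. Check the file type with docker run --rm my-image file /path/to/binary and the library dependencies with docker run --rm my-image ldd /path/to/binary. Both tools must be present in the image, and if the image defines an ENTRYPOINT you need to override it with --entrypoint.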
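For the architecture case, comparing the host architecture with the one recorded in the image is a quick first test. The sketch below normalizes the names that uname -m prints to the names Docker uses; the function names docker_arch and arch_matches are hypothetical, and the mapping covers only a few common architectures.

```shell
#!/bin/sh
# Map `uname -m` output to the architecture names Docker uses.
docker_arch() {
  case $1 in
    x86_64|amd64)  echo amd64 ;;
    aarch64|arm64) echo arm64 ;;
    armv7l)        echo arm ;;
    *)             echo "$1" ;;     # pass unknown names through unchanged
  esac
}

# Succeed when two architecture names refer to the same platform.
arch_matches() {
  [ "$(docker_arch "$1")" = "$(docker_arch "$2")" ]
}
```

Usage: arch_matches "$(uname -m)" "$(docker image inspect --format '{{.Architecture}}' my-image)" || echo "architecture mismatch"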
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.