
How to Fix Envoy Cluster Health Check Failure

DodaTech Updated 2026-06-24 2 min read

This guide walks through diagnosing and fixing an Envoy cluster health check failure: inspecting cluster status, configuring active health checks, verifying that endpoints are reachable, and reviewing outlier detection.

The symptom: the Envoy cluster reports 0 healthy hosts and 1 unhealthy host, and upstream requests return 503 upstream connect error. Either the health check configuration is failing or the endpoints are unreachable.

The Problem

$ curl -s http://localhost:9901/stats | grep -E "cluster\.backend\.(upstream_cx_total|membership_(healthy|total))"
cluster.backend.upstream_cx_total: 10
cluster.backend.membership_healthy: 0
cluster.backend.membership_total: 1
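
The counters above can also be checked from a script. The following is a minimal sketch, assuming the text has the `name: value` format shown above (the format produced by the admin `/stats` endpoint); the function name `membership` and the `sample` string are illustrative, not part of Envoy.

```python
def membership(stats_text: str, cluster: str) -> tuple[int, int]:
    """Return (healthy, total) host counts for `cluster` from stats text."""
    prefix = f"cluster.{cluster}."
    values = {}
    for line in stats_text.splitlines():
        name, sep, value = line.partition(": ")
        # Keep only "cluster.<name>.<stat>: <value>" lines for this cluster.
        if sep and name.startswith(prefix):
            values[name[len(prefix):]] = value.strip()
    return int(values["membership_healthy"]), int(values["membership_total"])


# Sample text copied from the output above.
sample = """cluster.backend.upstream_cx_total: 10
cluster.backend.membership_healthy: 0
cluster.backend.membership_total: 1
"""

healthy, total = membership(sample, "backend")
print(f"{healthy}/{total} hosts healthy")
```

A result where `healthy` is lower than `total` is the condition this guide sets out to fix.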

Step-by-Step Fix

Step 1: Check cluster status

curl -s http://localhost:9901/clusters | grep -A5 "backend::"
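
The `/clusters` output lists one `health_flags` line per host, which is usually the quickest way to see why a host is considered unhealthy. The sketch below pulls those lines out for one cluster. It assumes the text format `cluster::address::health_flags::flags`, where a passing host reports `healthy`; the sample addresses are hypothetical.

```python
def unhealthy_hosts(clusters_text: str, cluster: str) -> dict[str, list[str]]:
    """Map each unhealthy host address in `cluster` to its health flags."""
    marker = "::health_flags::"
    result = {}
    for line in clusters_text.splitlines():
        if marker not in line:
            continue
        prefix, _, flags = line.strip().rpartition(marker)
        # The cluster name ends at the first "::"; the rest is the host address.
        name, _, host = prefix.partition("::")
        if name == cluster and flags != "healthy":
            result[host] = flags.strip("/").split("/")
    return result


# Illustrative sample; the addresses are made up for this example.
sample_clusters = """backend::default_priority::max_connections::1024
backend::10.0.0.5:3000::cx_active::0
backend::10.0.0.5:3000::health_flags::/failed_active_hc
backend::10.0.0.6:3000::health_flags::healthy
"""

print(unhealthy_hosts(sample_clusters, "backend"))
```

A flag such as `failed_active_hc` points at the active health check (Steps 2 and 3), while `failed_outlier_check` points at outlier detection (Step 4).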

Step 2: Configure active health checks

clusters:
  - name: backend
    connect_timeout: 0.25s
    health_checks:
      - timeout: 1s
        interval: 5s
        unhealthy_threshold: 3
        healthy_threshold: 1
        http_health_check:
          path: /health
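
To get a feel for how these values interact, the sketch below gives a back-of-the-envelope estimate of how long an endpoint that stops responding can stay in rotation before it is marked unhealthy. It assumes every check runs to its full timeout and ignores jitter and any alternative interval settings, so treat it as a rough figure rather than a guarantee. The function name is hypothetical.

```python
def rough_detection_seconds(interval: float, timeout: float,
                            unhealthy_threshold: int) -> float:
    """Rough upper estimate: each failed check costs one interval plus one timeout."""
    return unhealthy_threshold * (interval + timeout)


# Values from the configuration above: interval 5s, timeout 1s, threshold 3.
print(rough_detection_seconds(5.0, 1.0, 3))  # 18.0
```

Lowering `interval` or `unhealthy_threshold` shortens detection time at the cost of more probe traffic and more sensitivity to brief hiccups.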

Step 3: Verify endpoints are reachable

curl -v http://backend:3000/health

Step 4: Check outlier detection

outlier_detection:
  consecutive_5xx: 3
  interval: 10s
  base_ejection_time: 30s

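Step 5: Toggle Envoy's own health check status

curl -X POST http://localhost:9901/healthcheck/fail
curl -X POST http://localhost:9901/healthcheck/ok

These admin endpoints take no cluster name. They control how this Envoy instance answers inbound health checks (through the HTTP health check filter), which is useful for draining it and putting it back in rotation. They do not reset the health state of upstream hosts in the backend cluster. An upstream host becomes healthy again only after it passes healthy_threshold consecutive active checks or its outlier ejection expires.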

Prevention Tips

  • Set realistic health check timeouts based on endpoint latency
  • Monitor membership changes via Envoy stats
  • Use passive health check (outlier detection) alongside active checks
  • Configure both HTTP and TCP health checks for redundancy

Common Mistakes with cluster health

  1. Setting a health check timeout shorter than the endpoint's real response time, so slow but working hosts are marked unhealthy
  2. Pointing http_health_check at a path that does not return HTTP 200, such as one that redirects or requires authentication
  3. Reading membership_healthy before new endpoints have completed their first health check

These mistakes appear frequently in real-world Envoy configurations.

Practice Exercise

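Point the backend cluster at an endpoint whose /health path returns a non-200 status, confirm that membership_healthy drops to 0, then fix the endpoint and watch the host return to healthy. Repeat the test with a health check timeout shorter than the endpoint's response time.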

This exercise reinforces the concepts covered in this guide. Try implementing it before checking online solutions.

FAQ

Why does Envoy mark healthy endpoints as unhealthy despite responding to health checks?

The health check timeout might be too short for your endpoint's response time. Increase the timeout field in the health check config. Also check that the health check path returns HTTP 200 within the timeout window.

How does outlier detection differ from active health checks?

Active health checks periodically probe endpoints with requests. Outlier detection monitors actual request success rates and ejects failing endpoints. Use both together for robust health management.

What does "membership_healthy" vs "membership_total" mean?

membership_total is the number of endpoints in the cluster. membership_healthy is the number currently considered healthy, taking both active health checks and outlier detection into account. A discrepancy indicates endpoints that are failing checks, have been ejected, or have not completed their first health check.
