CI/CD Pipeline Troubleshooting -- Build Failures, Flaky Tests & Deployment Errors

DodaTech Updated 2026-06-23 9 min read

CI/CD pipeline failures block deployments and slow down development teams -- this guide shows you how to diagnose and fix broken builds, flaky tests, deployment rollbacks, and pipeline performance issues across GitHub Actions, GitLab CI, and Jenkins.

What You'll Learn

Why It Matters

A pipeline that fails 20% of the time erodes developer trust and can slow delivery by as much as 40%. A flaky test that takes 5 minutes to re-run can cost the team hours per week once the re-runs add up across every developer and every push. Knowing how to debug pipeline failures efficiently is essential for any DevOps engineer.

Real-World Use

When your deployment pipeline fails with "error: failed to push" right before a release, tests pass locally but fail in CI, or a Docker build step takes 30 minutes because the cache was invalidated, these debugging techniques pinpoint the root cause.

Common CI/CD Pipeline Issues Table

Issue	Symptom	Platform	Fix
Build cache miss	Full rebuild every time	All	Pin cache keys and restore steps
Flaky test	Passes locally, fails in CI	All	Check race conditions, timing, and isolation
Secret not available	Auth failures in pipeline	All	Verify secret names and scope
Docker build slow	30+ minute builds	All	Optimize layer caching and use BuildKit
Stage timeout	Pipeline cancelled after limit	Jenkins, GitLab	Increase timeout or split stage
Deployment rollback	Health checks fail post-deploy	All	Check canary analysis and rollback strategy
Permission denied	Git push or API call fails	GitHub Actions	Update GITHUB_TOKEN permissions or PAT scope

Step-by-Step Fixes

Fix 1: Debug Build Cache Misses

# .github/workflows/build.yml -- Optimize caching
name: Build
on: [push]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Cache npm dependencies
      - name: Cache npm
        uses: actions/cache@v4
        with:
          path: ~/.npm
          key: ${{ runner.os }}-npm-${{ hashFiles('**/package-lock.json') }}
          restore-keys: |
            ${{ runner.os }}-npm-

      # Cache Docker layers
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Build and cache Docker image
        uses: docker/build-push-action@v5
        with:
          cache-from: type=gha
          cache-to: type=gha,mode=max
          push: false
          load: true
          tags: myapp:latest

# Verify cache is being used locally
docker buildx build --cache-from=type=local,src=/tmp/docker-cache \
                    --cache-to=type=local,dest=/tmp/docker-cache \
                    -t myapp:latest .

# Check cache hit/miss in output
docker build . --progress=plain 2>&1 | grep "CACHED"

Expected output:

#7 [ 4/10] RUN npm ci
#7 CACHED

#8 [ 5/10] COPY . .
#8 0.5s  # Uncacheable step -- changes every build
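To turn the grep above into something you can track over time, a short script can tally cached versus executed steps. This is only a sketch: `summarize_cache` is a hypothetical helper, and it assumes the `--progress=plain` layout shown above, where each step is introduced by a `#N [stage] instruction` line and a cache hit prints `#N CACHED`. Internal BuildKit steps such as `[internal] load metadata` would be counted as executed by this simple parser.

```python
import re

STEP_RE = re.compile(r"^#(\d+) \[[^\]]*\] (.+)$")  # e.g. "#7 [ 4/10] RUN npm ci"
CACHED_RE = re.compile(r"^#(\d+) CACHED$")          # e.g. "#7 CACHED"

def summarize_cache(log_text):
    """Return (cached, executed) lists of build instructions from a plain-progress log."""
    steps = {}          # step id -> instruction text, in log order
    cached_ids = set()  # step ids that reported a cache hit
    for raw in log_text.splitlines():
        line = raw.strip()
        step = STEP_RE.match(line)
        if step:
            steps[step.group(1)] = step.group(2)
            continue
        hit = CACHED_RE.match(line)
        if hit:
            cached_ids.add(hit.group(1))
    cached = [cmd for sid, cmd in steps.items() if sid in cached_ids]
    executed = [cmd for sid, cmd in steps.items() if sid not in cached_ids]
    return cached, executed

# Sample log modeled on the expected output above.
sample = """\
#7 [ 4/10] RUN npm ci
#7 CACHED

#8 [ 5/10] COPY . .
#8 DONE 0.5s
"""
cached, executed = summarize_cache(sample)
print(f"cached={cached} executed={executed}")
```

If an expensive instruction such as `RUN npm ci` shows up in the executed list on every run, the layers before it are the first place to look for a change that invalidates the cache.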

Fix 2: Investigate Flaky Tests

# test_flaky.py -- Example of a flaky async test
import asyncio
import random

import pytest

# Bad: the assertion depends on which worker finishes first
@pytest.mark.asyncio
async def test_async_race_condition():
    """This test fails intermittently (roughly 1 run in 5)."""
    results = []
    async def worker(name, delay):
        await asyncio.sleep(delay)
        results.append(name)  # Append order follows completion order
    # "fast" usually finishes first, but not when its random delay exceeds 0.08s
    await asyncio.gather(worker("fast", random.random() * 0.1), worker("slow", 0.08))
    assert results == ["fast", "slow"]  # Fails sometimes

# Good: assert on a result that does not depend on scheduling
@pytest.mark.asyncio
async def test_async_fixed():
    async def worker(name, delay):
        await asyncio.sleep(delay)
        return name
    # gather() returns values in argument order, not completion order
    results = await asyncio.gather(worker("fast", random.random() * 0.1), worker("slow", 0.08))
    assert results == ["fast", "slow"]  # Always passes

# Run a flaky test multiple times to reproduce
# (--timeout requires the pytest-timeout plugin)
for i in $(seq 1 20); do
  if pytest test_flaky.py -x --timeout=30 > /dev/null 2>&1; then
    echo "Run $i: PASSED"
  else
    echo "Run $i: FAILED"
  fi
done

# Use pytest-flakefinder to run tests multiple times
pip install pytest-flakefinder
pytest --flake-finder --flake-runs=10 test_flaky.py

# Run tests in random order to detect order dependencies
# (requires the pytest-random-order plugin)
pytest --random-order test_flaky.py

Expected output:

Run 1: PASSED
Run 2: PASSED
Run 3: FAILED  # Intermittent!
Run 4: PASSED
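A pass/fail list like the one above is easier to act on as a single failure rate. The following is a minimal sketch, not part of any pytest plugin: `flake_rate` is a hypothetical helper that reruns a command and counts non-zero exit codes. A stand-in command is used so the example runs anywhere.

```python
import subprocess
import sys

def flake_rate(cmd, runs=20):
    """Run cmd `runs` times and return the fraction of runs that exited non-zero."""
    failures = 0
    for _ in range(runs):
        result = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        if result.returncode != 0:
            failures += 1
    return failures / runs

# Stand-in command that always succeeds; swap in
# ["pytest", "test_flaky.py", "-x"] to measure a real test file.
always_ok = [sys.executable, "-c", "raise SystemExit(0)"]
print(f"failure rate: {flake_rate(always_ok, runs=3):.0%}")
```

Tracking this number per test over time shows whether a fix actually removed the flakiness or only made it rarer.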

Fix 3: Resolve Deployment Health Check Failures

# .github/workflows/deploy.yml -- Canary deployment with rollback
name: Deploy
on:
  workflow_dispatch:
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Deploy canary (10% traffic)
        run: |
          kubectl set image deployment/myapp-canary myapp=myapp:${{ github.sha }}
          kubectl rollout status deployment/myapp-canary --timeout=5m

      - name: Health check canary
        run: |
          for i in $(seq 1 12); do
            status=$(curl -s -o /dev/null -w "%{http_code}" https://canary.example.com/health)
            if [ "$status" == "200" ]; then
              echo "Canary healthy"
              exit 0
            fi
            echo "Attempt $i: status $status, waiting 5s..."
            sleep 5
          done
          echo "Canary health check failed, rolling back"
          kubectl rollout undo deployment/myapp-canary
          exit 1

      - name: Promote to production
        run: |
          kubectl set image deployment/myapp-production myapp=myapp:${{ github.sha }}
          kubectl rollout status deployment/myapp-production --timeout=5m

# Check deployment history
kubectl rollout history deployment/myapp-production

# Roll back to a specific revision (omit --to-revision to return to the previous one)
kubectl rollout undo deployment/myapp-production --to-revision=3

# Search recent pod logs for errors
kubectl logs -l app=myapp --tail=100 --prefix | grep -i error

Expected output:

deployment.apps/myapp-canary rolled out
Attempt 1: status 502, waiting 5s...
Attempt 2: status 503, waiting 5s...
Canary health check failed, rolling back
deployment.apps/myapp-canary rolled back
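The polling loop in the health check step can also live in a script, which makes it easier to test outside the pipeline. The sketch below mirrors the shell logic (12 attempts, 5 seconds apart) under a few assumptions: `http_status` and `wait_until_healthy` are hypothetical names, and the `fetch` and `sleep` parameters exist only so the loop can be exercised without a network or real delays.

```python
import time
import urllib.error
import urllib.request

def http_status(url, timeout=5):
    """Return the HTTP status code for url, or 0 if the request fails outright."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses are raised as exceptions
    except (urllib.error.URLError, OSError):
        return 0         # DNS failure, refused connection, or timeout

def wait_until_healthy(url, attempts=12, delay=5, fetch=http_status, sleep=time.sleep):
    """Poll url until it returns 200; return True on success, False otherwise."""
    for attempt in range(1, attempts + 1):
        status = fetch(url)
        if status == 200:
            return True
        print(f"Attempt {attempt}: status {status}, waiting {delay}s...")
        if attempt < attempts:
            sleep(delay)
    return False

# Simulated canary that recovers on the third probe (no network, no real sleeping).
responses = iter([502, 503, 200])
ok = wait_until_healthy(
    "https://canary.example.com/health",
    fetch=lambda url: next(responses),
    sleep=lambda seconds: None,
)
print(ok)
```

In a pipeline, the caller would run the rollback command and exit non-zero when the function returns False, as the shell version does.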

Fix 4: Fix Secret Management Issues

# .github/workflows/secret-check.yml -- Verify secrets before running
name: Deploy with Secrets
on: [push]
jobs:
  deploy:
    runs-on: ubuntu-latest
    env:
      # Check if required secrets are set
      HAS_TOKEN: ${{ secrets.DEPLOY_TOKEN != '' }}
    steps:
      - name: Validate secrets
        run: |
          if [ "$HAS_TOKEN" != "true" ]; then
            echo "::error::DEPLOY_TOKEN secret is not set. Add it in Settings -> Secrets."
            exit 1
          fi
          echo "All required secrets are configured"

      - name: Deploy
        env:
          DEPLOY_TOKEN: ${{ secrets.DEPLOY_TOKEN }}
          API_KEY: ${{ secrets.API_KEY }}
        run: |
          # Secrets are now available as environment variables
          ./deploy.sh

# Check if a secret exists in GitHub Actions (via API)
gh secret list --repo owner/repo | grep DEPLOY_TOKEN

# Update a secret
gh secret set DEPLOY_TOKEN --body "your-token-value"

# Verify the secret is available in a pipeline step
echo "Secret length: ${#DEPLOY_TOKEN}"  # Should print > 0

Expected output:

All required secrets are configured
Deploying with token of length 40...
Deployment successful
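The same validation can run at the top of the deploy script itself, so that it fails fast even when invoked outside the workflow. This is a sketch under the assumption that secrets reach the script as environment variables, as in the Deploy step above; `missing_secrets` is a hypothetical helper. It reports names only and never prints values.

```python
import os

REQUIRED = ("DEPLOY_TOKEN", "API_KEY")  # names taken from the workflow above

def missing_secrets(env=os.environ, required=REQUIRED):
    """Return the required names that are unset or empty in env."""
    return [name for name in required if not env.get(name)]

# A secret that is not configured reaches the step as an empty string.
missing = missing_secrets({"DEPLOY_TOKEN": "x" * 40, "API_KEY": ""})
print(missing)
```

A real script would call `missing_secrets()` with no arguments and exit with a non-zero status if the list is not empty.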

Fix 5: Debug Pipeline Timeouts

// Jenkinsfile -- Fix stage timeouts
pipeline {
    agent any
    options {
        timeout(time: 1, unit: 'HOURS')  // Global pipeline timeout
    }
    stages {
        stage('Build') {
            options {
                timeout(time: 10, unit: 'MINUTES')
            }
            steps {
                sh 'make build'
            }
        }
        stage('Test') {
            options {
                timeout(time: 30, unit: 'MINUTES')
                retry(2)  // Retry flaky steps
            }
            parallel {
                stage('Unit tests') {
                    steps { sh 'make test-unit' }
                }
                stage('Integration tests') {
                    steps { sh 'make test-integration' }
                }
            }
        }
    }
    post {
        failure {
            // The build result is already FAILURE here, so it does not need to be set again
            echo "Pipeline failed: ${env.BUILD_URL}"
        }
    }
}

# Check the slowest stage in a finished pipeline
# Jenkins: Navigate to the build -> Blue Ocean -> Stage view
# GitHub Actions: Actions tab -> workflow run -> job -> step timing
# Identify the stage with the longest wall clock time

# Reduce build time by splitting a monolith into parallel jobs
# Check for unnecessary steps (e.g., npm install on every run)

Expected output:

[Pipeline] { (Build)
[Pipeline] timeout
[Pipeline] { (Build - Timeout 10m)
...
[Pipeline] } (Build - Timeout)
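Once stage start and end times are available, finding the stage with the longest wall clock time is a small calculation. The sketch below uses hypothetical timings and a hypothetical `stage_durations` helper; in practice the timestamps would be copied from the stage view or an API export.

```python
from datetime import datetime

def stage_durations(stages):
    """Map stage name -> wall clock seconds from (name, start, end) ISO-8601 tuples."""
    return {
        name: (datetime.fromisoformat(end) - datetime.fromisoformat(start)).total_seconds()
        for name, start, end in stages
    }

# Hypothetical timings for the three stages in the Jenkinsfile above.
stages = [
    ("Build", "2026-06-23T10:00:00", "2026-06-23T10:04:30"),
    ("Unit tests", "2026-06-23T10:04:30", "2026-06-23T10:09:30"),
    ("Integration tests", "2026-06-23T10:04:30", "2026-06-23T10:31:00"),
]
durations = stage_durations(stages)
slowest = max(durations, key=durations.get)
print(f"slowest stage: {slowest} ({durations[slowest] / 60:.1f} min)")
```

With these sample numbers the integration tests run for 26.5 minutes against a 30 minute limit, which is the kind of narrow margin that turns into intermittent timeouts.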

CI/CD Pipeline Troubleshooting Flowchart

flowchart TD
    A[Pipeline Failure] --> B{Failure at which stage?}
    B -->|Build| C[Check cache hits and Docker layers]
    C --> D[Verify dependency installation]
    D --> E[Check for platform-specific issues]
    B -->|Test| F[Run tests locally with CI env]
    F --> G[Check for race conditions and timing]
    G --> H[Set random test order and flake runs]
    B -->|Deploy| I[Check health checks and rollback]
    I --> J[Inspect pod logs and events]
    J --> K[Verify secret permissions and scope]
    B -->|Timeout| L[Check stage durations]
    L --> M[Increase timeout or split stage]
    M --> N[Parallelize slow steps]
    E --> O[Pipeline Passes]
    H --> O
    K --> O
    N --> O

Prevention Tips

Use deterministic cache keys based on dependency file hashes to maximize cache hits
Run flaky tests 10 times in CI and mark them as xfail if they fail more than 2 times
Always verify required secrets exist at the start of a pipeline with a validation step
Set appropriate timeouts per stage instead of one global timeout
Implement canary deployments with automated rollback on health check failure
Use Docker BuildKit cache features (cache-from/cache-to) to speed up container builds
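The first tip, deterministic cache keys, comes down to hashing the dependency lockfile so the key changes only when dependencies change. The sketch below illustrates that idea only; it does not reproduce the exact algorithm behind `hashFiles`, and `cache_key` is a hypothetical helper.

```python
import hashlib
import platform
import tempfile
from pathlib import Path

def cache_key(lockfile, prefix="npm"):
    """Build an OS-scoped cache key from the SHA-256 of a lockfile's bytes."""
    digest = hashlib.sha256(Path(lockfile).read_bytes()).hexdigest()
    return f"{platform.system()}-{prefix}-{digest}"

# Demonstrate with a throwaway lockfile: same content -> same key, new content -> new key.
with tempfile.TemporaryDirectory() as tmp:
    lock = Path(tmp) / "package-lock.json"
    lock.write_text('{"lockfileVersion": 3}')
    first = cache_key(lock)
    same = cache_key(lock)
    lock.write_text('{"lockfileVersion": 3, "packages": {}}')
    changed = cache_key(lock)

print(first == same, first == changed)
```

Keys built from timestamps or commit SHAs fail this property, which is why they produce a cache miss on every run.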

Practice Questions

How do you debug a test that passes locally but fails consistently in CI? Answer: Compare the local and CI environments: check Node.js/Python versions, environment variables, file permissions, and available memory. Run the test with CI=true environment variable locally. Check if the test depends on a specific timezone, locale, or file path that differs between environments.
What is the difference between a GitHub Actions cache action and a Docker BuildKit cache? Answer: The GitHub Actions cache action caches dependency directories (like ~/.npm, ~/.cache/pip) between workflow runs. Docker BuildKit cache (cache-from/cache-to type=gha) caches Docker build layers, allowing unchanged layers to skip rebuilds even on different runners. Both are needed for optimal CI/CD build times.
How do you handle secrets that expire or rotate while CI/CD pipelines reference them? Answer: Store secret expiration dates in a metadata file and add a scheduled workflow that checks expiry and alerts before rotation. Use a secrets manager like HashiCorp Vault or AWS Secrets Manager with dynamic credentials instead of static tokens. For GitHub Actions, rotate secrets via gh secret set in an admin-only workflow.
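The expiry check described in the last answer could look roughly like the sketch below. The metadata format (a JSON object mapping secret names to ISO expiry dates) and the `expiring_soon` helper are assumptions made for illustration; only names and dates are stored, never secret values.

```python
import json
from datetime import date, timedelta

def expiring_soon(metadata_json, today, warn_days=14):
    """Return names of secrets whose expiry date falls within warn_days of today."""
    cutoff = today + timedelta(days=warn_days)
    entries = json.loads(metadata_json)
    return sorted(
        name for name, expires in entries.items()
        if date.fromisoformat(expires) <= cutoff
    )

# Hypothetical metadata file contents.
metadata = '{"DEPLOY_TOKEN": "2026-07-01", "API_KEY": "2026-12-31"}'
due = expiring_soon(metadata, today=date(2026, 6, 23))
print(due)
```

A scheduled workflow would run this daily and raise an alert whenever the returned list is not empty.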

Challenge: Write a GitHub Actions workflow that runs a canary deployment, checks HTTP 200 on the health endpoint for 5 minutes, and auto-rolls back if any check fails. Answer:

name: Canary Deploy
on: [workflow_dispatch]
jobs:
  canary:
    runs-on: ubuntu-latest
    steps:
      - run: kubectl set image deploy/canary app=app:${{ github.sha }}
      - run: |
          for i in $(seq 1 30); do
            code=$(curl -s -o /dev/null -w "%{http_code}" https://canary.example.com/health)
            if [ "$code" != "200" ]; then
              kubectl rollout undo deploy/canary
              exit 1
            fi
            sleep 10
          done
      - run: kubectl set image deploy/prod app=app:${{ github.sha }}

Quick Reference

Issue	Diagnostic	Resolution
Build cache miss	Check CACHED in build output	Fix cache keys, use BuildKit
Flaky test	`pytest --flake-finder --flake-runs=10`	Fix race conditions, add retry
Secret missing	`gh secret list`	Set secret in repo settings
Health check fail	`kubectl logs` pods	Rollback, fix startup probe
Stage timeout	Check stage duration in logs	Increase timeout or parallelize

FAQ

How do you handle flaky integration tests that depend on external services?

Use service containers (GitHub Actions services:) or test containers to spin up dependencies in the pipeline instead of relying on shared environments. Set explicit health checks before the test step starts. If the external service is genuinely unreliable, wrap the test in a retry mechanism (max 3 retries with exponential backoff) and log the flakiness rate for infrastructure teams.
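A retry wrapper with exponential backoff, as described above, can be sketched in a few lines. The names here (`retry`, `flaky_service`) are hypothetical, and the `sleep` parameter is injectable only so the example runs without real delays.

```python
import time

def retry(func, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func; on exception retry up to max_retries times, doubling the delay each time."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception as err:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            delay = base_delay * 2 ** attempt  # 1s, 2s, 4s, ...
            print(f"attempt {attempt + 1} failed ({err}); retrying in {delay:g}s")
            sleep(delay)

# Simulated external service that fails twice, then succeeds.
calls = {"n": 0}

def flaky_service():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("service unavailable")
    return "ok"

delays = []  # record the requested delays instead of sleeping
result = retry(flaky_service, sleep=delays.append)
print(result, delays)
```

The printed failure lines double as the flakiness log: counting them per run gives the rate to report to the infrastructure team.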

What is the best strategy for Docker layer caching in CI/CD across different runners?

Use the GitHub Actions cache backend for BuildKit (cache-from: type=gha, cache-to: type=gha,mode=max) to share layers across runners. With mode=max, BuildKit exports the layers of every build stage, including intermediate ones, rather than only the layers in the final image (the default, mode=min), so a later build can reuse more of them after a partial Dockerfile change. For self-hosted runners, use a shared network volume with type=local.

Why does `GITHUB_TOKEN` sometimes fail with "Resource not accessible by integration"?

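Depending on the repository or organization settings, the default GITHUB_TOKEN may be limited to read-only access, and tokens issued to workflows triggered by pull requests from forks are read-only as well. To push commits, publish packages, or request an OIDC token, grant the permissions explicitly in the workflow YAML: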

permissions:
  contents: write
  packages: write
  id-token: write

For cross-repo access, use a personal access token (PAT) stored as a secret instead.

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
