CI/CD Pipeline Troubleshooting -- Build Failures, Flaky Tests & Deployment Errors
CI/CD pipeline failures block deployments and slow down development teams -- this guide shows you how to diagnose and fix broken builds, flaky tests, deployment rollbacks, and pipeline performance issues across GitHub Actions, GitLab CI, and Jenkins.
What You'll Learn
Why It Matters
A pipeline that fails 20% of the time destroys developer trust and slows delivery by 40%. Each flaky test that takes 5 minutes to re-run costs the team hours per week. Knowing how to debug pipeline failures efficiently is essential for any DevOps engineer.
Real-World Use
When your deployment pipeline fails with "error: failed to push" right before a release, tests pass locally but fail in CI, or a Docker build step takes 30 minutes because the cache was invalidated, these debugging techniques pinpoint the root cause.
Common CI/CD Pipeline Issues Table
| Issue | Symptom | Platform | Fix |
|---|---|---|---|
| Build cache miss | Full rebuild every time | All | Key the cache on lockfile hashes and add restore-keys fallbacks |
| Flaky test | Passes locally, fails in CI | All | Check race conditions, timing, and isolation |
| Secret not available | Auth failures in pipeline | All | Verify secret names and scope |
| Docker build slow | 30+ minute builds | All | Optimize layer caching and use BuildKit |
| Stage timeout | Pipeline cancelled after limit | Jenkins, GitLab | Increase timeout or split stage |
| Deployment rollback | Health checks fail post-deploy | All | Check canary analysis and rollback strategy |
| Permission denied | Git push or API call fails | GitHub Actions | Update GITHUB_TOKEN permissions or PAT scope |
Step-by-Step Fixes
Fix 1: Debug Build Cache Misses
# .github/workflows/build.yml -- Optimize caching
name: Build
on: [push]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Cache npm dependencies, keyed on the lockfile hash
      - name: Cache npm
        uses: actions/cache@v4
        with:
          path: ~/.npm
          key: ${{ runner.os }}-npm-${{ hashFiles('**/package-lock.json') }}
          # Fall back to the most recent npm cache when the lockfile changed
          restore-keys: |
            ${{ runner.os }}-npm-
      # Cache Docker layers
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Build and cache Docker image
        uses: docker/build-push-action@v5
        with:
          cache-from: type=gha
          cache-to: type=gha,mode=max
          push: false
          load: true
          tags: myapp:latest
# Verify cache is being used locally
docker buildx build --cache-from=type=local,src=/tmp/docker-cache \
--cache-to=type=local,dest=/tmp/docker-cache \
-t myapp:latest .
# Check cache hit/miss in output
docker build . --progress=plain 2>&1 | grep "CACHED"
Expected output:
#7 [ 4/10] RUN npm ci
#7 CACHED
#8 [ 5/10] COPY . .
#8 0.5s # Uncacheable step -- changes every build
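Grepping for CACHED works for a quick look, but on a long build it helps to see which instructions were rebuilt. The following is a minimal sketch, assuming the plain-progress log format shown in the expected output above; the function name summarize_cache and the sample log are illustrative, not part of any Docker tooling.

```python
import re

# Matches plain-progress step headers such as "#7 [ 4/10] RUN npm ci".
STEP_HEADER = re.compile(r"^#(\d+) \[\s*[^\]]+\]\s+(.*)$")
# Matches the cache marker for a step, e.g. "#7 CACHED".
STEP_CACHED = re.compile(r"^#(\d+) CACHED\s*$")


def summarize_cache(log_text):
    """Return (cached_steps, rebuilt_steps) as lists of instruction strings."""
    steps = {}      # step id -> instruction text
    cached = set()  # step ids reported as CACHED
    for line in log_text.splitlines():
        header = STEP_HEADER.match(line)
        if header:
            steps.setdefault(header.group(1), header.group(2))
            continue
        marker = STEP_CACHED.match(line)
        if marker:
            cached.add(marker.group(1))
    cached_steps = [text for sid, text in steps.items() if sid in cached]
    rebuilt_steps = [text for sid, text in steps.items() if sid not in cached]
    return cached_steps, rebuilt_steps


# Hypothetical sample log, modeled on the expected output above.
sample = """#7 [ 4/10] RUN npm ci
#7 CACHED
#8 [ 5/10] COPY . .
#8 DONE 0.5s
"""
cached_steps, rebuilt_steps = summarize_cache(sample)
print("cached:", cached_steps)
print("rebuilt:", rebuilt_steps)
```

If an expensive step such as dependency installation shows up in the rebuilt list on every run, an earlier instruction is probably invalidating the layers that follow it.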
Fix 2: Investigate Flaky Tests
# test_flaky.py -- Example of a flaky async test
import asyncio
import random

import pytest


# Bad: asserts on completion order, which depends on random sleep times.
# Note: list.append itself is safe here because asyncio runs on one thread;
# the flakiness comes from the order in which the workers wake up.
@pytest.mark.asyncio
async def test_async_order_dependent():
    """This test fails intermittently."""
    results = []

    async def worker(i):
        await asyncio.sleep(random.random() * 0.1)
        results.append(i)  # Appended in finish order, not in launch order

    await asyncio.gather(*[worker(i) for i in range(3)])
    assert results == [0, 1, 2]  # Fails whenever workers finish out of order


# Good: gather() returns results in argument order, regardless of finish order
@pytest.mark.asyncio
async def test_async_fixed():
    async def worker(i):
        await asyncio.sleep(random.random() * 0.1)
        return i

    results = await asyncio.gather(*[worker(i) for i in range(3)])
    assert results == [0, 1, 2]  # Always passes
# Run a flaky test multiple times to reproduce
# (--timeout requires the pytest-timeout plugin)
for i in $(seq 1 20); do
  echo "Run $i:"
  pytest test_flaky.py -x --timeout=30 2>&1 | tail -1
done
# Use pytest-flakefinder to run tests multiple times
pip install pytest-flakefinder
pytest --flake-finder --flake-runs=10 test_flaky.py
# Run tests with random order to detect order dependencies
# (requires the pytest-random-order plugin)
pip install pytest-random-order
pytest --random-order test_flaky.py
Expected output:
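Run 1:
===== 2 passed in 0.18s =====
Run 2:
===== 2 passed in 0.21s =====
Run 3:
===== 1 failed in 0.14s =====   # Intermittent! (timings are illustrative)
Run 4:
===== 2 passed in 0.19s =====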
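To turn repeated runs into a number you can track, count failures over a fixed number of runs. The sketch below is one way to do that in plain Python; flake_rate, classify, and the stand-in test sometimes_fails are hypothetical names used only for illustration, and the stand-in fails on a fixed schedule so the result is reproducible.

```python
def flake_rate(test_fn, runs=20):
    """Run test_fn repeatedly and return (failures, runs)."""
    failures = 0
    for _ in range(runs):
        try:
            test_fn()
        except AssertionError:
            failures += 1
    return failures, runs


def classify(failures, runs):
    """Label a test as stable, flaky, or broken from its failure count."""
    if failures == 0:
        return "stable"
    if failures == runs:
        return "broken"
    return "flaky"


calls = {"n": 0}


def sometimes_fails():
    # Hypothetical stand-in for a real test: fails on every fifth call.
    calls["n"] += 1
    assert calls["n"] % 5 != 0


failures, runs = flake_rate(sometimes_fails, runs=20)
print(failures, runs, classify(failures, runs))  # 4 20 flaky
```

A test that fails on every run is broken rather than flaky, and is usually easier to debug, so it is worth separating the two cases before investigating timing or ordering.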
Fix 3: Resolve Deployment Health Check Failures
# .github/workflows/deploy.yml -- Canary deployment with rollback
name: Deploy
on:
  workflow_dispatch:
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Deploy canary (10% traffic)
        run: |
          kubectl set image deployment/myapp-canary myapp=myapp:${{ github.sha }}
          kubectl rollout status deployment/myapp-canary --timeout=5m
      - name: Health check canary
        run: |
          for i in $(seq 1 12); do
            status=$(curl -s -o /dev/null -w "%{http_code}" https://canary.example.com/health)
            if [ "$status" == "200" ]; then
              echo "Canary healthy"
              exit 0
            fi
            echo "Attempt $i: status $status, waiting 5s..."
            sleep 5
          done
          echo "Canary health check failed, rolling back"
          kubectl rollout undo deployment/myapp-canary
          exit 1
      - name: Promote to production
        run: |
          kubectl set image deployment/myapp-production myapp=myapp:${{ github.sha }}
          kubectl rollout status deployment/myapp-production --timeout=5m
# Check deployment history
kubectl rollout history deployment/myapp-production
# Rollback to previous version
kubectl rollout undo deployment/myapp-production --to-revision=3
# Check pod logs for the failing revision
kubectl logs -l app=myapp --tail=100 --prefix | grep -i error
Expected output:
deployment "myapp-canary" successfully rolled out
Attempt 1: status 502, waiting 5s...
Attempt 2: status 503, waiting 5s...
...
Attempt 12: status 503, waiting 5s...
Canary health check failed, rolling back
deployment.apps/myapp-canary rolled back
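The shell loop above treats the first HTTP 200 as healthy. A slightly stricter variant, sketched here under the assumption that you want several consecutive successes before promoting, can be written with the standard library alone. The names http_status and wait_until_healthy and the required_successes parameter are illustrative choices, not part of any deployment tool.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code for url, or 0 if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses are raised as HTTPError
    except (urllib.error.URLError, OSError):
        return 0  # DNS failure, refused connection, timeout


def wait_until_healthy(probe, attempts=12, delay=5.0,
                       required_successes=3, sleep=time.sleep):
    """Poll probe() until it returns 200 required_successes times in a row."""
    streak = 0
    for attempt in range(1, attempts + 1):
        status = probe()
        streak = streak + 1 if status == 200 else 0
        if streak >= required_successes:
            return True
        if attempt < attempts:
            sleep(delay)
    return False


# Simulated probe so the example runs without a network or real delays.
statuses = iter([502, 503, 200, 200, 200])
healthy = wait_until_healthy(lambda: next(statuses), sleep=lambda _: None)
print(healthy)  # True
```

In a pipeline step you would pass something like lambda: http_status("https://canary.example.com/health") as the probe and trigger the rollback when the function returns False.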
Fix 4: Fix Secret Management Issues
# .github/workflows/secret-check.yml -- Verify secrets before running
name: Deploy with Secrets
on: [push]
jobs:
  deploy:
    runs-on: ubuntu-latest
    env:
      # Check if required secrets are set
      HAS_TOKEN: ${{ secrets.DEPLOY_TOKEN != '' }}
    steps:
      - name: Validate secrets
        run: |
          if [ "$HAS_TOKEN" != "true" ]; then
            echo "::error::DEPLOY_TOKEN secret is not set. Add it in Settings -> Secrets."
            exit 1
          fi
          echo "All required secrets are configured"
      - name: Deploy
        env:
          DEPLOY_TOKEN: ${{ secrets.DEPLOY_TOKEN }}
          API_KEY: ${{ secrets.API_KEY }}
        run: |
          # Secrets are now available as environment variables
          ./deploy.sh
# Check if a secret exists in GitHub Actions (via API)
gh secret list --repo owner/repo | grep DEPLOY_TOKEN
# Update a secret
gh secret set DEPLOY_TOKEN --body "your-token-value"
# Verify the secret is available in a pipeline step
echo "Secret length: ${#DEPLOY_TOKEN}" # Should print > 0
Expected output:
All required secrets are configured
Deploying with token of length 40...
Deployment successful
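When a deploy script needs several secrets, checking them one by one in shell gets repetitive. A minimal sketch of the same validation in Python follows; missing_secrets is a hypothetical helper, and the environment dictionary is fake data so the example runs anywhere.

```python
import os


def missing_secrets(required, env=None):
    """Return the names in required that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


# Fake environment for illustration: API_KEY is present but empty, which is
# what an unset secret looks like when it is mapped into a step's env block.
fake_env = {"DEPLOY_TOKEN": "x" * 40, "API_KEY": ""}
missing = missing_secrets(["DEPLOY_TOKEN", "API_KEY"], env=fake_env)
print(missing)  # ['API_KEY']
```

Reporting only the names of missing secrets, never their values or lengths, keeps the check safe to leave in pipeline logs.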
Fix 5: Debug Pipeline Timeouts
// Jenkinsfile -- Fix stage timeouts
pipeline {
    agent any
    options {
        timeout(time: 1, unit: 'HOURS') // Global pipeline timeout
    }
    stages {
        stage('Build') {
            options {
                timeout(time: 10, unit: 'MINUTES')
            }
            steps {
                sh 'make build'
            }
        }
        stage('Test') {
            options {
                timeout(time: 30, unit: 'MINUTES')
                retry(2) // Retry flaky steps
            }
            parallel {
                stage('Unit tests') {
                    steps { sh 'make test-unit' }
                }
                stage('Integration tests') {
                    steps { sh 'make test-integration' }
                }
            }
        }
    }
    post {
        failure {
            // The build result is already FAILURE here; no need to set it.
            // STAGE_NAME is not reliable in a pipeline-level post block,
            // so report the job and build number instead.
            echo "Pipeline failed: ${env.JOB_NAME} #${env.BUILD_NUMBER}"
        }
    }
}
# Check the slowest stage in a finished pipeline
# Jenkins: Navigate to the build -> Blue Ocean -> Stage view
# GitHub Actions: Actions tab -> workflow run -> job -> step timing
# Identify the stage with the longest wall clock time
# Reduce build time by splitting a monolith into parallel jobs
# Check for unnecessary steps (e.g., npm install on every run)
Expected output:
[Pipeline] { (Build)
[Pipeline] timeout
[Pipeline] { (Build - Timeout 10m)
...
[Pipeline] } (Build - Timeout)
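Once you have per-stage durations from the stage view, it is useful to flag stages that run close to their timeout before they start failing. The sketch below assumes you have copied durations and timeouts into dictionaries by hand; the numbers are hypothetical, and slowest_stages with its 80% threshold is an illustrative choice rather than a Jenkins feature.

```python
def slowest_stages(durations_min, budgets_min, threshold=0.8):
    """Return stages using at least `threshold` of their timeout, slowest first."""
    flagged = []
    for stage, minutes in durations_min.items():
        budget = budgets_min.get(stage)
        if budget and minutes / budget >= threshold:
            flagged.append((stage, minutes, budget))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical durations (minutes) read from a finished build, and the
# timeouts configured in the Jenkinsfile above.
durations = {"Build": 9.0, "Unit tests": 4.0, "Integration tests": 27.0}
budgets = {"Build": 10, "Unit tests": 30, "Integration tests": 30}

for stage, minutes, budget in slowest_stages(durations, budgets):
    print(f"{stage}: {minutes:.0f} of {budget} min")
```

Stages that appear in this list are candidates for splitting or parallelizing; raising the timeout alone only postpones the failure.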
CI/CD Pipeline Troubleshooting Flowchart
flowchart TD
A[Pipeline Failure] --> B{Failure at which stage?}
B -->|Build| C[Check cache hits and Docker layers]
C --> D[Verify dependency installation]
D --> E[Check for platform-specific issues]
B -->|Test| F[Run tests locally with CI env]
F --> G[Check for race conditions and timing]
G --> H[Set random test order and flake runs]
B -->|Deploy| I[Check health checks and rollback]
I --> J[Inspect pod logs and events]
J --> K[Verify secret permissions and scope]
B -->|Timeout| L[Check stage durations]
L --> M[Increase timeout or split stage]
M --> N[Parallelize slow steps]
E --> O[Pipeline Passes]
H --> O
K --> O
N --> O
Prevention Tips
- Use deterministic cache keys based on dependency file hashes to maximize cache hits
- Run flaky tests 10 times in CI and mark them as xfail if they fail more than 2 times
- Always verify required secrets exist at the start of a pipeline with a validation step
- Set appropriate timeouts per stage instead of one global timeout
- Implement canary deployments with automated rollback on health check failure
- Use Docker BuildKit cache features (cache-from/cache-to) to speed up container builds
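To make the first tip concrete, here is a small sketch of what "deterministic cache key" means: the key changes only when the dependency files change. It is an analogy for the hashFiles expression used in GitHub Actions, not a reimplementation of it; cache_key is a hypothetical helper and the lockfile contents are made up.

```python
import hashlib
import tempfile
from pathlib import Path


def cache_key(prefix, lockfiles):
    """Build a cache key from a prefix and the SHA-256 of the lockfile contents."""
    digest = hashlib.sha256()
    # Sort paths so the key does not depend on the order they were passed in.
    for path in sorted(Path(p) for p in lockfiles):
        digest.update(path.read_bytes())
    return f"{prefix}-{digest.hexdigest()[:16]}"


with tempfile.TemporaryDirectory() as tmp:
    lock = Path(tmp) / "package-lock.json"
    lock.write_text('{"lockfileVersion": 3}')
    first = cache_key("Linux-npm", [lock])
    second = cache_key("Linux-npm", [lock])
    lock.write_text('{"lockfileVersion": 3, "packages": {}}')
    changed = cache_key("Linux-npm", [lock])

print(first == second)   # True: same contents, same key
print(first == changed)  # False: contents changed, key changed
```

Keys that include timestamps, commit SHAs, or run numbers defeat this property and cause a cache miss on every build.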
Practice Questions
How do you debug a test that passes locally but fails consistently in CI?
Answer: Compare the local and CI environments: check Node.js/Python versions, environment variables, file permissions, and available memory. Run the test with the CI=true environment variable locally. Check if the test depends on a specific timezone, locale, or file path that differs between environments.

What is the difference between a GitHub Actions cache action and a Docker BuildKit cache?
Answer: The GitHub Actions cache action caches dependency directories (like ~/.npm, ~/.cache/pip) between workflow runs. Docker BuildKit cache (cache-from/cache-to type=gha) caches Docker build layers, allowing unchanged layers to skip rebuilds even on different runners. Both are needed for optimal CI/CD build times.

How do you handle secrets that expire or rotate while CI/CD pipelines reference them?
Answer: Store secret expiration dates in a metadata file and add a scheduled workflow that checks expiry and alerts before rotation. Use a secrets manager like HashiCorp Vault or AWS Secrets Manager with dynamic credentials instead of static tokens. For GitHub Actions, rotate secrets via gh secret set in an admin-only workflow.

Challenge: Write a GitHub Actions workflow that runs a canary deployment, checks HTTP 200 on the health endpoint for 5 minutes, and auto-rolls back if any check fails.
Answer:
name: Canary Deploy
on: [workflow_dispatch]
jobs:
  canary:
    runs-on: ubuntu-latest
    steps:
      - run: kubectl set image deploy/canary app=app:${{ github.sha }}
      - run: |
          for i in $(seq 1 30); do
            code=$(curl -s -o /dev/null -w "%{http_code}" https://canary.example.com/health)
            if [ "$code" != "200" ]; then
              kubectl rollout undo deploy/canary
              exit 1
            fi
            sleep 10
          done
      - run: kubectl set image deploy/prod app=app:${{ github.sha }}
Quick Reference
| Issue | Diagnostic | Resolution |
|---|---|---|
| Build cache miss | Check CACHED in build output | Fix cache keys, use BuildKit |
| Flaky test | pytest --flake-finder --flake-runs=10 | Fix race conditions, add retry |
| Secret missing | gh secret list | Set secret in repo settings |
| Health check fail | kubectl logs -l app=myapp | Rollback, fix startup probe |
| Stage timeout | Check stage duration in logs | Increase timeout or parallelize |
FAQ
How do you handle flaky integration tests that depend on external services?
Use service containers (GitHub Actions services:) or test containers to spin up dependencies in the pipeline instead of relying on shared environments. Set explicit health checks before the test step starts. If the external service is genuinely unreliable, wrap the test in a retry mechanism (max 3 retries with exponential backoff) and log the flakiness rate for infrastructure teams.
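The retry mechanism mentioned above can be sketched in a few lines. This is one possible shape, assuming a maximum of 3 retries and a delay that doubles after each failure; retry_with_backoff and flaky_call are hypothetical names, and the sleep function is injected so the example runs instantly.

```python
import time


def retry_with_backoff(fn, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on exception retry up to max_retries times, doubling the delay.

    Returns (result, retries_used). Re-raises the last exception if every
    attempt fails.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn(), attempt
        except Exception:
            if attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)


attempts = {"n": 0}
delays = []


def flaky_call():
    # Hypothetical external call that fails twice, then succeeds.
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("service unavailable")
    return "ok"


# Record the delays instead of sleeping so the example finishes immediately.
result, retries = retry_with_backoff(flaky_call, sleep=delays.append)
print(result, retries, delays)  # ok 2 [1.0, 2.0]
```

Returning the number of retries used makes it straightforward to log a flakiness rate per dependency, which is the data an infrastructure team needs to prioritize a fix.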
What is the best strategy for Docker layer caching in CI/CD across different runners?
Use the GitHub Actions cache backend for BuildKit (cache-from: type=gha, cache-to: type=gha,mode=max) so that layers can be reused across runners. The mode=max option exports layers from every build stage, including intermediate stages, rather than only the layers of the final image (the default, mode=min), so more steps can hit the cache on subsequent builds. For self-hosted runners, use a shared network volume with type=local.
Why does GITHUB_TOKEN sometimes fail with "Resource not accessible by integration"?
Depending on the repository or organization settings, the default GITHUB_TOKEN may have read-only permissions. To push commits, publish packages, or request an OIDC token, you need to explicitly grant permissions in the workflow YAML:
permissions:
  contents: write
  packages: write
  id-token: write
For cross-repo access, use a personal access token (PAT) stored as a secret instead.
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.