LitmusChaos Advanced — Workflows, GitOps & Resilience Scores
This tutorial covers the advanced features of LitmusChaos, with the key concepts, practical examples, and best practices you need to apply them.
Litmus advanced capabilities include workflow orchestration with conditional branching, GitOps integration through ArgoCD and Flux, automated resilience scoring, and custom probe types. Chaos engineering teams use these features to embed resilience testing directly into their delivery pipelines.
What You Will Learn
This tutorial teaches you how to create LitmusChaos workflows with conditional logic, integrate chaos experiments into GitOps pipelines, interpret resilience scores, and write custom HTTP and command probes for experiment validation.
Why It Matters
Manual chaos experiments do not scale. LitmusChaos advanced features let you define experiments as code, run them automatically as part of deployments, and score every service on its resilience. This transforms Chaos Engineering from an occasional exercise into a continuous quality gate.
Real-World Use
DodaTech runs a LitmusChaos workflow after every deployment to the Doda Browser backend. The workflow runs six experiments in sequence, and the resilience score must be above 85 percent for the deployment to proceed to the next environment.
Prerequisites
Before starting you should understand:
- Basic Litmus installation and experiment execution
- Kubernetes custom resources and controllers
- GitOps concepts (ArgoCD or Flux)
- CI/CD pipeline basics
Step 1: Create a LitmusChaos Workflow with Conditions
LitmusChaos workflows support conditional branching based on experiment results:
# advanced-workflow.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosWorkflow
metadata:
  name: deployment-qualification
spec:
  workflow:
    steps:
      - name: pod-delete-auth
        type: experiment
        experimentRef: pod-delete
        spec:
          target:
            appLabel: app=auth-service
            namespace: staging
          duration: 30s
      - name: network-loss-api
        type: experiment
        experimentRef: network-loss
        spec:
          target:
            appLabel: app=api-gateway
            namespace: staging
          duration: 60s
          packetLossPercent: 20
        dependsOn:
          - pod-delete-auth
        condition:
          - name: pod-delete-auth
            status: Pass
      - name: cpu-hog-worker
        type: experiment
        experimentRef: cpu-hog
        spec:
          target:
            node: worker-2
          cpuLoad: 70
          duration: 120s
        dependsOn:
          - network-loss-api
        condition:
          - name: network-loss-api
            status: Pass
kubectl apply -f advanced-workflow.yaml
# Expected output:
# chaosworkflow.litmuschaos.io/deployment-qualification created
# Watch workflow execution
litmusctl get workflows --status running
# Expected output:
# NAME STATUS CURRENT_STEP
# deployment-qualification Running network-loss-api
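To make the gating behavior concrete, here is a minimal Python sketch of how dependsOn and condition could decide which steps run. It is a simulation of the idea only, not Litmus code: the function name plan_steps and the "Skipped" label are assumptions made for illustration, and the step outcomes are supplied by hand.

```python
"""Sketch: decide which workflow steps run, given dependsOn and condition rules."""


def plan_steps(steps, results):
    """Return (step name, decision) pairs in execution order.

    steps   -- list of dicts with "name" and optional "dependsOn" / "condition"
    results -- dict mapping step name to a simulated outcome ("Pass" or "Fail")
    """
    decisions = {}
    ordered = []
    pending = list(steps)
    while pending:
        progressed = False
        for step in list(pending):
            deps = step.get("dependsOn", [])
            # A step is ready only once every dependency has a decision.
            if any(dep not in decisions for dep in deps):
                continue
            conditions = step.get("condition", [])
            # Run only if every referenced step reached the required status.
            allowed = all(decisions.get(c["name"]) == c["status"] for c in conditions)
            decisions[step["name"]] = results[step["name"]] if allowed else "Skipped"
            ordered.append((step["name"], decisions[step["name"]]))
            pending.remove(step)
            progressed = True
        if not progressed:
            raise ValueError("unresolvable or cyclic dependsOn")
    return ordered


# Hypothetical run mirroring the workflow above: the second experiment fails.
steps = [
    {"name": "pod-delete-auth"},
    {"name": "network-loss-api", "dependsOn": ["pod-delete-auth"],
     "condition": [{"name": "pod-delete-auth", "status": "Pass"}]},
    {"name": "cpu-hog-worker", "dependsOn": ["network-loss-api"],
     "condition": [{"name": "network-loss-api", "status": "Pass"}]},
]
simulated = {"pod-delete-auth": "Pass", "network-loss-api": "Fail", "cpu-hog-worker": "Pass"}
for name, decision in plan_steps(steps, simulated):
    print(f"{name}: {decision}")
```

In this simulated run, cpu-hog-worker is skipped because its condition requires network-loss-api to pass.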
Step 2: Integrate with ArgoCD GitOps Pipeline
Store chaos experiments in the same Git repository as application manifests and sync them with ArgoCD:
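# argocd-chaos-app.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: chaos-experiments
  # Assumes Argo CD is installed in the conventional "argocd" namespace.
  namespace: argocd
spec:
  # spec.project is required; "default" is Argo CD's built-in project.
  project: default
  destination:
    namespace: litmus
    server: https://kubernetes.default.svc
  source:
    repoURL: https://github.com/dodatech/deployment-manifests
    path: chaos-experiments/
    targetRevision: main
  syncPolicy:
    automated:
      prune: true
      selfHeal: true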
kubectl apply -f argocd-chaos-app.yaml
# Expected output:
# application.argoproj.io/chaos-experiments created
# Verify ArgoCD syncs the chaos experiments
argocd app get chaos-experiments
# Expected output:
# Name: chaos-experiments
# Status: Synced
# Health: Healthy
# Sync Policy: Automated
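A pipeline usually needs this check in machine-readable form rather than by reading the CLI table. The sketch below assumes the application state has been exported as JSON (for example with argocd app get chaos-experiments -o json) and that the sync and health states appear under status.sync.status and status.health.status, as they do in the Application resource. The function name is_synced_and_healthy is hypothetical.

```python
"""Sketch: check an Argo CD Application's sync and health from its JSON form."""
import json


def is_synced_and_healthy(app):
    """Return True only if the Application reports Synced and Healthy."""
    status = app.get("status", {})
    sync_state = status.get("sync", {}).get("status")
    health_state = status.get("health", {}).get("status")
    return sync_state == "Synced" and health_state == "Healthy"


# Hand-written sample document standing in for real CLI output.
sample = json.loads("""
{
  "metadata": {"name": "chaos-experiments"},
  "status": {
    "sync": {"status": "Synced"},
    "health": {"status": "Healthy"}
  }
}
""")
print(is_synced_and_healthy(sample))
```

Gating chaos runs on this check avoids starting a workflow while the experiment definitions are still out of sync with Git.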
Step 3: Interpret Resilience Scores
After a workflow completes, LitmusChaos generates a resilience score:
# Get the resilience score for a workflow
litmusctl get workflow deployment-qualification --output json
# Expected output:
# {
#   "resilienceScore": 87.5,
#   "experiments": [
#     {"name": "pod-delete-auth", "result": "Pass", "score": 100},
#     {"name": "network-loss-api", "result": "Pass", "score": 100},
#     {"name": "cpu-hog-worker", "result": "Fail", "score": 0}
#   ],
#   "totalExperiments": 6,
#   "passedExperiments": 5,
#   "failedExperiments": 1
# }
#!/usr/bin/env python3
"""Parse and analyze LitmusChaos resilience scores."""
import json
import sys


def analyze_resilience(score_data, threshold=80.0):
    """Print a summary and return True if the score meets the threshold."""
    passed = score_data["passedExperiments"]
    total = score_data["totalExperiments"]
    score = score_data["resilienceScore"]
    print(f"Resilience Score: {score}%")
    print(f"Passed: {passed}/{total}")
    # One line per experiment with its individual result and score.
    for exp in score_data["experiments"]:
        status_icon = "PASS" if exp["result"] == "Pass" else "FAIL"
        print(f" [{status_icon}] {exp['name']} ({exp['score']}%)")
    if score >= threshold:
        print(f"Threshold {threshold}% MET -- proceeding with deployment.")
        return True
    print(f"Threshold {threshold}% NOT MET -- blocking deployment.")
    return False


if __name__ == "__main__":
    data = json.load(sys.stdin)
    # Exit non-zero when the threshold is missed so the CI/CD step fails.
    sys.exit(0 if analyze_resilience(data) else 1)
# Expected output when the JSON above is piped in:
# Resilience Score: 87.5%
# Passed: 5/6
# [PASS] pod-delete-auth (100%)
# [PASS] network-loss-api (100%)
# [FAIL] cpu-hog-worker (0%)
# Threshold 80.0% MET -- proceeding with deployment.
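Litmus describes the resilience score as a weighted average of per-experiment results, so a short sketch may help when reading the number. The weights below are hypothetical, since the sample output above does not expose them, and the figures are illustrative rather than a reproduction of the 87.5 shown earlier.

```python
"""Sketch: resilience score as a weighted average of experiment scores."""


def resilience_score(experiments):
    """Return sum(weight * score) / sum(weight), rounded to one decimal.

    Each experiment is a dict with a "score" (0-100) and a "weight".
    """
    total_weight = sum(exp["weight"] for exp in experiments)
    if total_weight == 0:
        raise ValueError("at least one experiment needs a non-zero weight")
    weighted_sum = sum(exp["weight"] * exp["score"] for exp in experiments)
    return round(weighted_sum / total_weight, 1)


# Hypothetical weights: the CPU experiment counts half as much as the others.
weighted = [
    {"name": "pod-delete-auth", "score": 100, "weight": 10},
    {"name": "network-loss-api", "score": 100, "weight": 10},
    {"name": "cpu-hog-worker", "score": 0, "weight": 5},
]
print(resilience_score(weighted))  # -> 80.0
```

With equal weights the same three results average to 66.7, which shows how much the weighting affects whether a threshold is met.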
Step 4: Write Custom Probes
LitmusChaos supports HTTP, command, and Prometheus probes for experiment validation:
# custom-probe-experiment.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: pod-delete-with-probes
spec:
  definition:
    probes:
      - name: http-health-probe
        type: http
        httpProbe/inputs:
          url: http://auth-service:8080/health
          expectedResponseCode: "200"
        timeout: 5
      - name: db-query-probe
        type: cmd
        cmdProbe/inputs:
          command: pg_isready -h postgres-service -U app
          expectedOutput: "accepting connections"
        timeout: 10
      - name: metrics-probe
        type: prometheus
        promProbe/inputs:
          query: 'up{job="auth-service"} == 1'
          threshold: 1.0
# Run the experiment with custom probes
litmusctl create experiment pod-delete-with-probes \
--target-namespace staging \
--app-label app=auth-service \
--duration 30s
# Expected output:
# Probes configured: 3
# - http-health-probe (HTTP)
# - db-query-probe (Command)
# - metrics-probe (Prometheus)
# Experiment pod-delete-with-probes scheduled successfully
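The logic behind a command probe is small enough to sketch in Python: run a command, enforce a timeout, and compare the output with the expected text. This is an illustration of the idea, not how Litmus implements probes. The function name cmd_probe is hypothetical, and the demo runs a harmless Python one-liner instead of pg_isready so it works without a database.

```python
"""Sketch: the pass/fail logic of a command probe with a timeout."""
import subprocess
import sys


def cmd_probe(args, expected_output, timeout_s):
    """Run a command and report whether its stdout contains the expected text.

    A timeout or a missing executable counts as a probe failure.
    """
    try:
        completed = subprocess.run(
            args, capture_output=True, text=True, timeout=timeout_s
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False
    return expected_output in completed.stdout


if __name__ == "__main__":
    # Stand-in for `pg_isready`: a command that prints the expected phrase.
    demo = [sys.executable, "-c", "print('accepting connections')"]
    print(cmd_probe(demo, "accepting connections", timeout_s=10))
```

Treating a timeout as a failure rather than an error keeps the probe result a simple boolean, which is what a pass/fail verdict needs.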
Learning Path
flowchart LR
    A[Litmus Basics] --> B[Litmus Advanced]
    B --> C[Chaos Engineering Pipeline]
    C --> D[Game Days]
    D --> E[Chaos Observability]
    style B fill:#f90,color:#fff
Common Errors
- Workflow steps without dependencies when order matters: If step C depends on step A succeeding and you omit the dependency, Litmus may run them in parallel and the results will be unreliable.
- Resilience thresholds that are too strict or too lenient: A threshold of 100 percent will block most deployments. A threshold below 60 percent is meaningless. Start at 80 percent and adjust based on historical data.
- Automated GitOps sync that interferes with running experiments: If ArgoCD prunes or reverts an experiment resource on the next sync cycle while it is still running, the experiment will be aborted. Use sync waves or manual sync for active experiments.
- Probe timeouts that are too short for degraded conditions: During an experiment the system is under stress. Probe timeouts should be 2-3 times longer than normal health check timeouts.
- Not version-pinning experiment definitions in Git: If an experiment definition changes without your knowledge, you may compare results across different experiment versions. Use Git tags for experiment definitions.
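The first error in the list can be caught before a workflow is applied. The sketch below lints the step list from Step 1 for missing ordering. It assumes the steps have already been loaded into Python dicts with the same field names (name, dependsOn, condition), and the function name lint_step_ordering is hypothetical.

```python
"""Sketch: flag workflow steps whose ordering is not declared explicitly."""


def lint_step_ordering(steps):
    """Return a list of human-readable problems found in the step list."""
    problems = []
    for index, step in enumerate(steps):
        deps = set(step.get("dependsOn", []))
        # Any step after the first without dependsOn has no guaranteed order.
        if index > 0 and not deps:
            problems.append(
                f"{step['name']}: no dependsOn, may run in parallel with earlier steps"
            )
        # A condition on a step that is not also a dependency is unreliable.
        for cond in step.get("condition", []):
            if cond["name"] not in deps:
                problems.append(
                    f"{step['name']}: condition on {cond['name']} without matching dependsOn"
                )
    return problems


# Hypothetical step list in which the second step forgot its dependsOn.
steps_to_check = [
    {"name": "pod-delete-auth"},
    {"name": "network-loss-api",
     "condition": [{"name": "pod-delete-auth", "status": "Pass"}]},
]
for problem in lint_step_ordering(steps_to_check):
    print(problem)
```

Running a check like this in CI, before kubectl apply, turns an intermittent ordering bug into an immediate and readable failure.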
Practice Questions
- How does LitmusChaos calculate the resilience score from multiple experiments?
- What probe types does LitmusChaos support for experiment validation?
- How do you add conditional execution to a LitmusChaos workflow?
- What is the benefit of storing chaos experiments in the same Git repository as application manifests?
- How do you set a resilience score threshold in a CI/CD pipeline?
Challenge
Create a LitmusChaos workflow that runs after every staging deployment with five experiments: pod-delete, network-loss, cpu-hog, node-drain, and dns-error. Configure HTTP probes on each service and a Prometheus probe on cluster health. Set the resilience threshold to 85 percent. If the score is below the threshold, prevent the deployment from promoting to production by failing the CI/CD pipeline step.