
LitmusChaos Advanced — Workflows, GitOps & Resilience Scores

DodaTech Updated 2026-06-23 6 min read

This tutorial covers the advanced features of LitmusChaos: the key concepts, worked examples, and the mistakes to avoid when you apply them.

Litmus's advanced capabilities include workflow orchestration with conditional branching, GitOps integration through ArgoCD and Flux, automated resilience scoring, and custom probe types. Chaos engineering teams use these features to embed resilience testing directly into their delivery pipelines.

What You Will Learn

This tutorial teaches you how to create LitmusChaos workflows with conditional logic, integrate chaos experiments into GitOps pipelines, interpret resilience scores, and write custom HTTP, command, and Prometheus probes for experiment validation.

Why It Matters

Manual chaos experiments do not scale. The advanced features of LitmusChaos let you define experiments as code, run them automatically as part of deployments, and score every service on its resilience. This turns chaos engineering from an occasional exercise into a continuous quality gate.

Real-World Use

DodaTech runs a LitmusChaos workflow after every deployment to the Doda Browser backend. The workflow runs six experiments in sequence, and the resilience score must be above 85 percent for the deployment to proceed to the next environment.

Prerequisites

Before starting you should understand:

  • Basic Litmus installation and experiment execution
  • Kubernetes custom resources and controllers
  • GitOps concepts (ArgoCD or Flux)
  • CI/CD pipeline basics

Step 1: Create a LitmusChaos Workflow with Conditions

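LitmusChaos workflows support conditional branching based on experiment results. The workflow below chains three experiments, and each later step runs only if the step before it passes: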

# advanced-workflow.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosWorkflow
metadata:
  name: deployment-qualification
spec:
  workflow:
    steps:
      - name: pod-delete-auth
        type: experiment
        experimentRef: pod-delete
        spec:
          target:
            appLabel: app=auth-service
            namespace: staging
            duration: 30s
      - name: network-loss-api
        type: experiment
        experimentRef: network-loss
        spec:
          target:
            appLabel: app=api-gateway
            namespace: staging
            duration: 60s
            packetLossPercent: 20
        dependsOn:
          - pod-delete-auth
        condition:
          - name: pod-delete-auth
            status: Pass
      - name: cpu-hog-worker
        type: experiment
        experimentRef: cpu-hog
        spec:
          target:
            node: worker-2
            cpuLoad: 70
            duration: 120s
        dependsOn:
          - network-loss-api
        condition:
          - name: network-loss-api
            status: Pass

Apply the workflow:

kubectl apply -f advanced-workflow.yaml
# Expected output:
# chaosworkflow.litmuschaos.io/deployment-qualification created

# Watch workflow execution
litmusctl get workflows --status running
# Expected output:
# NAME                      STATUS     CURRENT_STEP
# deployment-qualification  Running    network-loss-api
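
To make the gating behaviour concrete, here is a minimal Python sketch of how `dependsOn` and `condition` could interact for the three steps above. It is an illustration only, not how Litmus schedules steps internally: the `STEPS` structure, the `plan` function, and the `Skipped` outcome are all assumptions made for this example.

```python
"""Simulate dependsOn/condition gating for workflow steps (illustrative only)."""

# Hypothetical in-memory form of the steps in advanced-workflow.yaml.
STEPS = [
    {"name": "pod-delete-auth", "dependsOn": [], "condition": {}},
    {"name": "network-loss-api", "dependsOn": ["pod-delete-auth"],
     "condition": {"pod-delete-auth": "Pass"}},
    {"name": "cpu-hog-worker", "dependsOn": ["network-loss-api"],
     "condition": {"network-loss-api": "Pass"}},
]


def plan(steps, results):
    """Return {step name: outcome}, where outcome is a verdict or "Skipped".

    `results` maps each step name to the verdict it would produce if it ran.
    Steps must be listed after the steps they depend on.
    """
    outcome = {}
    for step in steps:
        # A dependency counts as satisfied only if it actually ran.
        deps_ran = all(outcome.get(d) not in (None, "Skipped")
                       for d in step["dependsOn"])
        # Every condition must match the recorded verdict exactly.
        conds_met = all(outcome.get(name) == wanted
                        for name, wanted in step["condition"].items())
        if deps_ran and conds_met:
            outcome[step["name"]] = results[step["name"]]
        else:
            outcome[step["name"]] = "Skipped"
    return outcome


if __name__ == "__main__":
    verdicts = {"pod-delete-auth": "Pass",
                "network-loss-api": "Fail",
                "cpu-hog-worker": "Pass"}
    for name, status in plan(STEPS, verdicts).items():
        print(f"{name}: {status}")
    # pod-delete-auth: Pass
    # network-loss-api: Fail
    # cpu-hog-worker: Skipped
```

Because `network-loss-api` fails in this run, the condition on `cpu-hog-worker` is not met and the step never starts, which is the behaviour the `condition` blocks in the manifest are meant to express.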

Step 2: Integrate with ArgoCD GitOps Pipeline

Store chaos experiments in the same Git repository as application manifests and sync them with ArgoCD:

# argocd-chaos-app.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: chaos-experiments
spec:
  destination:
    namespace: litmus
    server: https://kubernetes.default.svc
  source:
    repoURL: https://github.com/dodatech/deployment-manifests
    path: chaos-experiments/
    targetRevision: main
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

Apply the Application manifest:

kubectl apply -f argocd-chaos-app.yaml
# Expected output:
# application.argoproj.io/chaos-experiments created

# Verify ArgoCD syncs the chaos experiments
argocd app get chaos-experiments
# Expected output:
# Name:               chaos-experiments
# Status:             Synced
# Health:             Healthy
# Sync Policy:        Automated
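
If a pipeline should only start chaos experiments once the sync has settled, the status text can be checked programmatically. The sketch below parses output shaped like the sample above. The field names `Status` and `Health` are taken from that sample and may differ in your `argocd` version, and the function names are hypothetical.

```python
"""Check ArgoCD status text before starting experiments (illustrative only)."""

# Sample text in the shape shown above; real output may use other field names.
SAMPLE = """\
Name:               chaos-experiments
Status:             Synced
Health:             Healthy
Sync Policy:        Automated
"""


def parse_app_status(text):
    """Turn 'Key: value' lines into a dict, ignoring lines without a colon."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def ready_for_chaos(text):
    """True only when the application reports Synced and Healthy."""
    fields = parse_app_status(text)
    return fields.get("Status") == "Synced" and fields.get("Health") == "Healthy"


if __name__ == "__main__":
    print(ready_for_chaos(SAMPLE))  # True
```

Waiting for both fields avoids starting an experiment against manifests that ArgoCD is still reconciling.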

Step 3: Interpret Resilience Scores

After a workflow completes, LitmusChaos generates a resilience score:

# Get the resilience score for a workflow
litmusctl get workflow deployment-qualification --output json
# Expected output:
# {
#   "resilienceScore": 87.5,
#   "experiments": [
#     {"name": "pod-delete-auth", "result": "Pass", "score": 100},
#     {"name": "network-loss-api", "result": "Pass", "score": 100},
#     {"name": "cpu-hog-worker", "result": "Fail", "score": 0}
#   ],
#   "totalExperiments": 6,
#   "passedExperiments": 5,
#   "failedExperiments": 1
# }

A short script can parse this output and apply a threshold:

#!/usr/bin/env python3
"""Parse and analyze LitmusChaos resilience scores."""
import json
import sys

def analyze_resilience(score_data):
    passed = score_data["passedExperiments"]
    total = score_data["totalExperiments"]
    score = score_data["resilienceScore"]

    print(f"Resilience Score: {score}%")
    print(f"Passed: {passed}/{total}")

    for exp in score_data["experiments"]:
        status_icon = "PASS" if exp["result"] == "Pass" else "FAIL"
        print(f"  [{status_icon}] {exp['name']} ({exp['score']}%)")

    threshold = 80.0
    if score >= threshold:
        print(f"Threshold {threshold}% MET -- proceeding with deployment.")
        return True
    else:
        print(f"Threshold {threshold}% NOT MET -- blocking deployment.")
        return False

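if __name__ == "__main__":
    data = json.loads(sys.stdin.read())
    # Exit non-zero when the threshold is not met so a CI step can fail on it.
    sys.exit(0 if analyze_resilience(data) else 1)
    # Expected output when the JSON above is piped in: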
    # Resilience Score: 87.5%
    # Passed: 5/6
    #   [PASS] pod-delete-auth (100%)
    #   [PASS] network-loss-api (100%)
    #   [FAIL] cpu-hog-worker (0%)
    # Threshold 80.0% MET -- proceeding with deployment.
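
To see how per-experiment results could roll up into one number, the sketch below computes a weighted mean of experiment scores. This is one plausible reading of a weighted resilience score, not a statement of the exact formula Litmus applies. The `weight` values are hypothetical, because the sample output above does not show them, so the result will not reproduce the 87.5 figure.

```python
"""Weighted-mean resilience score (illustrative only)."""


def resilience_score(experiments):
    """Return the weighted mean of per-experiment scores (0-100).

    Each experiment is a dict with a "score" and an optional "weight".
    A missing weight defaults to 1, which gives a plain average.
    """
    total_weight = sum(e.get("weight", 1) for e in experiments)
    if total_weight == 0:
        raise ValueError("need at least one experiment with a non-zero weight")
    weighted_sum = sum(e["score"] * e.get("weight", 1) for e in experiments)
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Weights are made up for this example.
    experiments = [
        {"name": "pod-delete-auth", "score": 100, "weight": 10},
        {"name": "network-loss-api", "score": 100, "weight": 8},
        {"name": "cpu-hog-worker", "score": 0, "weight": 6},
    ]
    print(resilience_score(experiments))  # 75.0
```

With equal weights the same three results average to about 66.7, so the choice of weights decides how much a single failed experiment pulls the score down.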

Step 4: Write Custom Probes

LitmusChaos supports HTTP, command, and Prometheus probes for experiment validation (it also provides a Kubernetes resource probe, which this tutorial does not cover):

# custom-probe-experiment.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: pod-delete-with-probes
spec:
  definition:
    probes:
      - name: http-health-probe
        type: http
        httpProbe/inputs:
          url: http://auth-service:8080/health
          expectedResponseCode: "200"
          timeout: 5
      - name: db-query-probe
        type: cmd
        cmdProbe/inputs:
          command: pg_isready -h postgres-service -U app
          expectedOutput: "accepting connections"
          timeout: 10
      - name: metrics-probe
        type: prometheus
        promProbe/inputs:
          query: up{job="auth-service"} == 1
          threshold: 1.0

Run the experiment with custom probes:

litmusctl create experiment pod-delete-with-probes \
  --target-namespace staging \
  --app-label app=auth-service \
  --duration 30s

# Expected output:
# Probes configured: 3
#   - http-health-probe (HTTP)
#   - db-query-probe (Command)
#   - metrics-probe (Prometheus)
# Experiment pod-delete-with-probes scheduled successfully
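
Conceptually, a command probe runs a command, waits up to a timeout, and compares the output with an expected value. The Python sketch below mimics that behaviour for local experimentation. It is not the Litmus probe implementation, `run_cmd_probe` is a hypothetical name, and a Python one-liner stands in for `pg_isready` so the example runs without a database.

```python
"""Mimic a command probe: run, time out, compare output (illustrative only)."""
import subprocess
import sys


def run_cmd_probe(command, expected_output, timeout):
    """Return True if `command` exits 0 in time and stdout contains the text.

    `command` is an argv list; `timeout` is in seconds.
    """
    try:
        proc = subprocess.run(command, capture_output=True, text=True,
                              timeout=timeout)
    except (subprocess.TimeoutExpired, OSError):
        # A hung or missing command counts as a failed probe.
        return False
    return proc.returncode == 0 and expected_output in proc.stdout


if __name__ == "__main__":
    # Stand-in for: pg_isready -h postgres-service -U app
    fake_pg_isready = [sys.executable, "-c",
                       "print('postgres-service:5432 - accepting connections')"]
    print(run_cmd_probe(fake_pg_isready, "accepting connections", timeout=10))
    # True
```

Treating a timeout as a failure rather than an error keeps the probe verdict binary, which is what a pass/fail experiment result needs.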

Learning Path

flowchart LR
  A[Litmus Basics] --> B[Litmus Advanced]
  B --> C[Chaos Engineering Pipeline]
  C --> D[Game Days]
  D --> E[Chaos Observability]
  style B fill:#f90,color:#fff

Common Errors

  1. Workflow steps without dependencies when order matters: If step C depends on step A succeeding and you omit the dependency, Litmus may run them in parallel and the results will be unreliable.
  2. Resilience thresholds that are too strict or too lenient: A threshold of 100 percent will block most deployments, while a threshold below 60 percent lets through workflows in which a large share of experiments failed. Start at 80 percent and adjust based on historical data.
  3. Automated GitOps sync interfering with running experiments: If ArgoCD prunes or reverts the resources of a running experiment on the next sync cycle, the experiment will be aborted. Use sync waves or manual sync for active experiments.
  4. Probe timeouts that are too short for degraded conditions: During an experiment the system is under stress. Probe timeouts should be 2-3 times longer than normal health check timeouts.
  5. Not version-pinning experiment definitions in Git: If an experiment definition changes without your knowledge, you may compare results across different experiment versions. Use Git tags for experiment definitions.
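
The first error in the list can be caught before a workflow is applied. The sketch below is a small lint pass over steps shaped like those in advanced-workflow.yaml. It flags a `condition` that refers to a step missing from `dependsOn`, and a `dependsOn` entry that names a step that does not exist. The `lint_steps` helper is hypothetical and not part of Litmus.

```python
"""Lint workflow steps for missing or unknown dependencies (illustrative only)."""


def lint_steps(steps):
    """Return a list of human-readable problems; an empty list means clean."""
    names = {step["name"] for step in steps}
    problems = []
    for step in steps:
        deps = set(step.get("dependsOn", []))
        # Dependencies must refer to steps defined in the same workflow.
        for dep in sorted(deps - names):
            problems.append(f"{step['name']}: dependsOn unknown step '{dep}'")
        # A condition on a step only makes sense if that step runs first.
        for cond in step.get("condition", []):
            if cond["name"] not in deps:
                problems.append(
                    f"{step['name']}: condition on '{cond['name']}' "
                    "without a matching dependsOn")
    return problems


if __name__ == "__main__":
    steps = [
        {"name": "pod-delete-auth"},
        {"name": "network-loss-api",
         "condition": [{"name": "pod-delete-auth", "status": "Pass"}]},
    ]
    for problem in lint_steps(steps):
        print(problem)
    # network-loss-api: condition on 'pod-delete-auth' without a matching dependsOn
```

Running a check like this in code review or in CI keeps ordering mistakes out of the cluster, where they would otherwise show up only as unreliable results.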

Practice Questions

  1. How does LitmusChaos calculate the resilience score from multiple experiments?
  2. What probe types does LitmusChaos support for experiment validation?
  3. How do you add conditional execution to a LitmusChaos workflow?
  4. What is the benefit of storing chaos experiments in the same Git repository as application manifests?
  5. How do you set a resilience score threshold in a CI/CD pipeline?

Challenge

Create a LitmusChaos workflow that runs after every staging deployment with five experiments: pod-delete, network-loss, cpu-hog, node-drain, and dns-error. Configure HTTP probes on each service and a Prometheus probe on cluster health. Set the resilience threshold to 85 percent. If the score is below the threshold, prevent the deployment from promoting to production by failing the CI/CD pipeline step.

FAQ

What is a LitmusChaos resilience score?

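The resilience score is a percentage that summarizes how the experiments in a workflow performed. It is derived from each experiment's result, weighted by the weight assigned to that experiment, so a failure in a heavily weighted experiment lowers the score more than a failure in a lightly weighted one.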

How does GitOps integration work with LitmusChaos?

LitmusChaos experiments are stored as YAML files in a Git repository. ArgoCD or Flux syncs these files to the cluster, ensuring experiments are version-controlled and reviewed before execution.

What custom probe types does LitmusChaos support?

LitmusChaos supports HTTP probes (checking response codes), command probes (running shell commands), Prometheus probes (querying metrics), and Kubernetes probes (checking the state of cluster resources).

Can LitmusChaos workflows run in parallel?

Yes. Workflow steps without dependencies run in parallel. You can also use the Parallel template type to explicitly define parallel execution branches.
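
One way to picture this is to group steps into "waves", where every step in a wave has all of its dependencies satisfied by earlier waves and could therefore run at the same time. The sketch below does that for a dependency map. It illustrates the idea only and is not the scheduler Litmus uses; `parallel_waves` and the extra `dns-error` step are assumptions for this example.

```python
"""Group workflow steps into waves that can run in parallel (illustrative only)."""


def parallel_waves(deps):
    """Return a list of waves; `deps` maps a step to the steps it depends on."""
    remaining = {name: set(required) for name, required in deps.items()}
    done, waves = set(), []
    while remaining:
        # A step is ready once everything it depends on has finished.
        ready = sorted(name for name, required in remaining.items()
                       if required <= done)
        if not ready:
            raise ValueError("dependency cycle or unknown step among: "
                             + ", ".join(sorted(remaining)))
        waves.append(ready)
        done.update(ready)
        for name in ready:
            del remaining[name]
    return waves


if __name__ == "__main__":
    deps = {
        "pod-delete-auth": [],
        "dns-error": [],  # hypothetical extra step with no dependencies
        "network-loss-api": ["pod-delete-auth"],
        "cpu-hog-worker": ["network-loss-api"],
    }
    for number, wave in enumerate(parallel_waves(deps), start=1):
        print(number, wave)
    # 1 ['dns-error', 'pod-delete-auth']
    # 2 ['network-loss-api']
    # 3 ['cpu-hog-worker']
```

The first wave shows why omitted dependencies matter: two steps with no declared relationship land in the same wave, whether or not that was intended.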

How do I integrate LitmusChaos into a CI/CD pipeline?

Run the Litmus CLI or API from your pipeline after deployment, execute the workflow, check the resilience score, and fail the pipeline step if the score is below the configured threshold.
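
The last part of that flow, turning a score into a pipeline verdict, can be sketched in a few lines of Python. The example assumes the workflow report is JSON with a `resilienceScore` field, as in the sample output in Step 3. The environment variable name `RESILIENCE_THRESHOLD` and both function names are hypothetical.

```python
"""Turn a resilience report into a CI exit code (illustrative only)."""
import json

DEFAULT_THRESHOLD = 80.0  # the starting point suggested under Common Errors


def threshold_from_env(env, name="RESILIENCE_THRESHOLD"):
    """Read the threshold from a mapping such as os.environ."""
    return float(env.get(name, DEFAULT_THRESHOLD))


def gate_exit_code(report_json, threshold):
    """Return 0 when resilienceScore meets the threshold, otherwise 1."""
    score = json.loads(report_json)["resilienceScore"]
    return 0 if score >= threshold else 1
```

In a pipeline step, the report would be piped in and the result passed to the shell, for example `sys.exit(gate_exit_code(sys.stdin.read(), threshold_from_env(os.environ)))`. Reading the threshold from the environment lets each stage (staging, pre-production) apply its own bar without changing the script.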
