
Azure Chaos Pipeline — Automated Experiments with DevOps

DodaTech Updated 2026-06-23 6 min read

This tutorial shows how to run Azure Chaos Studio experiments automatically from a delivery pipeline, covering the key concepts, worked examples, and common pitfalls.

Building an automated Azure Chaos Engineering pipeline requires connecting Azure Chaos Studio to your Azure DevOps or GitHub Actions workflows, deploying experiments as infrastructure as code, and integrating with Azure Monitor for automated safety controls.

What You Will Learn

This tutorial teaches you how to deploy Azure Chaos Studio experiments as ARM templates, trigger experiments from Azure DevOps pipelines, configure automated safety guards with Azure Monitor, and build a continuous validation workflow.

Why It Matters

Azure Chaos Studio experiments deployed as infrastructure as code give you repeatable, auditable, and version-controlled resilience testing. An automated pipeline adds a resilience validation step to every infrastructure deployment, so configuration errors are caught before they reach production.

Real-World Use

DodaTech includes a Chaos Studio experiment in every Azure DevOps release pipeline for the Doda Browser backend services running on AKS. The experiment kills one pod from each deployment and verifies that error rates stay below the SLO threshold before the release is approved.
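
The approval rule described above (error rate must stay below the SLO threshold while the fault is active) can be sketched in a few lines of Python. This is a minimal illustration, not DodaTech's actual gate: the `RequestWindow` shape, the function names, and the 0.5% threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestWindow:
    """Request counts observed while the fault was active (hypothetical shape)."""
    total: int
    failed: int


def error_rate(window: RequestWindow) -> float:
    """Return the fraction of failed requests; an empty window yields 0.0."""
    if window.total == 0:
        return 0.0
    return window.failed / window.total


def release_approved(window: RequestWindow, slo_error_rate: float) -> bool:
    """Approve only when traffic was observed and the error rate is below the SLO."""
    if window.total == 0:
        # No traffic means the experiment proved nothing, so do not approve.
        return False
    return error_rate(window) < slo_error_rate


if __name__ == "__main__":
    observed = RequestWindow(total=12_000, failed=18)
    print(f"error rate: {error_rate(observed):.4%}")  # error rate: 0.1500%
    verdict = release_approved(observed, slo_error_rate=0.005)
    print("approved" if verdict else "rejected")  # approved
```

Rejecting an empty window is a deliberate choice here: a pod kill that happens while no requests are flowing says nothing about resilience, so it should not count as a pass.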

Prerequisites

Before starting you should understand:

  • Azure Chaos Studio basics from introductory tutorials
  • Chaos Engineering experiment design
  • ARM template or Bicep infrastructure as code
  • Azure DevOps or GitHub Actions

Step 1: Deploy Chaos Studio as Infrastructure as Code

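Define the Chaos Studio target and experiment in an ARM template. The template assumes the AKS cluster lives in the deployment resource group and that Chaos Mesh and the pod chaos capability are already enabled on it: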

{
  "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "aksClusterName": {
      "type": "string",
      "defaultValue": "my-aks-cluster"
    },
    "targetResourceGroup": {
      "type": "string",
      "defaultValue": "chaos-rg"
    }
  },
  "resources": [
    {
      "type": "Microsoft.ContainerService/managedClusters/providers/targets",
      "apiVersion": "2023-11-01",
      "name": "[format('{0}/Microsoft.Chaos/Microsoft-AzureKubernetesServiceChaosMesh', parameters('aksClusterName'))]",
      "location": "eastus",
      "properties": {}
    },
    {
      "type": "Microsoft.Chaos/experiments",
      "apiVersion": "2023-11-01",
      "name": "post-deployment-chaos",
      "location": "eastus",
      "identity": {
        "type": "SystemAssigned"
      },
      "properties": {
        "steps": [
          {
            "name": "AKS Validation",
            "branches": [
              {
                "name": "Pod Kill Test",
                "actions": [
                  {
                    "type": "continuous",
                    "name": "urn:csci:microsoft:azureKubernetesServiceChaosMesh:podChaos/2.2",
                    "selectorId": "TargetAKS",
                    "parameters": [
                      {"key": "jsonSpec", "value": "{\"action\":\"pod-kill\",\"mode\":\"one\",\"selector\":{\"namespaces\":[\"production\"],\"labelSelectors\":{\"app\":\"web-service\"}}}"}
                    ],
                    "duration": "PT60S"
                  }
                ]
              }
            ]
          }
        ],
        "selectors": [
          {
            "id": "TargetAKS",
            "type": "List",
            "targets": [
              {
                "type": "ChaosTarget",
                "id": "[extensionResourceId(resourceId('Microsoft.ContainerService/managedClusters', parameters('aksClusterName')), 'Microsoft.Chaos/targets', 'Microsoft-AzureKubernetesServiceChaosMesh')]"
              }
            ]
          }
        ]
      }
    }
  ],
  "outputs": {
    "experimentName": {
      "type": "string",
      "value": "post-deployment-chaos"
    }
  }
}

Deploy the template to the resource group:

az deployment group create \
  --resource-group chaos-rg \
  --template-file chaos-studio-template.json \
  --parameters aksClusterName=my-aks-cluster

# Expected output:
# {
#   "provisioningState": "Succeeded",
#   "outputs": {
#     "experimentName": { "type": "String", "value": "post-deployment-chaos" }
#   }
# }

Step 2: Trigger Experiments from Azure DevOps Pipeline

Add a chaos step to your Azure DevOps pipeline:

# azure-pipelines-chaos.yml
trigger:
  - main

pool:
  vmImage: ubuntu-latest

variables:
  resourceGroup: chaos-rg
  experimentName: post-deployment-chaos

stages:
  - stage: Deploy
    jobs:
      - job: DeployApp
        steps:
          - task: AzureCLI@2
            displayName: Deploy application
            inputs:
              azureSubscription: chaos-service-connection
              scriptType: bash
              scriptLocation: inlineScript
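              inlineScript: |
                echo "Deploying application to AKS..."
                # Fetch cluster credentials so kubectl can reach the cluster
                az aks get-credentials \
                  --resource-group $(resourceGroup) \
                  --name my-aks-cluster \
                  --overwrite-existing
                kubectl apply -f k8s-manifests/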

  - stage: ChaosValidation
    dependsOn: Deploy
    jobs:
      - job: RunChaos
        steps:
          - task: AzureCLI@2
            displayName: Run Chaos Studio experiment
            inputs:
              azureSubscription: chaos-service-connection
              scriptType: bash
              scriptLocation: inlineScript
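              inlineScript: |
                set -euo pipefail

                echo "Starting chaos experiment..."
                az chaos experiment start \
                  --name $(experimentName) \
                  --resource-group $(resourceGroup)

                # The fault runs for 60 seconds; allow extra time for setup and teardown
                echo "Waiting for experiment to complete..."
                sleep 90

                status=$(az chaos experiment list-execution-details \
                  --name $(experimentName) \
                  --resource-group $(resourceGroup) \
                  --query "[0].status" -o tsv)
                echo "Experiment status: $status"

                # Fail this stage unless the run succeeded, so PromoteOrRollback is skipped.
                # Adjust the expected value if your CLI version reports a different status string.
                if [ "$status" != "Succeeded" ]; then
                  echo "##vso[task.logissue type=error]Chaos experiment ended with status: $status"
                  exit 1
                fi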

  - stage: PromoteOrRollback
    dependsOn: ChaosValidation
    condition: succeeded()
    jobs:
      - job: Promote
        steps:
          - script: echo "Chaos validation passed. Promoting to production."

Expected output in Azure DevOps:

Deploy application: Completed
Run Chaos Studio experiment: Started
Waiting for experiment to complete...
Experiment status: Succeeded
Chaos validation passed. Promoting to production.
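
A fixed `sleep 90` either wastes time or ends before a slow experiment finishes. A polling loop with a deadline is a common alternative, sketched below in Python. The `get_status` callable is a hypothetical stand-in for whatever command or API call returns the execution status, and the status strings in the two sets are assumptions that should be replaced with the values your tooling actually reports.

```python
import time
from typing import Callable

# Assumed status strings; substitute the values your API or CLI version reports.
TERMINAL_OK = {"Success", "Succeeded"}
TERMINAL_BAD = {"Failed", "Cancelled"}


def wait_for_experiment(
    get_status: Callable[[], str],
    timeout_s: float = 300.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll get_status until a terminal state is seen.

    Returns True on success and False on failure or cancellation.
    Raises TimeoutError if no terminal state is reached before the deadline.
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL_OK:
            return True
        if status in TERMINAL_BAD:
            return False
        if clock() >= deadline:
            raise TimeoutError(f"experiment still '{status}' after {timeout_s}s")
        sleep(interval_s)


if __name__ == "__main__":
    # Simulated status sequence; a real pipeline would query the service instead.
    simulated = iter(["Running", "Running", "Success"])
    ok = wait_for_experiment(lambda: next(simulated), sleep=lambda _: None)
    print("passed" if ok else "failed")  # passed
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting. In a pipeline, a script like this would exit non-zero on `False` or `TimeoutError` so that the next stage does not run.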

Step 3: Configure Automated Safety Guards

Send experiment logs to a Log Analytics workspace so that Azure Monitor alerts and queries can act on them:

# Create a log analytics workspace for experiment telemetry
az monitor log-analytics workspace create \
  --resource-group chaos-rg \
  --workspace-name chaos-telemetry

# Expected output (abridged; customerId is the workspace ID used in queries):
# {
#   "customerId": "ws-abc-123",
#   "provisioningState": "Succeeded"
# }

# Configure diagnostic settings for Chaos Studio
az monitor diagnostic-settings create \
  --name chaos-logs \
  --resource "/subscriptions/.../chaos/experiments/post-deployment-chaos" \
  --workspace chaos-telemetry \
  --logs '[{"category": "ChaosExperiment", "enabled": true}]' \
  --metrics '[{"category": "AllMetrics", "enabled": true}]'

# Expected output:
# {
#   "name": "chaos-logs",
#   "provisioningState": "Succeeded"
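# }

Query the collected telemetry from Python (requires the azure-identity and azure-monitor-query packages):
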
#!/usr/bin/env python3
"""Query Chaos Studio experiment telemetry from Azure Monitor."""
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

credential = DefaultAzureCredential()
client = LogsQueryClient(credential)

# Workspace (customer) ID of the Log Analytics workspace created above
WORKSPACE_ID = "ws-abc-123"
QUERY = """
ChaosExperiment
| where TimeGenerated > ago(1h)
| project TimeGenerated, ExperimentName, Status, FaultName
| order by TimeGenerated desc
"""

# timespan is a required keyword argument of query_workspace
response = client.query_workspace(WORKSPACE_ID, QUERY, timespan=timedelta(hours=1))

for row in response.tables[0].rows:
    print(f"Time: {row[0]}, Experiment: {row[1]}, Status: {row[2]}, Fault: {row[3]}")

# Expected output:
# Time: 2026-06-23T10:00:00Z, Experiment: post-deployment-chaos, Status: Succeeded, Fault: aks-pod-kill
# Time: 2026-06-23T09:00:00Z, Experiment: post-deployment-chaos, Status: Succeeded, Fault: aks-pod-kill
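
The commands in this step collect telemetry; the decision to stop an experiment still has to be made somewhere. One way to sketch that decision logic is shown below. Everything in it is an assumption for illustration: the `guard_experiment` name, the error-rate samples, the threshold, and the `cancel` callback, which stands in for whatever call cancels the running experiment in your setup.

```python
from typing import Callable, Iterable


def guard_experiment(
    error_rates: Iterable[float],
    threshold: float,
    cancel: Callable[[], None],
    consecutive: int = 3,
) -> bool:
    """Call cancel() once `consecutive` samples in a row exceed the threshold.

    Returns True if the experiment was cancelled, or False if every sample
    was consumed without a sustained breach.
    """
    breaches = 0
    for rate in error_rates:
        # A healthy sample resets the streak, so brief spikes are tolerated.
        breaches = breaches + 1 if rate > threshold else 0
        if breaches >= consecutive:
            cancel()
            return True
    return False


if __name__ == "__main__":
    samples = [0.01, 0.06, 0.07, 0.02, 0.06, 0.08, 0.09]
    cancelled = guard_experiment(
        samples,
        threshold=0.05,
        cancel=lambda: print("cancelling experiment"),
    )
    print("cancelled" if cancelled else "completed")  # cancelled
```

Requiring several consecutive breaches is a trade-off: it avoids aborting on a single noisy sample, at the cost of reacting a few sampling intervals later.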

Step 4: Add Approval Gates for Production Experiments

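Define a multi-stage pipeline that runs chaos in staging first and gates production behind a manual approval. The approval itself is configured as a check on the production environment in Azure DevOps; the deployment job below waits for it: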

# release-chaos-approval.yaml
stages:
  - stage: StagingChaos
    jobs:
      - job: RunStagingChaos
        steps:
          - script: echo "Running chaos in staging..."
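          - task: AzureCLI@2
            displayName: Run staging chaos experiment
            inputs:
              azureSubscription: chaos-service-connection
              scriptType: bash
              scriptLocation: inlineScript
              inlineScript: |
                az chaos experiment start \
                  --name staging-chaos \
                  --resource-group chaos-rg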

  - stage: ProductionApproval
    dependsOn: StagingChaos
    condition: succeeded()
    jobs:
      - deployment: ApprovalGate
        environment: production
        strategy:
          runOnce:
            deploy:
              steps:
                - script: echo "Approval granted. Running production chaos..."
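                - task: AzureCLI@2
                  displayName: Run production chaos experiment
                  inputs:
                    azureSubscription: chaos-service-connection
                    scriptType: bash
                    scriptLocation: inlineScript
                    inlineScript: |
                      az chaos experiment start \
                        --name production-chaos \
                        --resource-group chaos-rg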

Learning Path

flowchart LR
  A[Azure Chaos Studio Basics] --> B[Azure Chaos Pipeline]
  B --> C[Kubernetes Chaos Testing]
  C --> D[Network Chaos Testing]
  D --> E[Chaos Observability]
  style B fill:#f90,color:#fff

Common Errors

  1. Managed identity not assigned to the experiment: Chaos Studio experiments need a system-assigned or user-assigned managed identity with permissions on the target resources.
  2. ARM template deployment fails due to missing provider registration: Register Microsoft.Chaos in every subscription before deploying ARM templates.
  3. Pipeline timeout shorter than experiment fault duration: Increase the Azure DevOps job timeout to exceed the experiment duration plus a buffer.
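  4. Chaos Studio selector references invalid target IDs: A target is an extension resource, so its ID is the full ID of the parent resource followed by /providers/Microsoft.Chaos/targets/<target-name>. Verify the complete ID in the selector configuration.
  5. Diagnostic settings not capturing experiment logs: Check that the diagnostic setting points at the experiment's full resource ID and enables a log category the resource actually emits. The Log Analytics workspace does not need to be in the same region as the experiment.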
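
For error 3 in the list above, the job timeout can be derived from the fault durations instead of guessed. The sketch below parses time-only ISO 8601 durations such as `PT60S` and computes a value suitable for a job's `timeoutInMinutes` setting. The function names and the 300-second buffer are assumptions, and the parser deliberately handles only the `PT…H…M…S` form.

```python
import math
import re

# Matches time-only ISO 8601 durations: PT[nH][nM][nS]
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def duration_seconds(iso: str) -> int:
    """Convert a duration such as PT60S, PT10M, or PT1H30M to seconds."""
    match = _DURATION.match(iso)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {iso!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def job_timeout_minutes(fault_durations: list[str], buffer_s: int = 300) -> int:
    """Sum sequential fault durations, add a buffer, and round up to minutes."""
    total = sum(duration_seconds(d) for d in fault_durations) + buffer_s
    return math.ceil(total / 60)


if __name__ == "__main__":
    print(job_timeout_minutes(["PT60S"]))           # 6
    print(job_timeout_minutes(["PT10M", "PT90S"]))  # 17
```

Summing the durations assumes the faults run one after another. For faults in parallel branches, the longest branch would be the relevant figure instead.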

Practice Questions

  1. How do you deploy Chaos Studio experiments as infrastructure as code?
  2. What Azure DevOps task runs Chaos Studio experiments in a pipeline?
  3. How do you configure automated safety guards using Azure Monitor?
  4. What is the purpose of diagnostic settings for Chaos Studio?
  5. How do you add manual approval gates for production experiments?

Challenge

Create a complete Azure DevOps pipeline with three stages: deploy an application to AKS, run a Chaos Studio pod kill experiment against the deployment, and promote or roll back based on the experiment result. Use an ARM template to deploy the Chaos Studio resources, Azure Monitor alerts as stop conditions, and a manual approval gate for the production environment.

FAQ

How do I integrate Chaos Studio with Azure DevOps?

Use the Azure CLI task in Azure DevOps to run az chaos experiment start and then poll the experiment status. Integrate the result into the pipeline gate logic.

What is the benefit of deploying Chaos Studio as ARM templates?

ARM templates provide version control, repeatability, and audit trail for chaos experiments. They can be reviewed in pull requests and deployed consistently across environments.

How do Azure Monitor alerts stop chaos experiments automatically?

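The experiment definition used in this tutorial has no stop-condition field, so the alert does not stop the experiment by itself. Attach an action group to the alert that runs automation, such as a runbook, Logic App, or pipeline job, which cancels the running experiment. Cancelling stops all active faults, and the run does not end in a successful state, so the pipeline gate fails.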

What is a Chaos Studio selector?

A selector defines which resources the experiment targets. List selectors target specific target resource IDs. Query selectors target the resources returned by an Azure Resource Graph query, which can filter on properties such as tags or names.
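
As a rough illustration of how a List selector fits together, the sketch below builds a target ID, wraps it in a selector, and checks that every action refers to a selector that exists. The helper names are hypothetical, the subscription ID is a placeholder, and the `ChaosTarget` type value and `selectorId` field reflect the Chaos Studio experiment schema as understood here; verify them against the API version you deploy.

```python
def target_id(parent_resource_id: str, target_name: str) -> str:
    """Build a target ID: parent resource ID plus the Microsoft.Chaos suffix."""
    parent = parent_resource_id.rstrip("/")
    return f"{parent}/providers/Microsoft.Chaos/targets/{target_name}"


def list_selector(selector_id: str, target_ids: list[str]) -> dict:
    """Build a List selector that names each target explicitly."""
    return {
        "id": selector_id,
        "type": "List",
        "targets": [{"type": "ChaosTarget", "id": tid} for tid in target_ids],
    }


def undefined_selectors(experiment_properties: dict) -> set[str]:
    """Return selectorId values used by actions but missing from `selectors`."""
    defined = {s["id"] for s in experiment_properties.get("selectors", [])}
    used = {
        action["selectorId"]
        for step in experiment_properties.get("steps", [])
        for branch in step.get("branches", [])
        for action in branch.get("actions", [])
        if "selectorId" in action
    }
    return used - defined


if __name__ == "__main__":
    cluster = (
        "/subscriptions/00000000-0000-0000-0000-000000000000"
        "/resourceGroups/chaos-rg/providers/Microsoft.ContainerService"
        "/managedClusters/my-aks-cluster"
    )
    tid = target_id(cluster, "Microsoft-AzureKubernetesServiceChaosMesh")
    properties = {
        "selectors": [list_selector("TargetAKS", [tid])],
        "steps": [{"branches": [{"actions": [{"selectorId": "TargetAKS"}]}]}],
    }
    print(undefined_selectors(properties))  # set()
```

Running a check like `undefined_selectors` in a pull request build catches a mistyped selector ID before the template is ever deployed.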

Can I run Chaos Studio experiments across multiple Azure regions?

Yes. Create separate experiments for each region and use Azure DevOps multi-stage pipelines to coordinate cross-region chaos testing.
