Azure Chaos Pipeline — Automated Experiments with DevOps
This tutorial shows how to run Azure Chaos Studio experiments automatically as part of a deployment pipeline, with worked examples for each step.
Building an automated Azure Chaos Engineering pipeline requires connecting Azure Chaos Studio to your Azure DevOps or GitHub Actions workflows, deploying experiments as infrastructure as code, and integrating with Azure Monitor for automated safety controls.
What You Will Learn
This tutorial teaches you how to deploy Azure Chaos Studio experiments as ARM templates, trigger experiments from Azure DevOps pipelines, configure automated safety guards with Azure Monitor, and build a continuous validation workflow.
Why It Matters
Azure Chaos Studio experiments deployed as infrastructure as code give you repeatable, auditable, and version-controlled resilience testing. An automated pipeline ensures that every infrastructure deployment includes a resilience validation step, catching configuration errors before they reach production.
Real-World Use
DodaTech includes a Chaos Studio experiment in every Azure DevOps release pipeline for the Doda Browser backend services running on AKS. The experiment kills one pod from each deployment and verifies that error rates stay below the SLO threshold before the release is approved.
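The release check described above can be sketched in a few lines of Python. This is a minimal illustration, not DodaTech's actual implementation: the `RequestWindow` shape, the function names, and the 0.5% threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestWindow:
    """Request counts observed while the fault was active (hypothetical shape)."""
    total: int
    failed: int


def error_rate(window: RequestWindow) -> float:
    """Return failed/total, treating an empty window as zero errors."""
    if window.total == 0:
        return 0.0
    return window.failed / window.total


def release_approved(window: RequestWindow, slo_error_rate: float) -> bool:
    """Approve only when the observed error rate stays below the SLO threshold."""
    return error_rate(window) < slo_error_rate


if __name__ == "__main__":
    # Illustrative numbers only; a real check would read them from monitoring.
    observed = RequestWindow(total=12_000, failed=18)
    verdict = "approved" if release_approved(observed, slo_error_rate=0.005) else "blocked"
    print(f"error rate {error_rate(observed):.4%}: release {verdict}")
```

Keeping the verdict a pure function of the measured counts makes the gate easy to unit test separately from the pipeline that feeds it.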
Prerequisites
Before starting you should understand:
- Azure Chaos Studio basics from introductory tutorials
- Chaos Engineering experiment design
- ARM template or Bicep infrastructure as code
- Azure DevOps or GitHub Actions
Step 1: Deploy Chaos Studio as Infrastructure as Code
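Define the Chaos Studio target, its pod chaos capability, and the experiment in an ARM template. The template assumes the AKS cluster is in the resource group you deploy to and already has Chaos Mesh installed. Fault URNs and capability versions (2.1 here) change over time, so check the Chaos Studio fault library for the current ones:
{
  "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "aksClusterName": {
      "type": "string",
      "defaultValue": "my-aks-cluster"
    },
    "targetResourceGroup": {
      "type": "string",
      "defaultValue": "chaos-rg"
    }
  },
  "variables": {
    "clusterId": "[resourceId('Microsoft.ContainerService/managedClusters', parameters('aksClusterName'))]",
    "clusterScope": "[format('Microsoft.ContainerService/managedClusters/{0}', parameters('aksClusterName'))]",
    "targetName": "Microsoft-AzureKubernetesServiceChaosMesh"
  },
  "resources": [
    {
      "type": "Microsoft.Chaos/targets",
      "apiVersion": "2023-11-01",
      "scope": "[variables('clusterScope')]",
      "name": "[variables('targetName')]",
      "location": "eastus",
      "properties": {}
    },
    {
      "type": "Microsoft.Chaos/targets/capabilities",
      "apiVersion": "2023-11-01",
      "scope": "[variables('clusterScope')]",
      "name": "[format('{0}/PodChaos-2.1', variables('targetName'))]",
      "dependsOn": [
        "[extensionResourceId(variables('clusterId'), 'Microsoft.Chaos/targets', variables('targetName'))]"
      ]
    },
    {
      "type": "Microsoft.Chaos/experiments",
      "apiVersion": "2023-11-01",
      "name": "post-deployment-chaos",
      "location": "eastus",
      "identity": {
        "type": "SystemAssigned"
      },
      "dependsOn": [
        "[extensionResourceId(variables('clusterId'), 'Microsoft.Chaos/targets/capabilities', variables('targetName'), 'PodChaos-2.1')]"
      ],
      "properties": {
        "steps": [
          {
            "name": "AKS Validation",
            "branches": [
              {
                "name": "Pod Kill Test",
                "actions": [
                  {
                    "type": "continuous",
                    "name": "urn:csci:microsoft:azureKubernetesServiceChaosMesh:podChaos/2.1",
                    "selectorId": "TargetAKS",
                    "duration": "PT60S",
                    "parameters": [
                      {
                        "key": "jsonSpec",
                        "value": "{\"action\":\"pod-kill\",\"mode\":\"one\",\"selector\":{\"namespaces\":[\"production\"],\"labelSelectors\":{\"app\":\"web-service\"}}}"
                      }
                    ]
                  }
                ]
              }
            ]
          }
        ],
        "selectors": [
          {
            "id": "TargetAKS",
            "type": "List",
            "targets": [
              {
                "type": "ChaosTarget",
                "id": "[extensionResourceId(variables('clusterId'), 'Microsoft.Chaos/targets', variables('targetName'))]"
              }
            ]
          }
        ]
      }
    }
  ],
  "outputs": {
    "experimentName": {
      "type": "string",
      "value": "post-deployment-chaos"
    }
  }
}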
az deployment group create \
  --resource-group chaos-rg \
  --template-file chaos-studio-template.json \
  --parameters aksClusterName=my-aks-cluster
# Expected output (abbreviated):
# {
#   "provisioningState": "Succeeded",
#   "outputs": {
#     "experimentName": { "type": "String", "value": "post-deployment-chaos" }
#   }
# }
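Before deploying, it can help to lint the template for a common mistake: an experiment resource with no managed identity. The sketch below is one hedged way to do that. The function name is hypothetical, it assumes `resources` is a JSON array as in the template above, and the inline sample stands in for loading `chaos-studio-template.json` from disk.

```python
import json


def experiments_missing_identity(template: dict) -> list[str]:
    """Names of Microsoft.Chaos/experiments resources that declare no managed identity."""
    missing = []
    for resource in template.get("resources", []):
        if resource.get("type") != "Microsoft.Chaos/experiments":
            continue
        # A missing identity block and an explicit "None" are treated the same.
        identity_type = (resource.get("identity") or {}).get("type", "None")
        if identity_type == "None":
            missing.append(resource.get("name", "<unnamed>"))
    return missing


if __name__ == "__main__":
    # Stand-in for: json.load(open("chaos-studio-template.json"))
    template = json.loads("""
    {
      "resources": [
        {"type": "Microsoft.Chaos/experiments", "name": "post-deployment-chaos",
         "identity": {"type": "SystemAssigned"}},
        {"type": "Microsoft.Chaos/experiments", "name": "no-identity-chaos"}
      ]
    }
    """)
    print("experiments missing an identity:", experiments_missing_identity(template))
```

A check like this could run as an early pipeline step so that the failure shows up before `az deployment group create` is called.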
Step 2: Trigger Experiments from Azure DevOps Pipeline
Add a chaos step to your Azure DevOps pipeline:
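# azure-pipelines-chaos.yml
trigger:
  - main
pool:
  vmImage: ubuntu-latest
variables:
  resourceGroup: chaos-rg
  aksClusterName: my-aks-cluster
  experimentName: post-deployment-chaos
stages:
  - stage: Deploy
    jobs:
      - job: DeployApp
        steps:
          - task: AzureCLI@2
            displayName: Deploy application
            inputs:
              azureSubscription: chaos-service-connection
              scriptType: bash
              scriptLocation: inlineScript
              inlineScript: |
                echo "Deploying application to AKS..."
                # kubectl needs cluster credentials on a fresh hosted agent.
                az aks get-credentials \
                  --resource-group $(resourceGroup) \
                  --name $(aksClusterName) \
                  --overwrite-existing
                kubectl apply -f k8s-manifests/
  - stage: ChaosValidation
    dependsOn: Deploy
    jobs:
      - job: RunChaos
        steps:
          - task: AzureCLI@2
            displayName: Run Chaos Studio experiment
            inputs:
              azureSubscription: chaos-service-connection
              scriptType: bash
              scriptLocation: inlineScript
              inlineScript: |
                set -euo pipefail
                # az rest calls the Chaos Studio REST API directly, so no CLI extension is needed.
                subscriptionId=$(az account show --query id -o tsv)
                experimentUrl="https://management.azure.com/subscriptions/${subscriptionId}/resourceGroups/$(resourceGroup)/providers/Microsoft.Chaos/experiments/$(experimentName)"
                echo "Starting chaos experiment..."
                az rest --method post --url "${experimentUrl}/start?api-version=2023-11-01"
                echo "Waiting for experiment to complete..."
                sleep 90
                # Read the status of the most recently started execution.
                status=$(az rest --method get \
                  --url "${experimentUrl}/executions?api-version=2023-11-01" \
                  --query "sort_by(value, &properties.startedAt)[-1].properties.status" -o tsv)
                echo "Experiment status: ${status}"
                # Fail the stage unless the execution succeeded; the comparison is
                # case-insensitive because the exact casing of the status is an assumption.
                if [ "${status,,}" != "success" ]; then
                  echo "Chaos validation failed."
                  exit 1
                fi
  - stage: PromoteOrRollback
    dependsOn: ChaosValidation
    condition: succeeded()
    jobs:
      - job: Promote
        steps:
          - script: echo "Chaos validation passed. Promoting to production."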
Expected output in Azure DevOps:
Deploy application: Completed
Run Chaos Studio experiment: Started
Waiting for experiment to complete...
Experiment status: Succeeded
Chaos validation passed. Promoting to production.
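The fixed `sleep 90` in the pipeline is fragile: the step either waits too long or reads the status before the experiment has finished. A polling loop is a common alternative. The sketch below is a hedged illustration, not part of the Chaos Studio tooling: `wait_for_experiment` and its parameters are hypothetical, the set of terminal status names is an assumption, and `fetch_status` stands for whatever call returns the current execution status.

```python
import time
from typing import Callable

# Assumed terminal states, compared in lowercase.
TERMINAL_STATUSES = {"success", "failed", "cancelled"}


def wait_for_experiment(
    fetch_status: Callable[[], str],
    timeout_seconds: float = 600.0,
    poll_seconds: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll fetch_status until it reports a terminal state or the timeout passes."""
    deadline = clock() + timeout_seconds
    while True:
        status = fetch_status().strip().lower()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"experiment still '{status}' after {timeout_seconds}s")
        sleep(poll_seconds)


if __name__ == "__main__":
    # Simulated status sequence; a real caller would query the executions endpoint.
    simulated = iter(["Running", "Running", "Success"])
    final = wait_for_experiment(lambda: next(simulated), sleep=lambda _: None)
    print("final status:", final)
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting, which matters when the same helper gates a release.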
Step 3: Configure Automated Safety Guards
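Send experiment telemetry to Azure Monitor so that alert rules can act as safety guards. The commands below create the Log Analytics workspace and the diagnostic settings that such alerts read from; an alert rule on this data can then trigger cancellation of a running experiment: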
# Create a Log Analytics workspace for experiment telemetry
az monitor log-analytics workspace create \
  --resource-group chaos-rg \
  --workspace-name chaos-telemetry
# Expected output (abbreviated):
# {
#   "provisioningState": "Succeeded",
#   "customerId": "ws-abc-123"
# }
# customerId is the workspace ID used by log queries; real values are GUIDs.

# Configure diagnostic settings for the Chaos Studio experiment.
# The category names below are taken as given; list the ones your experiment exposes with:
#   az monitor diagnostic-settings categories list --resource <experiment-resource-id>
az monitor diagnostic-settings create \
  --name chaos-logs \
  --resource "/subscriptions/.../chaos/experiments/post-deployment-chaos" \
  --workspace chaos-telemetry \
  --logs '[{"category": "ChaosExperiment", "enabled": true}]' \
  --metrics '[{"category": "AllMetrics", "enabled": true}]'
# Expected output (abbreviated):
# {
#   "name": "chaos-logs",
#   "provisioningState": "Succeeded"
# }
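#!/usr/bin/env python3
"""Query Chaos Studio experiment telemetry from Azure Monitor."""
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient, LogsQueryStatus

credential = DefaultAzureCredential()
client = LogsQueryClient(credential)

# Workspace (customer) ID of the Log Analytics workspace; a GUID in practice.
WORKSPACE_ID = "ws-abc-123"

# Table and column names follow this tutorial's example schema; adjust them
# to the table your diagnostic settings actually populate.
QUERY = """
ChaosExperiment
| where TimeGenerated > ago(1h)
| project TimeGenerated, ExperimentName, Status, FaultName
| order by TimeGenerated desc
"""

# timespan is a required keyword argument of query_workspace.
response = client.query_workspace(WORKSPACE_ID, QUERY, timespan=timedelta(hours=1))

# A partial result exposes its rows through partial_data instead of tables.
tables = response.tables if response.status == LogsQueryStatus.SUCCESS else response.partial_data
for row in tables[0].rows:
    print(f"Time: {row[0]}, Experiment: {row[1]}, Status: {row[2]}, Fault: {row[3]}")
# Expected output:
# Time: 2026-06-23T10:00:00Z, Experiment: post-deployment-chaos, Status: Succeeded, Fault: aks-pod-kill
# Time: 2026-06-23T09:00:00Z, Experiment: post-deployment-chaos, Status: Succeeded, Fault: aks-pod-kill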
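Telemetry only becomes a safety guard once something decides when to stop the experiment. One possible rule, sketched below, is to abort after several consecutive error-rate samples exceed a threshold, which avoids reacting to a single noisy reading. The function name, the 5% threshold, and the sample values are assumptions for illustration; wiring the decision to an actual cancel call is left out.

```python
from collections.abc import Iterable


def should_abort(error_rates: Iterable[float], threshold: float, consecutive: int = 3) -> bool:
    """Return True once `consecutive` samples in a row exceed `threshold`."""
    if consecutive < 1:
        raise ValueError("consecutive must be at least 1")
    streak = 0
    for rate in error_rates:
        # A sample at or below the threshold resets the streak.
        streak = streak + 1 if rate > threshold else 0
        if streak >= consecutive:
            return True
    return False


if __name__ == "__main__":
    # Illustrative samples, e.g. one error-rate reading per polling interval.
    samples = [0.01, 0.06, 0.07, 0.08]
    print("abort experiment:", should_abort(samples, threshold=0.05))
```

Requiring a streak trades a slightly slower reaction for fewer false aborts; a stricter guard would set `consecutive=1`.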
Step 4: Add Approval Gates for Production Experiments
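Define a release pipeline with manual approval for production chaos. The approval itself is configured as a check on the production environment in Azure DevOps; the deployment job pauses until it is granted:
# release-chaos-approval.yaml
stages:
  - stage: StagingChaos
    jobs:
      - job: RunStagingChaos
        steps:
          - script: echo "Running chaos in staging..."
          - task: AzureCLI@2
            displayName: Start staging experiment
            inputs:
              azureSubscription: chaos-service-connection
              scriptType: bash
              scriptLocation: inlineScript
              inlineScript: |
                # az rest calls the Chaos Studio REST API directly.
                subscriptionId=$(az account show --query id -o tsv)
                az rest --method post --url \
                  "https://management.azure.com/subscriptions/${subscriptionId}/resourceGroups/chaos-rg/providers/Microsoft.Chaos/experiments/staging-chaos/start?api-version=2023-11-01"
  - stage: ProductionApproval
    dependsOn: StagingChaos
    condition: succeeded()
    jobs:
      - deployment: ApprovalGate
        environment: production
        strategy:
          runOnce:
            deploy:
              steps:
                - script: echo "Approval granted. Running production chaos..."
                - task: AzureCLI@2
                  displayName: Start production experiment
                  inputs:
                    azureSubscription: chaos-service-connection
                    scriptType: bash
                    scriptLocation: inlineScript
                    inlineScript: |
                      subscriptionId=$(az account show --query id -o tsv)
                      az rest --method post --url \
                        "https://management.azure.com/subscriptions/${subscriptionId}/resourceGroups/chaos-rg/providers/Microsoft.Chaos/experiments/production-chaos/start?api-version=2023-11-01"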
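The gating logic the two stages implement can be written down as a small decision function, which is handy when reasoning about edge cases such as a pending approval. This is only a model of the flow above: the names are hypothetical, and Azure DevOps evaluates the real conditions itself.

```python
from enum import Enum
from typing import Optional


class GateDecision(Enum):
    RUN_PRODUCTION = "run production experiment"
    AWAIT_APPROVAL = "wait for approval"
    STOP = "stop pipeline"


def gate_decision(staging_succeeded: bool, approved: Optional[bool]) -> GateDecision:
    """Model the approval gate; `approved` is None while the approval is pending."""
    if not staging_succeeded:
        # A failed staging experiment never reaches the approval step.
        return GateDecision.STOP
    if approved is None:
        return GateDecision.AWAIT_APPROVAL
    return GateDecision.RUN_PRODUCTION if approved else GateDecision.STOP


if __name__ == "__main__":
    for staging, approval in [(True, None), (True, True), (True, False), (False, True)]:
        print(staging, approval, "->", gate_decision(staging, approval).value)
```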
Learning Path
flowchart LR
    A[Azure Chaos Studio Basics] --> B[Azure Chaos Pipeline]
    B --> C[Kubernetes Chaos Testing]
    C --> D[Network Chaos Testing]
    D --> E[Chaos Observability]
    style B fill:#f90,color:#fff
Common Errors
- Managed identity not assigned to the experiment: Chaos Studio experiments need a system-assigned or user-assigned managed identity with permissions on the target resources.
- ARM template deployment fails due to missing provider registration: Register Microsoft.Chaos in every subscription before deploying ARM templates.
- Pipeline timeout shorter than experiment fault duration: Increase the Azure DevOps job timeout to exceed the experiment duration plus a buffer.
- Chaos Studio selector references invalid target IDs: A target ID is the full resource ID of the target, including the subscription, resource group, and parent resource. Verify that the ID in the selector matches the deployed target exactly.
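- Diagnostic settings not capturing experiment logs: Confirm that the diagnostic setting points at the experiment's full resource ID and enables a log category the experiment actually emits. The Log Analytics workspace does not have to be in the same region as the experiment.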
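The timeout error in the list above comes down to simple arithmetic: the job timeout must cover the fault durations plus a buffer. The sketch below computes a timeout in minutes from ISO 8601 durations such as the `PT60S` used in this tutorial. It is a minimal illustration under stated assumptions: the helper names and the 300-second default buffer are hypothetical, only hour, minute, and second components are parsed, and the actions are assumed to run one after another.

```python
import math
import re

# Matches time-only ISO 8601 durations such as PT60S, PT10M, or PT1H30M.
_DURATION = re.compile(r"^PT(?:(?P<h>\d+)H)?(?:(?P<m>\d+)M)?(?:(?P<s>\d+)S)?$")


def duration_seconds(iso_duration: str) -> int:
    """Convert a time-only ISO 8601 duration to seconds."""
    match = _DURATION.match(iso_duration)
    if not match or not any(match.groupdict().values()):
        raise ValueError(f"unsupported duration: {iso_duration!r}")
    hours = int(match["h"] or 0)
    minutes = int(match["m"] or 0)
    seconds = int(match["s"] or 0)
    return hours * 3600 + minutes * 60 + seconds


def job_timeout_minutes(action_durations: list[str], buffer_seconds: int = 300) -> int:
    """Sum sequential action durations, add a buffer, and round up to whole minutes."""
    total = sum(duration_seconds(d) for d in action_durations) + buffer_seconds
    return math.ceil(total / 60)


if __name__ == "__main__":
    print("timeoutInMinutes:", job_timeout_minutes(["PT60S"]))
```

The result can be used as the job's `timeoutInMinutes` value; for experiments with parallel branches, the longest branch rather than the sum would be the right input.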
Practice Questions
- How do you deploy Chaos Studio experiments as infrastructure as code?
- What Azure DevOps task runs Chaos Studio experiments in a pipeline?
- How do you configure automated safety guards using Azure Monitor?
- What is the purpose of diagnostic settings for Chaos Studio?
- How do you add manual approval gates for production experiments?
Challenge
Create a complete Azure DevOps pipeline with three stages: deploy an application to AKS, run a Chaos Studio pod kill experiment against the deployment, and promote or roll back based on the experiment result. Use an ARM template to deploy the Chaos Studio resources, Azure Monitor alerts as stop conditions, and a manual approval gate for the production environment.