Azure Chaos Studio — Chaos Experiments on Azure
This tutorial introduces Azure Chaos Studio: its key concepts, a worked AKS pod-kill example, and practices for running experiments safely.
Azure Chaos Studio is a managed Chaos Engineering service that enables you to run controlled Fault Injection experiments on Azure resources. It integrates natively with Azure Role-Based Access Control, Azure Monitor, and resource tagging.
What You Will Learn
This tutorial teaches you how to set up Azure Chaos Studio, enable targets, create experiments with agent-based and agentless faults, and use safety guards to protect your workloads.
Why It Matters
Azure Chaos Studio provides a first-party Chaos Engineering experience for organizations already invested in the Azure ecosystem. It supports both agentless faults, which act on a resource through Azure Resource Manager APIs and which Microsoft's documentation calls service-direct faults, and agent-based faults, which run inside a VM through an installed Chaos Agent.
Real-World Use
DodaTech runs Azure Chaos Studio experiments on Azure Kubernetes Service clusters that host parts of the Doda Browser backend. Experiments verify that AKS node failures do not cause data loss or extended downtime.
Prerequisites
Before starting you should understand:
- Azure portal navigation and resource group concepts
- Chaos Engineering fundamentals (hypothesis, Steady State, Blast Radius)
- Basic knowledge of Kubernetes and virtual machines
Step 1: Enable Azure Chaos Studio
Register the Chaos Studio resource provider and enable it in your subscription:
# Register the Chaos Studio resource provider
az provider register --namespace Microsoft.Chaos
# Verify registration
az provider show --namespace Microsoft.Chaos --query registrationState
# Expected output:
# "Registered"
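Because the provider has to be registered in every subscription that runs experiments, it can help to check several subscriptions at once. The following is a minimal Python sketch, assuming the Azure CLI is installed and signed in; the function names and the injectable `run` parameter are illustrative choices, not part of any Azure SDK.

```python
import subprocess


def registration_state(subscription_id, run=subprocess.run):
    """Return the Microsoft.Chaos registration state for one subscription."""
    result = run(
        [
            "az", "provider", "show",
            "--namespace", "Microsoft.Chaos",
            "--subscription", subscription_id,
            "--query", "registrationState",
            "--output", "tsv",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def unregistered(subscription_ids, run=subprocess.run):
    """List the subscriptions where the provider is not yet 'Registered'."""
    return [
        sub for sub in subscription_ids
        if registration_state(sub, run) != "Registered"
    ]
```

Passing `run` as a parameter keeps the check testable without touching Azure: a stub that returns canned output can stand in for the real CLI call.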
Step 2: Enable Targets
Before running experiments you must enable targets for the resources you want to test:
# Enable an Azure Kubernetes Service target
az chaos target create \
--name "Microsoft-AKS" \
--resource-group my-resource-group \
--resource-name my-aks-cluster \
--resource-type "Microsoft.ContainerService/managedClusters"
# Expected output:
# {
# "id": "/subscriptions/.../targets/Microsoft-AKS",
# "type": "Microsoft.Chaos/targets",
# "properties": {
# "agentProfile": null
# }
# }
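The target ID in the output above is elided. A chaos target is a child resource of the resource it wraps, so its ID is the parent resource ID followed by a `providers/Microsoft.Chaos/targets/<name>` suffix. The sketch below composes such an ID in Python; the function name is hypothetical and the subscription ID is a placeholder, not a real value.

```python
def chaos_target_id(subscription_id, resource_group, resource_type,
                    resource_name, target_name):
    """Compose the ARM ID of a chaos target nested under its parent resource."""
    parent = (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/{resource_type}/{resource_name}"
    )
    return f"{parent}/providers/Microsoft.Chaos/targets/{target_name}"


if __name__ == "__main__":
    # Placeholder subscription ID; the other values come from this tutorial.
    print(chaos_target_id(
        "00000000-0000-0000-0000-000000000000",
        "my-resource-group",
        "Microsoft.ContainerService/managedClusters",
        "my-aks-cluster",
        "Microsoft-AKS",
    ))
```

A List selector needs this full ID, so generating it from its parts avoids copy-and-paste mistakes.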
Step 3: Create an Experiment
Define the experiment with fault actions and safety guards:
{
"location": "eastus",
"identity": {
"type": "SystemAssigned"
},
"properties": {
"steps": [
{
"name": "AKS Pod Kill",
"branches": [
{
"name": "Kill Pods",
"actions": [
{
"type": "continuous",
"name": "aks-pod-kill",
"faultId": "Fault_AzureKubernetesService_PodKill",
"parameters": [
{
"key": "namespaceField",
"value": "default"
},
{
"key": "labelSelector",
"value": "app=web-service"
}
],
"duration": "PT60S"
}
]
}
]
}
],
"selectors": [
{
"id": "TargetA",
"type": "List",
"targets": [
{
"id": "/subscriptions/.../targets/Microsoft-AKS",
"type": "Microsoft.Chaos/targets"
}
]
}
]
}
}
Save this as aks-experiment.json and create the experiment:
az chaos experiment create \
--name "AKS-Pod-Kill-Experiment" \
--resource-group my-resource-group \
--experiment-file aks-experiment.json
# Expected output:
# {
# "name": "AKS-Pod-Kill-Experiment",
# "provisioningState": "Succeeded"
# }
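A single unbalanced quote or bracket makes the experiment file invalid, so a quick local check before submitting can save a round trip. The following Python sketch parses the file and runs a few structural checks based only on the fields used in this tutorial. It is not a full schema validator, and the function name is hypothetical.

```python
import json


def check_experiment(text):
    """Return a list of human-readable problems found in an experiment definition."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}"]
    if not isinstance(doc, dict):
        return ["top level must be a JSON object"]

    problems = []
    props = doc.get("properties", {})
    if doc.get("identity", {}).get("type") is None:
        problems.append("experiment has no managed identity")
    for selector in props.get("selectors", []):
        if not selector.get("targets"):
            problems.append(f"selector {selector.get('id')!r} lists no targets")
    for step in props.get("steps", []):
        for branch in step.get("branches", []):
            if not branch.get("actions"):
                problems.append(f"branch {branch.get('name')!r} has no actions")
    if not props.get("steps"):
        problems.append("experiment defines no steps")
    return problems


if __name__ == "__main__":
    # A string value with a missing closing quote is reported as invalid JSON.
    print(check_experiment('{"value": "default]}'))
```

To check a saved file, read it and pass the contents in, for example `check_experiment(open("aks-experiment.json").read())`; an empty list means none of these checks failed.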
Step 4: Set Up Safety Guards
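Configure Azure Monitor alerts that can be used to stop experiments automatically. The command below creates only the alert rule; stopping the experiment in response requires an action attached to the alert, which is not shown here:
# Create a metric alert on request count (swap in a failure metric to track error rate)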
az monitor metrics alert create \
--name "Chaos-Error-Rate-Alert" \
--resource-group my-resource-group \
--scopes "/subscriptions/..." \
--condition "count 'requests' > 10" \
--window-size 5m \
--evaluation-frequency 1m
# Expected output:
# {
# "name": "Chaos-Error-Rate-Alert",
# "enabled": true
# }
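The `--window-size 5m` and `--evaluation-frequency 1m` settings mean the rule is re-checked every minute against the trailing five minutes of data. The Python sketch below is a simplified local model of that interaction, assuming one sample per minute and a plain sum over the window. It does not reproduce how Azure Monitor computes aggregations internally, and the function name is hypothetical.

```python
from collections import deque


def evaluate_alert(per_minute_counts, threshold=10, window_minutes=5):
    """Return (minute_index, fired) pairs, one per one-minute evaluation."""
    window = deque(maxlen=window_minutes)  # keeps only the trailing window
    results = []
    for minute, count in enumerate(per_minute_counts):
        window.append(count)
        results.append((minute, sum(window) > threshold))
    return results


if __name__ == "__main__":
    samples = [1, 2, 3, 4, 5, 0, 0, 0, 0, 0]
    fired = [minute for minute, hit in evaluate_alert(samples) if hit]
    print(fired)
```

In this model the alert keeps firing for a few minutes after the spike ends, because the earlier samples are still inside the window. That lag is worth remembering when choosing a window size for a safety guard.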
Step 5: Start and Monitor the Experiment
Run the experiment and track its progress:
az chaos experiment start \
--name "AKS-Pod-Kill-Experiment" \
--resource-group my-resource-group
# Expected output:
# {
# "name": "AKS-Pod-Kill-Experiment",
# "status": "Running"
# }
# Check experiment status
az chaos experiment list-execution-details \
--name "AKS-Pod-Kill-Experiment" \
--resource-group my-resource-group
# Expected output:
# [
# {
# "status": "Succeeded",
# "faultIds": ["Fault_AzureKubernetesService_PodKill"]
# }
# ]
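An experiment moves from "Running" to a final state over time, so the status check usually has to be repeated. The following Python sketch shows a polling loop, assuming a caller-supplied `fetch_status` function that returns the current status string, for example by wrapping the status command above. The set of terminal states is an assumption beyond the "Succeeded" value shown in this tutorial.

```python
import time

# Assumed terminal states; only "Succeeded" appears in this tutorial's output.
TERMINAL = {"Succeeded", "Failed", "Cancelled"}


def wait_for_experiment(fetch_status, interval_seconds=30, max_polls=40,
                        sleep=time.sleep):
    """Poll fetch_status() until a terminal state is seen or polls run out.

    Returns (final_status, distinct_statuses_in_order).
    """
    seen = []
    for _ in range(max_polls):
        status = fetch_status()
        if not seen or seen[-1] != status:
            seen.append(status)  # record each status change once
        if status in TERMINAL:
            return status, seen
        sleep(interval_seconds)
    last = seen[-1] if seen else "unknown"
    raise TimeoutError(f"experiment still {last} after {max_polls} polls")
```

Bounding the number of polls matters in chaos testing: an experiment that never reaches a terminal state is itself a finding and should surface as an error rather than hang a pipeline.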
Learning Path
flowchart LR
A[AWS Fault Injection] --> B[Azure Chaos Studio]
B --> C[Latency Injection]
C --> D[Fault Injection Proxy]
D --> E[Dependency Testing]
style B fill:#f90,color:#fff
Common Errors
- Resource provider not registered: The Microsoft.Chaos provider must be registered in each subscription where you run experiments.
- Targets not enabled before experiment creation: You must enable each resource as a chaos target before it can be used in an experiment.
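- Missing system-assigned identity on the experiment: Azure Chaos Studio requires a managed identity to execute faults on resources, and that identity must be granted a role with sufficient permissions on each target resource.
- Incorrect selector type for the target: List selectors need the exact target resource ID. Query selectors choose targets dynamically with a query, for example by tag or name.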
- Fault duration exceeding the resource timeout: Some faults have maximum durations. Check fault documentation before setting long durations.
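Fault durations such as `PT60S` are ISO 8601 durations. To compare one against a documented maximum, it first has to be converted to seconds. The Python sketch below handles only the time-only form (hours, minutes, seconds). The limit passed to `exceeds_limit` is a value you would look up in the fault documentation, and both function names are hypothetical.

```python
import re

# Time-only ISO 8601 durations: PT<h>H<m>M<s>S with each part optional.
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def duration_seconds(value):
    """Convert a duration such as 'PT60S' or 'PT1H30M' to seconds."""
    match = _DURATION.match(value)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {value!r}")
    hours, minutes, seconds = (int(part or 0) for part in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def exceeds_limit(value, max_seconds):
    """True when the duration is longer than the given limit in seconds."""
    return duration_seconds(value) > max_seconds


if __name__ == "__main__":
    print(duration_seconds("PT60S"))  # 60
```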
Practice Questions
- What is the difference between agent-based and agentless faults in Azure Chaos Studio?
- How do you enable a target for chaos experiments in Azure?
- What role do Azure Monitor alerts play in Chaos Experiment safety?
- How do you define a multi-step experiment in Azure Chaos Studio?
- What identity configuration is required for Azure Chaos Studio?
Challenge
Create an Azure Chaos Studio experiment that shuts down a single virtual machine in a scale set for 120 seconds. Configure an Azure Monitor alert on the scale set's running instance count as a safety guard. Start the experiment and verify the auto-recovery behavior.
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro