Azure Chaos Studio — Chaos Experiments on Azure

DodaTech Updated 2026-06-21 5 min read

This tutorial introduces Azure Chaos Studio: the core concepts, worked CLI examples, and practices for running experiments safely.

Azure Chaos Studio is a managed Chaos Engineering service that enables you to run controlled Fault Injection experiments on Azure resources. It integrates natively with Azure Role-Based Access Control, Azure Monitor, and resource tagging.

What You Will Learn

This tutorial teaches you how to set up Azure Chaos Studio, enable targets, create experiments with agent-based and agentless faults, and use safety guards to protect your workloads.

Why It Matters

Azure Chaos Studio provides a first-party Chaos Engineering experience for organizations already invested in the Azure ecosystem. It supports both agentless faults, which Microsoft's documentation calls service-direct faults and which act on a resource through Azure Resource Manager APIs, and agent-based faults, which run through a Chaos Agent installed on the VM.

Real-World Use

DodaTech runs Azure Chaos Studio experiments on Azure Kubernetes Service clusters that host parts of the Doda Browser backend. Experiments verify that AKS node failures do not cause data loss or extended downtime.

Prerequisites

Before starting you should understand:

  • Azure portal navigation and resource group concepts
  • Chaos Engineering fundamentals (hypothesis, Steady State, Blast Radius)
  • Basic knowledge of Kubernetes and virtual machines

Step 1: Enable Azure Chaos Studio

Register the Chaos Studio resource provider in your subscription. Registration is asynchronous, so the state can read "Registering" for a while before it becomes "Registered":

# Register the Chaos Studio resource provider
az provider register --namespace Microsoft.Chaos

# Verify registration
az provider show --namespace Microsoft.Chaos --query registrationState
# Expected output:
# "Registered"
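
Because registration completes in the background, a script that goes straight on to the next step can fail intermittently. The following is a minimal polling sketch, assuming a POSIX-style shell with the Azure CLI on the PATH. The function name and the MAX_ATTEMPTS and SLEEP_SECONDS variables are illustrative choices, not part of the Azure CLI.

```shell
# Poll the provider state until it reports "Registered".
# MAX_ATTEMPTS and SLEEP_SECONDS are illustrative knobs, not Azure defaults.
wait_for_chaos_provider() {
  local max_attempts="${MAX_ATTEMPTS:-30}"
  local sleep_seconds="${SLEEP_SECONDS:-10}"
  local attempt=1
  local state
  while [ "$attempt" -le "$max_attempts" ]; do
    state="$(az provider show --namespace Microsoft.Chaos \
      --query registrationState --output tsv)"
    if [ "$state" = "Registered" ]; then
      echo "Microsoft.Chaos is registered"
      return 0
    fi
    echo "Attempt $attempt: state is '$state', retrying" >&2
    sleep "$sleep_seconds"
    attempt=$((attempt + 1))
  done
  echo "Gave up after $max_attempts attempts" >&2
  return 1
}

# Usage (after running the register command above):
#   wait_for_chaos_provider || exit 1
```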

Step 2: Enable Targets

Before running experiments, you must enable a target on each resource you want to test:

# Enable an Azure Kubernetes Service target
az chaos target create \
  --name "Microsoft-AKS" \
  --resource-group my-resource-group \
  --resource-name my-aks-cluster \
  --resource-type "Microsoft.ContainerService/managedClusters"

# Expected output:
# {
#   "id": "/subscriptions/.../targets/Microsoft-AKS",
#   "type": "Microsoft.Chaos/targets",
#   "properties": {
#     "agentProfile": null
#   }
# }
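
A chaos target is an extension resource, so its ID is the parent resource's ID followed by /providers/Microsoft.Chaos/targets/ and the target name. The sketch below composes that ID for an AKS cluster so it can be reused in the experiment definition. chaos_target_id is a hypothetical helper, and the subscription ID in the usage line is a placeholder.

```shell
# Compose the ARM ID of a chaos target attached to an AKS cluster.
# Arguments: subscription ID, resource group, cluster name, target name.
chaos_target_id() {
  printf '/subscriptions/%s/resourceGroups/%s/providers/Microsoft.ContainerService/managedClusters/%s/providers/Microsoft.Chaos/targets/%s\n' \
    "$1" "$2" "$3" "$4"
}

# Usage (placeholder subscription ID, names taken from the command above):
#   chaos_target_id "<subscription-id>" my-resource-group my-aks-cluster Microsoft-AKS
```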

Step 3: Create an Experiment

Define the experiment's steps, branches, fault actions, and target selectors:

{
  "location": "eastus",
  "identity": {
    "type": "SystemAssigned"
  },
  "properties": {
    "steps": [
      {
        "name": "AKS Pod Kill",
        "branches": [
          {
            "name": "Kill Pods",
            "actions": [
              {
                "type": "continuous",
                "name": "aks-pod-kill",
                "faultId": "Fault_AzureKubernetesService_PodKill",
                "selectorId": "TargetA",
                "parameters": [
                  {
                    "key": "namespaceField",
                    "value": "default"
                  },
                  {
                    "key": "labelSelector",
                    "value": "app=web-service"
                  }
                ],
                "duration": "PT60S"
              }
            ]
          }
        ]
      }
    ],
    "selectors": [
      {
        "id": "TargetA",
        "type": "List",
        "targets": [
          {
            "id": "/subscriptions/.../targets/Microsoft-AKS",
            "type": "ChaosTarget"
          }
        ]
      }
    ]
  }
}

Save this as aks-experiment.json and create the experiment:

az chaos experiment create \
  --name "AKS-Pod-Kill-Experiment" \
  --resource-group my-resource-group \
  --experiment-file aks-experiment.json

# Expected output:
# {
#   "name": "AKS-Pod-Kill-Experiment",
#   "provisioningState": "Succeeded"
# }
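
A single missing quote or bracket in aks-experiment.json only shows up when the create call fails, so a local syntax check before calling Azure can save a round trip. The following is a minimal sketch that assumes a python3 interpreter is available and uses its bundled json.tool module. validate_json_file is a hypothetical helper name. It checks JSON syntax only, not the Chaos Studio experiment schema.

```shell
# Return 0 if the file parses as JSON, non-zero otherwise.
# Relies on Python's standard json.tool module; no extra packages assumed.
validate_json_file() {
  if python3 -m json.tool "$1" > /dev/null 2>&1; then
    echo "OK: $1 is valid JSON"
  else
    echo "ERROR: $1 is not valid JSON" >&2
    return 1
  fi
}

# Usage:
#   validate_json_file aks-experiment.json || exit 1
```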

Step 4: Set Up Safety Guards

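Configure an Azure Monitor alert to act as the stop signal for an experiment. The rule below only defines the alert condition. To stop an experiment when it fires, attach an action group (the --action parameter) that triggers automation which cancels the experiment: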

# Create a metric alert (the condition below counts requests; substitute the error metric your service emits)
az monitor metrics alert create \
  --name "Chaos-Error-Rate-Alert" \
  --resource-group my-resource-group \
  --scopes "/subscriptions/..." \
  --condition "count 'requests' > 10" \
  --window-size 5m \
  --evaluation-frequency 1m

# Expected output:
# {
#   "name": "Chaos-Error-Rate-Alert",
#   "enabled": true
# }
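
An alert rule on its own only changes state. Something still has to call the experiment's cancel operation. The sketch below shows that call through az rest, which is the step an action group's automation (or an operator) would run. cancel_experiment and CHAOS_API_VERSION are hypothetical names, and the default api-version value is an assumption to check against the current Microsoft.Chaos REST reference.

```shell
# Request cancellation of a running experiment through the ARM REST API.
# $1 is the experiment's full resource ID.
# CHAOS_API_VERSION defaults to an assumed value; verify it before use.
cancel_experiment() {
  local experiment_id="$1"
  local api_version="${CHAOS_API_VERSION:-2024-01-01}"
  az rest --method post \
    --url "https://management.azure.com${experiment_id}/cancel?api-version=${api_version}"
}

# Usage (placeholder subscription ID):
#   cancel_experiment "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.Chaos/experiments/AKS-Pod-Kill-Experiment"
```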

Step 5: Start and Monitor the Experiment

Run the experiment and track its progress:

az chaos experiment start \
  --name "AKS-Pod-Kill-Experiment" \
  --resource-group my-resource-group

# Expected output:
# {
#   "name": "AKS-Pod-Kill-Experiment",
#   "status": "Running"
# }

# Check experiment status
az chaos experiment list-execution-details \
  --name "AKS-Pod-Kill-Experiment" \
  --resource-group my-resource-group

# Expected output:
# [
#   {
#     "status": "Succeeded",
#     "faultIds": ["Fault_AzureKubernetesService_PodKill"]
#   }
# ]
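
A single status check rarely catches the final state, because the experiment keeps running for the length of the fault duration. The following is a minimal polling sketch, assuming the caller supplies a command that prints the current status as one word. wait_for_terminal_status, MAX_ATTEMPTS, and SLEEP_SECONDS are hypothetical names. "Succeeded" comes from the expected output above, while "Failed" and "Cancelled" are assumed spellings to confirm against real output.

```shell
# Run the given status command repeatedly until it prints a terminal state.
# Exit codes: 0 = Succeeded, 1 = Failed or Cancelled,
#             2 = the status command itself failed, 3 = timed out.
wait_for_terminal_status() {
  local max_attempts="${MAX_ATTEMPTS:-60}"
  local sleep_seconds="${SLEEP_SECONDS:-15}"
  local attempt=1
  local state
  while [ "$attempt" -le "$max_attempts" ]; do
    state="$("$@")" || return 2
    case "$state" in
      Succeeded) echo "$state"; return 0 ;;
      Failed | Cancelled) echo "$state"; return 1 ;;
    esac
    sleep "$sleep_seconds"
    attempt=$((attempt + 1))
  done
  echo "TimedOut"
  return 3
}

# Usage: get_status is a function you write that prints the status field
# from the execution details command above.
#   wait_for_terminal_status get_status
```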

Learning Path

flowchart LR
  A[AWS Fault Injection] --> B[Azure Chaos Studio]
  B --> C[Latency Injection]
  C --> D[Fault Injection Proxy]
  D --> E[Dependency Testing]
  style B fill:#f90,color:#fff

Common Errors

  1. Resource provider not registered: The Microsoft.Chaos provider must be registered in each subscription where you run experiments.
  2. Targets not enabled before experiment creation: You must enable each resource as a chaos target, and enable the capability for each fault you intend to run, before it can be used in an experiment.
  3. Missing system-assigned identity on the experiment: Azure Chaos Studio executes faults as the experiment's managed identity, and that identity also needs a role assignment on each target resource.
  4. Incorrect selector type for the target: List selectors need the exact resource ID of each target. Query selectors resolve targets at run time from an Azure Resource Graph query.
  5. Fault duration exceeding the resource timeout: Some faults have maximum durations. Check fault documentation before setting long durations.
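
The identity error in item 3 often persists after the identity is created, because the identity still has no permissions on the target. The sketch below looks up the experiment's principal ID and assigns it a role. grant_experiment_access is a hypothetical helper, and the role to pass depends on the fault, so take it from the fault documentation rather than from this example.

```shell
# Grant the experiment's system-assigned identity a role on the target.
# $1: experiment resource ID, $2: scope (target resource ID), $3: role name.
grant_experiment_access() {
  local experiment_id="$1" scope="$2" role="$3"
  local principal_id
  principal_id="$(az resource show --ids "$experiment_id" \
    --query identity.principalId --output tsv)"
  if [ -z "$principal_id" ]; then
    echo "Experiment has no managed identity" >&2
    return 1
  fi
  az role assignment create \
    --assignee-object-id "$principal_id" \
    --assignee-principal-type ServicePrincipal \
    --role "$role" \
    --scope "$scope"
}

# Usage (all three values are placeholders):
#   grant_experiment_access "<experiment-resource-id>" "<aks-cluster-resource-id>" "<role required by the fault>"
```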

Practice Questions

  1. What is the difference between agent-based and agentless faults in Azure Chaos Studio?
  2. How do you enable a target for chaos experiments in Azure?
  3. What role do Azure Monitor alerts play in Chaos Experiment safety?
  4. How do you define a multi-step experiment in Azure Chaos Studio?
  5. What identity configuration is required for Azure Chaos Studio?

Challenge

Create an Azure Chaos Studio experiment that shuts down a single virtual machine in a scale set for 120 seconds. Configure an Azure Monitor alert on the scale set's running instance count as a safety guard. Start the experiment and verify the auto-recovery behavior.
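
The experiment definition expresses durations in ISO 8601 form, as in the "PT60S" value used earlier, so the 120 seconds in the challenge has to be converted. The following is a small conversion sketch. seconds_to_iso8601 is a hypothetical helper that handles whole seconds only and emits hours, minutes, and seconds components.

```shell
# Convert a whole number of seconds to an ISO 8601 duration string.
seconds_to_iso8601() {
  local total="$1"
  local hours=$((total / 3600))
  local minutes=$(((total % 3600) / 60))
  local seconds=$((total % 60))
  local out="PT"
  if [ "$hours" -gt 0 ]; then out="${out}${hours}H"; fi
  if [ "$minutes" -gt 0 ]; then out="${out}${minutes}M"; fi
  # Emit seconds when non-zero, or when the whole duration is zero (PT0S).
  if [ "$seconds" -gt 0 ] || [ "$total" -eq 0 ]; then
    out="${out}${seconds}S"
  fi
  echo "$out"
}

# Usage:
#   seconds_to_iso8601 120
```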

FAQ

What is Azure Chaos Studio?

Azure Chaos Studio is a managed Azure service for running controlled Fault Injection experiments to validate application resilience.

What fault types does Azure Chaos Studio support?

It supports AKS pod kills, VM shutdowns, network latency, CPU pressure, disk I/O delays, and Cosmos DB failover, among others.

Do I need to install agents for Azure Chaos Studio?

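Agentless faults such as VM shutdown need no Chaos Agent. AKS pod faults are also agentless from Chaos Studio's side, but they rely on Chaos Mesh being installed in the cluster. Agent-based faults (CPU pressure, disk I/O) require the Chaos Agent on the VM.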

How does Azure Chaos Studio pricing work?

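Chaos Studio uses usage-based billing. Microsoft's pricing page meters it by how long fault actions run against targets (action-minutes) rather than by experiment count, so check that page for current rates and for whether any free allowance applies.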

Can Azure Chaos Studio roll back faults automatically?

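Faults have a built-in duration, and a fault stops when that duration expires. What happens next depends on the fault: Microsoft's fault library describes the VM shutdown fault as restarting the VM when the duration ends or the experiment is canceled, whereas pods removed by a pod kill come back only if a Kubernetes controller recreates them. Check each fault's documentation for its recovery behavior.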
