Azure Chaos Studio Guide — Managed Fault Injection for Azure Resources

DodaTech | Updated 2026-06-23 | 5 min read

This tutorial introduces Azure Chaos Studio. It covers the key concepts, practical examples, and best practices you need to run controlled Fault Injection experiments on Azure resources.

Azure Chaos Studio is a managed Chaos Engineering service that enables controlled Fault Injection on Azure resources. It integrates natively with Azure Role-Based Access Control, Azure Monitor, and resource tagging to provide a secure Chaos Engineering experience within the Azure ecosystem.

What You Will Learn

This tutorial teaches you how to enable Azure Chaos Studio targets, create experiments with both agentless and agent-based faults, configure Azure Monitor safety guards, run experiments on AKS and VMs, and analyze results.

Why It Matters

Azure Chaos Studio provides a first-party Chaos Engineering solution for organizations invested in Azure. Agentless faults use Azure Resource Manager APIs and require no installation. Agent-based faults cover scenarios like CPU pressure and disk I/O. This flexibility lets you test a wide range of failure modes without managing third-party tools.

Real-World Use

DodaTech runs Azure Chaos Studio experiments on AKS clusters that host backend services for the Doda Browser. Experiments validate that node failures and pod terminations do not cause data loss or extended downtime, ensuring a reliable experience for millions of users.

Prerequisites

Before starting, you should understand:

  • Azure portal navigation, resource groups, and CLI
  • Chaos Engineering fundamentals (hypothesis, Steady State, Blast Radius)
  • Kubernetes concepts and AKS basics
  • Azure Monitor for alerts and metrics

Step 1: Register the Chaos Studio Provider

Enable the Chaos Studio resource provider in your Azure subscription.

# Register the Microsoft.Chaos resource provider
az provider register --namespace Microsoft.Chaos

# Verify registration status
az provider show --namespace Microsoft.Chaos --query registrationState

# Expected output:
# "Registered"

Step 2: Enable Targets

Enable chaos targets on the resources you want to experiment with.

# Enable an AKS cluster as a chaos target
az chaos target create \
  --name "Microsoft-AKS" \
  --resource-group dodatech-rg \
  --resource-name dodatech-aks-cluster \
  --resource-type "Microsoft.ContainerService/managedClusters"

# Expected output:
# {
#   "id": "/subscriptions/.../targets/Microsoft-AKS",
#   "type": "Microsoft.Chaos/targets",
#   "properties": {"agentProfile": null}
# }

# Enable a Virtual Machine as a chaos target
az chaos target create \
  --name "Microsoft-VirtualMachine" \
  --resource-group dodatech-rg \
  --resource-name dodatech-vm-001 \
  --resource-type "Microsoft.Compute/virtualMachines"
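
A chaos target is a child resource of the resource it wraps, so its ID is the parent resource ID followed by a Microsoft.Chaos/targets segment. The sketch below composes such an ID in Python. The helper names and the all-zero subscription ID are placeholders for illustration only, and the target name simply reuses the one from this step.

```python
def aks_cluster_id(subscription_id: str, resource_group: str, cluster: str) -> str:
    """Build the ARM resource ID of an AKS cluster."""
    return (
        f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
        f"/providers/Microsoft.ContainerService/managedClusters/{cluster}"
    )


def chaos_target_id(parent_resource_id: str, target_name: str) -> str:
    """Build the ID of a chaos target nested under its parent resource."""
    parent = parent_resource_id.rstrip("/")
    if not parent.startswith("/subscriptions/"):
        raise ValueError("parent_resource_id must be a full ARM resource ID")
    return f"{parent}/providers/Microsoft.Chaos/targets/{target_name}"


if __name__ == "__main__":
    # Placeholder subscription ID; substitute your own.
    cluster = aks_cluster_id(
        "00000000-0000-0000-0000-000000000000", "dodatech-rg", "dodatech-aks-cluster"
    )
    print(chaos_target_id(cluster, "Microsoft-AKS"))
```

The resulting string is the kind of value a List selector expects in its targets array.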

Step 3: Create an Agentless AKS Experiment

Create an experiment that kills pods in an AKS namespace.

{
  "location": "eastus",
  "identity": {"type": "SystemAssigned"},
  "properties": {
    "steps": [
      {
        "name": "AKS Pod Kill Step",
        "branches": [
          {
            "name": "Kill Pods",
            "actions": [
              {
                "type": "continuous",
                "name": "aks-pod-kill",
                "faultId": "Fault_AzureKubernetesService_PodKill",
                "parameters": [
                  {"key": "namespaceField", "value": "default"},
                  {"key": "labelSelector", "value": "app=web-service"}
                ],
                "duration": "PT60S",
                "selectorId": "TargetA"
              }
            ]
          }
        ]
      }
    ],
    "selectors": [
      {
        "id": "TargetA",
        "type": "List",
        "targets": [
          {
            "type": "Microsoft.Chaos/targets",
            "id": "/subscriptions/.../targets/Microsoft-AKS"
          }
        ]
      }
    ]
  }
}

Save the definition as aks-experiment.json, then create the experiment from it.

# Create the experiment
az chaos experiment create \
  --name "AKS-Pod-Kill-Experiment" \
  --resource-group dodatech-rg \
  --experiment-file aks-experiment.json

# Expected output:
# {
#   "name": "AKS-Pod-Kill-Experiment",
#   "provisioningState": "Succeeded"
# }
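
Before submitting a definition file, it can help to load it locally and list what it will do. The Python sketch below is one way to do that, assuming the file follows the shape used in this step (steps, branches, actions, each action with a name, faultId, and an ISO 8601 duration such as PT60S). The function names are illustrative, and the duration parser only handles hours, minutes, and seconds.

```python
import json
import re
from typing import List, Tuple

# Matches durations such as PT60S, PT5M, or PT1H30M.
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def duration_seconds(value: str) -> int:
    """Convert a simple ISO 8601 time duration to seconds."""
    match = _DURATION.match(value)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {value!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def list_actions(definition: dict) -> List[Tuple[str, str, int]]:
    """Return (action name, faultId, duration in seconds) for every action."""
    rows = []
    for step in definition["properties"]["steps"]:
        for branch in step["branches"]:
            for action in branch["actions"]:
                rows.append(
                    (action["name"], action["faultId"],
                     duration_seconds(action["duration"]))
                )
    return rows


def load_definition(path: str) -> dict:
    """Load the file; json.JSONDecodeError reports the line of any syntax error."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Calling `list_actions(load_definition("aks-experiment.json"))` surfaces syntax errors such as an unterminated string before the service ever sees the file.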

Step 4: Configure Safety Guards

Set up Azure Monitor alerts that automatically stop experiments when thresholds are breached.

# Create a metric alert that fires when the request count exceeds 10 over a 5-minute window
az monitor metrics alert create \
  --name "Chaos-Error-Rate-Alert" \
  --resource-group dodatech-rg \
  --scopes "/subscriptions/..." \
  --condition "count 'requests' > 10" \
  --window-size 5m \
  --evaluation-frequency 1m

# Expected output:
# {
#   "name": "Chaos-Error-Rate-Alert",
#   "enabled": true
# }
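
To build intuition for how the 5-minute window and 1-minute evaluation frequency interact, the sketch below replays a series of per-minute request counts against the same threshold. It is a deliberately simplified model, assuming the windowed value is the plain sum of the per-minute counts. It is not how Azure Monitor evaluates alerts internally, and the function name is made up for this example.

```python
from typing import List, Sequence


def alert_fires(
    per_minute_counts: Sequence[int],
    threshold: int = 10,
    window_minutes: int = 5,
) -> List[bool]:
    """For each one-minute evaluation, report whether the windowed total exceeds the threshold."""
    results = []
    for end in range(1, len(per_minute_counts) + 1):
        # Look back over at most window_minutes samples ending at this minute.
        window = per_minute_counts[max(0, end - window_minutes):end]
        results.append(sum(window) > threshold)
    return results


if __name__ == "__main__":
    print(alert_fires([1, 2, 3, 4, 5, 0, 0]))
```

In this model the alert stays active for several evaluations after traffic drops, because earlier minutes remain inside the window.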

Step 5: Start and Monitor the Experiment

Run the experiment and check execution details.

# Start the experiment
az chaos experiment start \
  --name "AKS-Pod-Kill-Experiment" \
  --resource-group dodatech-rg

# Expected output:
# {
#   "name": "AKS-Pod-Kill-Experiment",
#   "status": "Running"
# }

# Check execution details
az chaos experiment list-execution-details \
  --name "AKS-Pod-Kill-Experiment" \
  --resource-group dodatech-rg

# Expected output:
# [
#   {
#     "status": "Succeeded",
#     "faultIds": ["Fault_AzureKubernetesService_PodKill"]
#   }
# ]
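
When an experiment has run several times, a short summary of the execution details is easier to read than the raw JSON. The following Python sketch tallies statuses and fault IDs, assuming the output has the shape shown in the expected output above (a list of objects with status and faultIds). Real output may carry more fields, which this helper ignores.

```python
import json
from collections import Counter
from typing import Any, Dict, List


def summarize_executions(details: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Count executions per status and collect the distinct fault IDs."""
    statuses = Counter(item.get("status", "Unknown") for item in details)
    faults = sorted({fault for item in details for fault in item.get("faultIds", [])})
    return {"statuses": dict(statuses), "faults": faults}


if __name__ == "__main__":
    # Sample copied from the expected output in this step.
    raw = '[{"status": "Succeeded", "faultIds": ["Fault_AzureKubernetesService_PodKill"]}]'
    print(summarize_executions(json.loads(raw)))
```

The same function can be fed the CLI output directly by saving it to a file and loading it with `json.load`.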

Learning Path

flowchart LR
  A[AWS Fault Injection] --> B[Azure Chaos Studio]
  B --> C[Latency Injection]
  C --> D[Fault Injection Proxy]
  D --> E[Dependency Testing]
  style B fill:#f90,color:#fff

Common Errors

  1. Resource provider not registered in the subscription: The Microsoft.Chaos provider must be registered in each subscription where you intend to run experiments. This is a one-time step per subscription.
  2. Targets not enabled before experiment creation: Each resource must be explicitly enabled as a chaos target. An experiment referencing a non-enabled target fails at validation time.
  3. Missing system-assigned managed identity on the experiment: Chaos Studio requires a managed identity to execute faults. The identity must have the appropriate RBAC role on target resources.
  4. Incorrect selector type for the target: List selectors require exact resource IDs. Query selectors use tags or names. Using the wrong selector type causes experiment validation failures.
  5. Fault duration exceeding the maximum limit: Some fault types have maximum duration limits. Check the documentation for specific fault constraints before setting long durations.
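
Some of the errors above can be caught locally before an experiment is submitted. The Python sketch below mirrors items 3 and 4 of this list against a definition in the shape used in Step 3. The function name and messages are illustrative, the checks are a rough heuristic rather than the service's own validation, and RBAC role assignments cannot be verified this way.

```python
from typing import List


def preflight(definition: dict) -> List[str]:
    """Return human-readable problems found in an experiment definition."""
    problems = []

    # Item 3: the tutorial's experiments rely on a system-assigned identity.
    identity = definition.get("identity") or {}
    if identity.get("type") != "SystemAssigned":
        problems.append("experiment has no system-assigned managed identity")

    # Item 4: List selectors need exact target resource IDs.
    selectors = definition.get("properties", {}).get("selectors", [])
    if not selectors:
        problems.append("experiment defines no selectors")
    for selector in selectors:
        if selector.get("type") != "List":
            continue
        targets = selector.get("targets", [])
        if not targets:
            problems.append(f"selector {selector.get('id')!r} lists no targets")
        for target in targets:
            target_id = target.get("id", "")
            if not target_id.startswith("/subscriptions/") or "/targets/" not in target_id:
                problems.append(
                    f"selector {selector.get('id')!r} has a target ID "
                    f"that is not a full resource ID: {target_id!r}"
                )
    return problems
```

An empty list means none of these particular problems were found; it does not guarantee that the experiment will pass validation.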

Practice Questions

  1. What is the difference between agentless and agent-based faults in Azure Chaos Studio?
  2. How do you enable a resource as a chaos target in Azure Chaos Studio?
  3. What role do Azure Monitor alerts play in Chaos Experiment safety?
  4. How do you define a multi-branch experiment in Azure Chaos Studio?
  5. What identity and RBAC configuration is required for Azure Chaos Studio experiments?

Challenge

Create an Azure Chaos Studio experiment that shuts down a single virtual machine in a scale set for 120 seconds. Configure an Azure Monitor alert on the scale set's running instance count as a safety guard with a threshold of 80 percent. Start the experiment, verify the auto-recovery behavior, and generate an execution report.

FAQ

What is Azure Chaos Studio?

Azure Chaos Studio is a managed Azure service for running controlled Fault Injection experiments to validate application resilience across Azure resources.

What fault types does Azure Chaos Studio support?

It supports AKS pod kills, VM shutdowns, network latency, CPU pressure, disk I/O delays, and Cosmos DB failures among others. Both agentless and agent-based fault types are available.

Do I need to install agents for Azure Chaos Studio?

Agentless faults need no installation. Agent-based faults (CPU pressure, disk I/O) require the Chaos Agent to be installed on the target VM.

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
