AWS Chaos Engineering — Fault Injection Service for Cloud Workloads

DodaTech Updated 2026-06-23 5 min read

This tutorial walks through Chaos Engineering on AWS with Fault Injection Service (FIS): the key concepts, a worked EC2 example, and the practices that keep experiments safe.

AWS Fault Injection Service (FIS) is a managed Chaos Engineering service that enables you to run controlled Fault Injection experiments on AWS workloads. It provides pre-built fault actions for EC2, ECS, EKS, RDS, and DynamoDB with native integration into AWS IAM, CloudWatch, and Systems Manager.

What You Will Learn

This tutorial teaches you how to create AWS FIS experiment templates, target AWS resources, configure CloudWatch stop conditions, run experiments from the CLI, and analyze the results with the SDK.

Why It Matters

AWS FIS eliminates the need to install and maintain third-party Chaos Engineering tools on AWS. Experiments are defined using the same IAM roles, CloudWatch alarms, and tagging conventions you already manage. This reduces operational overhead and makes chaos experiments auditable through AWS CloudTrail.

Real-World Use

DodaTech uses AWS FIS to validate the resilience of Durga Antivirus Pro scanning clusters running on EC2 Spot Instances. FIS experiments verify that the cluster can absorb instance terminations without interrupting ongoing malware scans, which lets the team stay on Spot capacity and save thousands of dollars per month in compute costs.

Prerequisites

Before starting you should understand:

AWS console navigation, IAM roles, and EC2 basics
Chaos Engineering concepts (hypothesis, Steady State, blast radius)
How Kubernetes or ECS workloads are structured on AWS
CloudWatch alarms for monitoring

Step 1: Create the IAM Role

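FIS needs an IAM role that it can assume, with permissions for every action the experiment performs. The policy below covers the EC2 actions used in this tutorial. ec2:StartInstances lets FIS restart an instance it has stopped, and ec2:DescribeInstances sits in the second statement because it does not support resource-level permissions.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:TerminateInstances",
        "ec2:StopInstances",
        "ec2:StartInstances",
        "ec2:RebootInstances"
      ],
      "Resource": "arn:aws:ec2:us-east-1:*:instance/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstances",
        "cloudwatch:DescribeAlarms",
        "cloudwatch:PutMetricData"
      ],
      "Resource": "*"
    }
  ]
}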

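# Create the IAM role for FIS
aws iam create-role \
  --role-name FISExperimentRole \
  --assume-role-policy-document file://fis-trust-policy.json

# Expected output (abbreviated):
# {
#     "Role": {
#         "RoleName": "FISExperimentRole",
#         "Arn": "arn:aws:iam::123456789012:role/FISExperimentRole"
#     }
# }

# Attach the permissions policy shown above to the role.
# The policy name and file name are placeholders; save the JSON above
# under whatever file name you pass here.
aws iam put-role-policy \
  --role-name FISExperimentRole \
  --policy-name FISExperimentPolicy \
  --policy-document file://fis-permissions-policy.json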
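
The create-role command reads fis-trust-policy.json, which this tutorial does not show. The following is a minimal sketch of how that file could be generated, assuming the role only needs to be assumable by the FIS service principal (fis.amazonaws.com). The helper names build_fis_trust_policy and write_policy are illustrative, not part of any AWS tooling.

```python
import json
from typing import Dict


def build_fis_trust_policy() -> Dict:
    """Return a trust policy that lets the FIS service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Assumption: only the FIS service principal needs to assume this role
                "Principal": {"Service": "fis.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def write_policy(path: str, policy: Dict) -> None:
    """Write a policy document to disk as indented JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(policy, handle, indent=2)
```

Calling write_policy("fis-trust-policy.json", build_fis_trust_policy()) produces a file that the create-role command can read. Tightening the trust policy further, for example with condition keys, is left out of this sketch.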

Step 2: Create an Experiment Template

Define an experiment that stops a single EC2 instance for 60 seconds.

{
  "description": "Stop one EC2 instance for 60 seconds",
  "targets": {
    "instanceTarget": {
      "resourceType": "aws:ec2:instance",
      "resourceArns": [
        "arn:aws:ec2:us-east-1:123456789012:instance/i-0abc123def456"
      ],
      "selectionMode": "ALL"
    }
  },
  "actions": {
    "stopInstance": {
      "actionId": "aws:ec2:stop-instances",
      "parameters": {
        "startInstancesAfterDuration": "PT1M"
      },
      "targets": {
        "Instances": "instanceTarget"
      }
    }
  },
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:FIS-ErrorRate-Alarm"
    }
  ],
  "roleArn": "arn:aws:iam::123456789012:role/FISExperimentRole"
}

aws fis create-experiment-template \
  --cli-input-json file://ec2-stop-template.json

# Expected output:
# {
#     "experimentTemplate": {
#         "id": "ext-abc123def456",
#         "description": "Stop one EC2 instance for 60 seconds"
#     }
# }
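
In the template, each action refers to a target by the name defined under "targets" ("instanceTarget" here), and a mismatched name is an easy mistake to make. The sketch below builds the same template as a Python dictionary and checks those references before the file is handed to the CLI. The function names are hypothetical, and the startInstancesAfterDuration parameter (an ISO 8601 duration) is assumed to be what restarts the instance after the stop.

```python
from typing import Dict, List


def build_stop_instance_template(
    instance_arn: str, alarm_arn: str, role_arn: str, restart_after: str = "PT1M"
) -> Dict:
    """Build a template that stops one instance and restarts it after a delay."""
    return {
        "description": "Stop one EC2 instance for 60 seconds",
        "targets": {
            "instanceTarget": {
                "resourceType": "aws:ec2:instance",
                "resourceArns": [instance_arn],
                "selectionMode": "ALL",
            }
        },
        "actions": {
            "stopInstance": {
                "actionId": "aws:ec2:stop-instances",
                # Assumption: ISO 8601 duration after which FIS restarts the instance
                "parameters": {"startInstancesAfterDuration": restart_after},
                # Maps the action's target key to a name defined under "targets"
                "targets": {"Instances": "instanceTarget"},
            }
        },
        "stopConditions": [{"source": "aws:cloudwatch:alarm", "value": alarm_arn}],
        "roleArn": role_arn,
    }


def find_undefined_targets(template: Dict) -> List[str]:
    """Return 'action -> target' pairs whose target name is not defined."""
    defined = set(template.get("targets", {}))
    problems = []
    for action_name, action in template.get("actions", {}).items():
        for target_name in action.get("targets", {}).values():
            if target_name not in defined:
                problems.append(f"{action_name} -> {target_name}")
    return problems
```

An empty list from find_undefined_targets means every action points at a defined target. The dictionary can then be written out with json.dump and passed to create-experiment-template through --cli-input-json.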

Step 3: Set Stop Conditions

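Create the CloudWatch alarm that the template's stopConditions entry references. When the alarm enters the ALARM state, FIS stops the experiment. The template from Step 2 points at this alarm by ARN, so the alarm should exist before you start the experiment.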

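# Create a CloudWatch alarm as stop condition.
# ErrorRate is assumed to be a metric your application publishes; replace
# --namespace and --metric-name with the values that match your workload.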
aws cloudwatch put-metric-alarm \
  --alarm-name FIS-ErrorRate-Alarm \
  --alarm-description "Stop FIS experiment if error rate exceeds 5%" \
  --metric-name ErrorRate \
  --namespace AWS/FIS \
  --statistic Average \
  --period 60 \
  --threshold 5.0 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 1

# Expected output: (success returns no output)
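
The alarm and the template are linked only by the alarm ARN, so a typo in either place leaves the experiment without a working stop condition. The sketch below derives the ARN and the stopConditions entry from the same inputs and prepares keyword arguments that mirror the CLI flags above. The helper names and the "Custom/App" namespace in the usage note are hypothetical, and create_alarm needs boto3 plus AWS credentials to run.

```python
from typing import Dict


def build_alarm_arn(region: str, account_id: str, alarm_name: str) -> str:
    """Compose the CloudWatch alarm ARN that the FIS template references."""
    return f"arn:aws:cloudwatch:{region}:{account_id}:alarm:{alarm_name}"


def build_stop_condition(alarm_arn: str) -> Dict[str, str]:
    """Return one entry for the template's stopConditions list."""
    return {"source": "aws:cloudwatch:alarm", "value": alarm_arn}


def build_error_rate_alarm(
    namespace: str, metric_name: str, threshold: float = 5.0
) -> Dict:
    """Keyword arguments for put_metric_alarm, mirroring the CLI flags above."""
    return {
        "AlarmName": "FIS-ErrorRate-Alarm",
        "AlarmDescription": "Stop FIS experiment if error rate exceeds 5%",
        "MetricName": metric_name,
        "Namespace": namespace,  # assumption: a namespace your application publishes to
        "Statistic": "Average",
        "Period": 60,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "EvaluationPeriods": 1,
    }


def create_alarm(alarm_kwargs: Dict) -> None:
    """Create the alarm; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the helpers above work without AWS access

    boto3.client("cloudwatch").put_metric_alarm(**alarm_kwargs)
```

For example, build_alarm_arn("us-east-1", "123456789012", "FIS-ErrorRate-Alarm") yields the ARN used in the Step 2 template, and create_alarm(build_error_rate_alarm("Custom/App", "ErrorRate")) would create the alarm itself.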

Step 4: Start and Monitor the Experiment

Run the experiment and track its status.

# Start the experiment
aws fis start-experiment \
  --experiment-template-id ext-abc123def456

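# Expected output (abbreviated; a new experiment usually reports "initiating"
# first and moves to "running" shortly after):
# {
#     "experiment": {
#         "id": "exp-xyz789ghi012",
#         "experimentTemplateId": "ext-abc123def456",
#         "state": {"status": "initiating"}
#     }
# }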

# Monitor experiment status
aws fis get-experiment --id exp-xyz789ghi012

# Expected output (abbreviated; "actions" is a map keyed by action name):
# {
#     "experiment": {
#         "id": "exp-xyz789ghi012",
#         "state": {"status": "completed"},
#         "actions": {
#             "stopInstance": {
#                 "actionId": "aws:ec2:stop-instances",
#                 "state": {"status": "completed"}
#             }
#         }
#     }
# }
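
Running get-experiment by hand works for a single check, but an experiment moves through several states before it finishes. The sketch below polls until the experiment reaches a terminal state. The function name, the polling interval, and the timeout are illustrative choices, and the set of terminal statuses is an assumption based on the states FIS reports. The get_experiment argument is any callable with the same shape as the boto3 FIS client's get_experiment method.

```python
import time
from typing import Callable, Dict

# Assumption: these are the statuses after which an experiment no longer changes
TERMINAL_STATUSES = {"completed", "stopped", "failed", "cancelled"}


def wait_for_experiment(
    get_experiment: Callable[..., Dict],
    experiment_id: str,
    poll_seconds: float = 10.0,
    timeout_seconds: float = 900.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Poll until the experiment reaches a terminal status, then return it."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        experiment = get_experiment(id=experiment_id)["experiment"]
        status = experiment["state"]["status"]
        if status in TERMINAL_STATUSES:
            return experiment
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"{experiment_id} still in status {status!r} after {timeout_seconds}s"
            )
        sleep(poll_seconds)
```

With boto3 installed, a call such as wait_for_experiment(boto3.client("fis").get_experiment, "exp-xyz789ghi012") would block until the experiment finishes. Passing the callable in keeps the loop testable without AWS access.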

Step 5: Analyze and Report

Use the AWS SDK to analyze experiment results programmatically.

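import boto3

fis = boto3.client("fis")

experiment_id = "exp-xyz789ghi012"
response = fis.get_experiment(id=experiment_id)
experiment = response["experiment"]

# "status" is the lifecycle value; "reason" explains how the experiment got there
status = experiment["state"]["status"]
reason = experiment["state"].get("reason", "")

print(f"Experiment: {experiment_id}")
print(f"Status: {status}")

# "actions" is a map of action name -> action details
for action_name, action in experiment["actions"].items():
    print(f"  Action: {action['actionId']} -> {action['state']['status']}")

# "stopConditions" lists the configured alarms, not the ones that fired.
# An experiment halted by a stop condition (or manually) ends as "stopped".
if status == "stopped":
    print(f"  Experiment stopped early: {reason}")
elif status == "completed":
    print("  No stop condition triggered (experiment ran full duration)")
else:
    print(f"  Experiment ended with status {status}: {reason}")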

Expected output:

Experiment: exp-xyz789ghi012
Status: completed
  Action: aws:ec2:stop-instances -> completed
  No stop condition triggered (experiment ran full duration)
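
For a written report, it can help to reduce the response to a few fields. The sketch below does that for the "experiment" dictionary returned by get_experiment, assuming the response carries startTime and endTime as datetime values next to state and actions. The function name and the SAMPLE_EXPERIMENT data are made up for illustration.

```python
from datetime import datetime, timezone
from typing import Dict


def summarize_experiment(experiment: Dict) -> Dict:
    """Reduce an experiment description to the fields a report usually needs."""
    state = experiment.get("state", {})
    start = experiment.get("startTime")
    end = experiment.get("endTime")
    duration = (end - start).total_seconds() if start and end else None
    # Collect the names of actions that did not reach "completed"
    unfinished = sorted(
        name
        for name, action in experiment.get("actions", {}).items()
        if action.get("state", {}).get("status") != "completed"
    )
    return {
        "status": state.get("status"),
        "reason": state.get("reason", ""),
        "duration_seconds": duration,
        "unfinished_actions": unfinished,
    }


# Hypothetical sample shaped like the "experiment" part of a response
SAMPLE_EXPERIMENT = {
    "id": "exp-xyz789ghi012",
    "state": {"status": "completed", "reason": "Experiment completed."},
    "startTime": datetime(2026, 6, 23, 12, 0, 0, tzinfo=timezone.utc),
    "endTime": datetime(2026, 6, 23, 12, 1, 30, tzinfo=timezone.utc),
    "actions": {
        "stopInstance": {
            "actionId": "aws:ec2:stop-instances",
            "state": {"status": "completed"},
        }
    },
}
```

Here summarize_experiment(SAMPLE_EXPERIMENT) reports a duration of 90.0 seconds and an empty unfinished_actions list. A non-empty list points at the actions worth investigating first.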

Learning Path

flowchart LR
  A[Gremlin] --> B[AWS Fault Injection]
  B --> C[Azure Chaos Studio]
  C --> D[Latency Injection]
  D --> E[Fault Injection Proxy]
  style B fill:#f90,color:#fff

Common Errors

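Insufficient IAM permissions on the FIS role: The role must include permissions for every action the experiment performs. A missing permission such as ec2:TerminateInstances typically surfaces only when the experiment runs and the action fails, so check the action's state and reason after each run.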
Missing stop conditions on production experiments: Always configure at least one CloudWatch alarm as a stop condition. Without it experiments may run longer than intended.
Targeting resources in the wrong AWS region: FIS experiments are region-scoped. Ensure resource ARNs match the region where the experiment template was created.
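Using incorrect resource ARN formats: ARNs must include the region, account ID, and resource ID. aws ec2 describe-instances returns instance IDs rather than ARNs, so use it to confirm the instance ID and then build the ARN as arn:aws:ec2:<region>:<account-id>:instance/<instance-id>.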
Not testing stop conditions before running: Create a test alarm that triggers immediately and verify the experiment stops correctly before running real experiments.
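
The region and ARN-format mistakes above can be caught before a template is created. The following is a minimal sketch of such a check for EC2 instance ARNs. The regular expression is a deliberately loose approximation of the ARN layout, not an official grammar, and check_instance_arn is a hypothetical helper name.

```python
import re
from typing import List

# Loose pattern: arn:<partition>:ec2:<region>:<12-digit account>:instance/<instance id>
_INSTANCE_ARN = re.compile(
    r"^arn:(?P<partition>aws[a-z-]*):ec2:(?P<region>[a-z0-9-]+):"
    r"(?P<account>\d{12}):instance/(?P<instance_id>i-[0-9a-f]+)$"
)


def check_instance_arn(arn: str, template_region: str) -> List[str]:
    """Return a list of problems found in an EC2 instance ARN (empty if none)."""
    match = _INSTANCE_ARN.match(arn)
    if match is None:
        return [f"not a well-formed EC2 instance ARN: {arn}"]
    problems = []
    if match.group("region") != template_region:
        problems.append(
            f"ARN region {match.group('region')} does not match "
            f"template region {template_region}"
        )
    return problems
```

Checking the tutorial's ARN against "us-east-1" returns an empty list, while checking it against "eu-west-1" reports the region mismatch. A bare instance ID such as "i-0abc123def456" is rejected as malformed.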

Practice Questions

What IAM role configuration is required before creating an AWS FIS experiment?
How do stop conditions work in AWS FIS and why are they important?
What AWS resource types can be targeted with FIS experiments?
How do you monitor an active FIS experiment from the CLI?
What happens when a stop condition alarm is triggered during a running experiment?

Challenge

Create an AWS FIS experiment that terminates one EC2 instance in an Auto Scaling group with a minimum of three instances. As a stop condition, configure a CloudWatch alarm on the group's healthy instance count that fires when the count drops below 2. Start the experiment and verify that the Auto Scaling group launches a replacement instance within five minutes.

FAQ

What is AWS Fault Injection Service?

AWS FIS is a managed Chaos Engineering service that lets you run controlled Fault Injection experiments on AWS workloads using pre-built templates and CloudWatch safety controls.

Which AWS services does FIS support?

FIS supports EC2, ECS, EKS, RDS, DynamoDB, and more. The supported action list grows with each AWS release.

Do I need to install agents to use AWS FIS?

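Not for API-level actions such as stopping, rebooting, or terminating instances, which FIS performs through AWS APIs. Actions that run through Systems Manager, such as CPU or memory stress on an instance, require SSM Agent on the target.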

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
