AWS Chaos Engineering — Fault Injection Service for Cloud Workloads
In this tutorial, you'll learn about AWS Chaos Engineering. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.
AWS Fault Injection Service (FIS) is a managed Chaos Engineering service that enables you to run controlled Fault Injection experiments on AWS workloads. It provides pre-built fault actions for EC2, ECS, EKS, RDS, and DynamoDB with native integration into AWS IAM, CloudWatch, and Systems Manager.
What You Will Learn
This tutorial teaches you how to create AWS FIS experiment templates, target different AWS resource types, configure CloudWatch stop conditions, run experiments from CLI and SDK, and analyze results.
Why It Matters
AWS FIS eliminates the need to install and maintain third-party Chaos Engineering tools on AWS. Experiments are defined using the same IAM roles, CloudWatch alarms, and tagging conventions you already manage. This reduces operational overhead and makes chaos experiments auditable through AWS CloudTrail.
Real-World Use
DodaTech uses AWS FIS to validate the resilience of Durga Antivirus Pro scanning clusters running on EC2 Spot Instances. FIS experiments verify that the cluster can absorb instance terminations without interrupting ongoing malware scans, saving thousands of dollars per month in compute costs.
Prerequisites
Before starting you should understand:
- AWS console navigation, IAM roles, and EC2 basics
- Chaos Engineering concepts (hypothesis, Steady State, blast radius)
- How Kubernetes or ECS workloads are structured on AWS
- CloudWatch alarms for monitoring
Step 1: Create the IAM Role
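FIS needs an IAM role that it can assume and that grants permission to act on your resources. The permissions policy below covers the EC2 actions used in this tutorial. ec2:DescribeInstances does not support resource-level restrictions, so it sits in the statement with "Resource": "*", and ec2:StartInstances lets FIS restart an instance it has stopped.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:TerminateInstances",
        "ec2:StopInstances",
        "ec2:StartInstances",
        "ec2:RebootInstances"
      ],
      "Resource": "arn:aws:ec2:us-east-1:*:instance/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstances",
        "cloudwatch:DescribeAlarms",
        "cloudwatch:PutMetricData"
      ],
      "Resource": "*"
    }
  ]
}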
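# Create the IAM role for FIS (the trust policy must let fis.amazonaws.com assume the role)
aws iam create-role \
  --role-name FISExperimentRole \
  --assume-role-policy-document file://fis-trust-policy.json
# Expected output (abbreviated):
# {
#   "Role": {
#     "RoleName": "FISExperimentRole",
#     "Arn": "arn:aws:iam::123456789012:role/FISExperimentRole"
#   }
# }
# Attach the permissions policy shown above to the role
# (assumes it was saved as fis-permissions-policy.json; the file and policy names are placeholders)
aws iam put-role-policy \
  --role-name FISExperimentRole \
  --policy-name FISExperimentPolicy \
  --policy-document file://fis-permissions-policy.json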
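The create-role command reads fis-trust-policy.json, which this tutorial does not show. The following is a minimal sketch of how that file could be generated, assuming the role only needs to be assumable by the FIS service principal (fis.amazonaws.com). The function name is illustrative and not part of any AWS SDK.

```python
import json


def build_fis_trust_policy() -> dict:
    """Return a trust policy that lets the FIS service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Service principal for AWS Fault Injection Service
                "Principal": {"Service": "fis.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


# Print the document; redirect the output to fis-trust-policy.json to use it
print(json.dumps(build_fis_trust_policy(), indent=2))
```

Running the script with its output redirected to fis-trust-policy.json produces the file that the create-role command expects.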
Step 2: Create an Experiment Template
Define an experiment that stops a single EC2 instance for 60 seconds, and save it as ec2-stop-template.json. The startInstancesAfterDuration parameter (an ISO 8601 duration) tells FIS to start the instance again after one minute; without it the instance stays stopped.
{
  "description": "Stop one EC2 instance for 60 seconds",
  "targets": {
    "instanceTarget": {
      "resourceType": "aws:ec2:instance",
      "resourceArns": [
        "arn:aws:ec2:us-east-1:123456789012:instance/i-0abc123def456"
      ],
      "selectionMode": "ALL"
    }
  },
  "actions": {
    "stopInstance": {
      "actionId": "aws:ec2:stop-instances",
      "parameters": {
        "startInstancesAfterDuration": "PT1M"
      },
      "targets": {
        "Instances": "instanceTarget"
      }
    }
  },
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:FIS-ErrorRate-Alarm"
    }
  ],
  "roleArn": "arn:aws:iam::123456789012:role/FISExperimentRole"
}
aws fis create-experiment-template \
--cli-input-json file://ec2-stop-template.json
# Expected output:
# {
# "experimentTemplate": {
# "id": "ext-abc123def456",
# "description": "Stop one EC2 instance for 60 seconds"
# }
# }
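A template is easier to debug before it is submitted than after. The sketch below is a local pre-flight check, not an AWS API: it assumes the template is a dictionary with the same keys as the JSON in this step and reports three common mistakes (an action pointing at an undefined target, no alarm-based stop condition, a missing role ARN). The function name validate_template is hypothetical.

```python
def validate_template(template: dict) -> list:
    """Return a list of human-readable problems found in an FIS template dict."""
    problems = []
    targets = template.get("targets", {})

    # Every target name used by an action must be defined under "targets"
    for action_name, action in template.get("actions", {}).items():
        for target_name in action.get("targets", {}).values():
            if target_name not in targets:
                problems.append(
                    f"action '{action_name}' references undefined target '{target_name}'"
                )

    # At least one stop condition should be backed by a CloudWatch alarm
    sources = [cond.get("source") for cond in template.get("stopConditions", [])]
    if "aws:cloudwatch:alarm" not in sources:
        problems.append("no CloudWatch alarm stop condition configured")

    if not template.get("roleArn"):
        problems.append("roleArn is missing")

    return problems


example = {
    "targets": {"instanceTarget": {"resourceType": "aws:ec2:instance"}},
    "actions": {
        "stopInstance": {
            "actionId": "aws:ec2:stop-instances",
            "targets": {"Instances": "instanceTarget"},
        }
    },
    "stopConditions": [
        {
            "source": "aws:cloudwatch:alarm",
            "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:FIS-ErrorRate-Alarm",
        }
    ],
    "roleArn": "arn:aws:iam::123456789012:role/FISExperimentRole",
}
print(validate_template(example))
```

An empty list means none of these three checks failed; it does not guarantee that FIS will accept the template.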
Step 3: Set Stop Conditions
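Create a CloudWatch alarm that stops the experiment automatically if the error rate exceeds a threshold. The template in Step 2 references this alarm by ARN, so create the alarm before you start the experiment. The ErrorRate metric and AWS/FIS namespace below are placeholders: point the alarm at a metric your workload actually publishes, because an alarm on a metric with no data stays in INSUFFICIENT_DATA and never stops the experiment.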
# Create a CloudWatch alarm as stop condition
aws cloudwatch put-metric-alarm \
--alarm-name FIS-ErrorRate-Alarm \
--alarm-description "Stop FIS experiment if error rate exceeds 5%" \
--metric-name ErrorRate \
--namespace AWS/FIS \
--statistic Average \
--period 60 \
--threshold 5.0 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 1
# Expected output: (success returns no output)
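The alarm and the template's stop condition must agree on the alarm name, region, and account. As a sketch, the two helpers below derive both from the same inputs. The dictionary keys mirror the keyword arguments of boto3's put_metric_alarm, while the helper names and the "MyApp" namespace are assumptions made for illustration.

```python
def build_alarm_arguments(name, namespace, metric_name, threshold,
                          period_seconds=60, evaluation_periods=1):
    """Build keyword arguments for a CloudWatch put_metric_alarm call."""
    return {
        "AlarmName": name,
        "AlarmDescription": f"Stop FIS experiment if {metric_name} exceeds {threshold}",
        "Namespace": namespace,
        "MetricName": metric_name,
        "Statistic": "Average",
        "Period": period_seconds,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
    }


def stop_condition_for(alarm_name, region, account_id):
    """Build the matching stopConditions entry for an FIS experiment template."""
    return {
        "source": "aws:cloudwatch:alarm",
        "value": f"arn:aws:cloudwatch:{region}:{account_id}:alarm:{alarm_name}",
    }


# "MyApp" is a hypothetical custom namespace; use the one your workload publishes to
alarm_args = build_alarm_arguments("FIS-ErrorRate-Alarm", "MyApp", "ErrorRate", 5.0)
stop_condition = stop_condition_for("FIS-ErrorRate-Alarm", "us-east-1", "123456789012")
print(alarm_args)
print(stop_condition)
```

With boto3 installed and credentials configured, the arguments could be passed as boto3.client("cloudwatch").put_metric_alarm(**alarm_args).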
Step 4: Start and Monitor the Experiment
Run the experiment and track its status.
# Start the experiment
aws fis start-experiment \
--experiment-template-id ext-abc123def456
# Expected output (abbreviated; a newly started experiment typically reports "initiating" before "running"):
# {
#   "experiment": {
#     "id": "exp-xyz789ghi012",
#     "experimentTemplateId": "ext-abc123def456",
#     "state": {"status": "initiating"}
#   }
# }
# Monitor experiment status
aws fis get-experiment --id exp-xyz789ghi012
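# Expected output (abbreviated; "actions" is a map keyed by the action name from the template):
# {
#   "experiment": {
#     "id": "exp-xyz789ghi012",
#     "state": {"status": "completed"},
#     "actions": {"stopInstance": {"actionId": "aws:ec2:stop-instances", "state": {"status": "completed"}}}
#   }
# }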
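Running get-experiment by hand works for a single run, but scripted runs usually poll until the experiment finishes. The sketch below shows one way to do that. It takes the lookup function as a parameter so that it can wrap boto3's fis.get_experiment or a stub, and the set of terminal statuses is an assumption to adjust for your SDK version.

```python
import time

# Statuses assumed to be final; extend this set if your SDK reports others
TERMINAL_STATUSES = {"completed", "stopped", "failed", "cancelled"}


def wait_for_experiment(get_experiment, experiment_id,
                        poll_seconds=10.0, timeout_seconds=600.0,
                        sleep=time.sleep):
    """Poll until the experiment reaches a terminal status and return it."""
    waited = 0.0
    while True:
        response = get_experiment(id=experiment_id)
        status = response["experiment"]["state"]["status"]
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(
                f"experiment {experiment_id} still '{status}' after {waited:.0f}s"
            )
        sleep(poll_seconds)
        waited += poll_seconds


# Stub that replays a sequence of statuses, standing in for the real client call
_replayed = iter(["initiating", "running", "running", "completed"])


def fake_get_experiment(id):
    return {"experiment": {"id": id, "state": {"status": next(_replayed)}}}


final_status = wait_for_experiment(
    fake_get_experiment, "exp-xyz789ghi012", sleep=lambda seconds: None
)
print(final_status)
```

Against a real account, the first argument would be the get_experiment method of a boto3 FIS client, with the default sleep left in place.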
Step 5: Analyze and Report
Use the AWS SDK to analyze experiment results programmatically.
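import boto3

fis = boto3.client("fis")
experiment_id = "exp-xyz789ghi012"

response = fis.get_experiment(id=experiment_id)
state = response["experiment"]["state"]
actions = response["experiment"]["actions"]

print(f"Experiment: {experiment_id}")
print(f"Status: {state['status']}")

# "actions" maps each action name from the template to its details
for action_name, action in actions.items():
    print(f"  Action: {action['actionId']} -> {action['state']['status']}")

# "stopConditions" only lists the configured conditions, so check the final status instead:
# an experiment halted by a stop condition (or by stop-experiment) ends as "stopped"
if state["status"] == "stopped":
    print(f"  Stopped early: {state.get('reason', 'no reason reported')}")
elif state["status"] == "completed":
    print("  No stop condition triggered (experiment ran full duration)")
else:
    print(f"  Ended as {state['status']}: {state.get('reason', 'no reason reported')}")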
Expected output:
Experiment: exp-xyz789ghi012
Status: completed
  Action: aws:ec2:stop-instances -> completed
  No stop condition triggered (experiment ran full duration)
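For reports that cover many runs, it can help to reduce each experiment to a small record. The sketch below assumes the experiment dictionary has the shape returned under the "experiment" key by get_experiment, including the startTime and endTime timestamps. The function name and the sample data are illustrative.

```python
from datetime import datetime, timedelta, timezone


def summarize_experiment(experiment: dict) -> dict:
    """Reduce an experiment description to status, duration, and per-action status."""
    start = experiment.get("startTime")
    end = experiment.get("endTime")
    # Duration is only known once the experiment has both started and ended
    duration = (end - start).total_seconds() if start and end else None
    return {
        "id": experiment["id"],
        "status": experiment["state"]["status"],
        "duration_seconds": duration,
        "actions": {
            name: action["state"]["status"]
            for name, action in experiment.get("actions", {}).items()
        },
    }


# Sample data shaped like a get_experiment response (values are made up)
started = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
sample = {
    "id": "exp-xyz789ghi012",
    "state": {"status": "completed"},
    "startTime": started,
    "endTime": started + timedelta(seconds=75),
    "actions": {
        "stopInstance": {
            "actionId": "aws:ec2:stop-instances",
            "state": {"status": "completed"},
        }
    },
}
summary = summarize_experiment(sample)
print(summary)
```

The resulting dictionaries can be collected into a list and written out as JSON or CSV for trend tracking.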
Learning Path
flowchart LR
  A[Gremlin] --> B[AWS Fault Injection]
  B --> C[Azure Chaos Studio]
  C --> D[Latency Injection]
  D --> E[Fault Injection Proxy]
  style B fill:#f90,color:#fff
Common Errors
- Insufficient IAM permissions on the FIS role: The role must include permissions for every action the experiment performs. If a permission such as ec2:TerminateInstances is missing, the action fails; check each action's state and reason in the get-experiment output.
- Missing stop conditions on production experiments: Always configure at least one CloudWatch alarm as a stop condition. Without it experiments may run longer than intended.
- Targeting resources in the wrong AWS region: FIS experiments are region-scoped. Ensure resource ARNs match the region where the experiment template was created.
- Using incorrect resource ARN formats: ARNs must include the region, account ID, and resource ID. Use aws ec2 describe-instances to confirm the instance ID, then build the ARN as arn:aws:ec2:<region>:<account-id>:instance/<instance-id>.
- Not testing stop conditions before running: Create a test alarm that triggers immediately and verify the experiment stops correctly before running real experiments.
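The region and ARN-format mistakes in the list above can be caught with a few lines of string handling. The sketch below assumes the standard six-part ARN layout (arn:partition:service:region:account-id:resource). The helper names are hypothetical.

```python
def parse_arn(arn: str) -> dict:
    """Split an ARN into its named parts; raise ValueError if it is malformed."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not a valid ARN: {arn!r}")
    keys = ("partition", "service", "region", "account_id", "resource")
    return dict(zip(keys, parts[1:]))


def instance_arn(region: str, account_id: str, instance_id: str) -> str:
    """Build an EC2 instance ARN from its components."""
    return f"arn:aws:ec2:{region}:{account_id}:instance/{instance_id}"


def arns_outside_region(arns, region):
    """Return the ARNs whose region differs from the experiment's region."""
    return [arn for arn in arns if parse_arn(arn)["region"] != region]


target_arns = [
    instance_arn("us-east-1", "123456789012", "i-0abc123def456"),
    instance_arn("eu-west-1", "123456789012", "i-0abc123def456"),
]
mismatched = arns_outside_region(target_arns, "us-east-1")
print(mismatched)
```

Running a check like this over a template's resourceArns before creating it flags targets that live in a different region than the experiment.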
Practice Questions
- What IAM role configuration is required before creating an AWS FIS experiment?
- How do stop conditions work in AWS FIS and why are they important?
- What AWS resource types can be targeted with FIS experiments?
- How do you monitor an active FIS experiment from the CLI?
- What happens when a stop condition alarm is triggered during a running experiment?
Challenge
Create an AWS FIS experiment that terminates one EC2 instance in an Auto Scaling group with a minimum of three instances. Configure a CloudWatch alarm on the group's healthy instance count as a stop condition with a threshold of 2. Start the experiment and verify the Auto Scaling group launches a replacement instance within five minutes.
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.