AWS Fault Injection Service — Testing AWS Workloads
This tutorial introduces AWS Fault Injection Service through its key concepts, a worked CLI example, and a list of common errors.
AWS Fault Injection Service (FIS) is a managed chaos engineering service for running fault injection experiments on AWS workloads. It provides pre-built fault actions for EC2, ECS, EKS, RDS, and other AWS services.
What You Will Learn
This tutorial teaches you how to use AWS FIS to create experiments, define action sequences, set safety controls, and run chaos experiments against your AWS infrastructure.
Why It Matters
AWS FIS removes the need to install and maintain your own chaos engineering tools. It integrates natively with AWS IAM, CloudWatch, and Systems Manager. You can run experiments against EC2 instances, ECS tasks, EKS pods, and RDS databases without any third-party agents.
Real-World Use
DodaTech uses AWS FIS to test the resilience of Durga Antivirus Pro scanning clusters running on EC2 Spot Instances. FIS experiments verify that the cluster can absorb instance terminations without interrupting ongoing malware scans.
Prerequisites
Before starting you should understand:
- AWS console navigation and IAM basics
- Chaos engineering concepts (hypothesis, steady state, blast radius)
- How EC2, ECS, or EKS workloads are structured in AWS
Step 1: Set Up IAM Permissions
Create an IAM policy that grants the actions FIS will perform on your resources. ec2:StartInstances is included so that FIS can restart an instance it has stopped:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:TerminateInstances",
        "ec2:StopInstances",
        "ec2:StartInstances",
        "ec2:RebootInstances"
      ],
      "Resource": "arn:aws:ec2:us-east-1:*:instance/*"
    }
  ]
}
Attach this policy to a role named FISExperimentRole, and give the role a trust policy that allows the FIS service principal (fis.amazonaws.com) to assume it.
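Making FIS a trusted entity comes down to a trust policy whose principal is fis.amazonaws.com. The sketch below is one way to script it. The file names fis-trust-policy.json and fis-permissions.json, the inline policy name FISExperimentPolicy, and the create_fis_role helper are assumptions made for illustration.

```shell
# Trust policy that lets the FIS service assume the role.
cat > fis-trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "fis.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

# Hypothetical helper: creates the role, then attaches the permissions
# policy from Step 1 (assumed to be saved as fis-permissions.json).
create_fis_role() {
  aws iam create-role \
    --role-name FISExperimentRole \
    --assume-role-policy-document file://fis-trust-policy.json &&
  aws iam put-role-policy \
    --role-name FISExperimentRole \
    --policy-name FISExperimentPolicy \
    --policy-document file://fis-permissions.json
}

# Usage (requires AWS credentials with IAM permissions):
#   create_fis_role
```

An inline policy keeps the example short; a customer managed policy attached with attach-role-policy would work just as well.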
Step 2: Create an Experiment Template
Create an experiment template in the AWS FIS console or, as shown here, with the AWS CLI:
# Using AWS CLI to create an experiment template
aws fis create-experiment-template \
--cli-input-json file://ec2-stop-template.json
Contents of ec2-stop-template.json:
{
  "description": "Stop a single EC2 instance for 60 seconds",
  "targets": {
    "instanceTarget": {
      "resourceType": "aws:ec2:instance",
      "resourceArns": ["arn:aws:ec2:us-east-1:123456789012:instance/i-0abc123def456"],
      "selectionMode": "ALL"
    }
  },
  "actions": {
    "stopInstance": {
      "actionId": "aws:ec2:stop-instances",
      "parameters": {
        "startInstancesAfterDuration": "PT1M"
      },
      "targets": {
        "Instances": "instanceTarget"
      }
    }
  },
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:FISErrorRateAlarm"
    }
  ],
  "roleArn": "arn:aws:iam::123456789012:role/FISExperimentRole"
}
Expected output:
{
"experimentTemplate": {
"id": "ext-abc123def456",
"description": "Stop a single EC2 instance for 60 seconds"
}
}
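The template above targets one instance by ARN. FIS targets can also be selected by tag, with selectionMode limiting how many matching resources are affected, which helps keep the blast radius small. The fragment below sketches an alternative targets section. The tag ChaosReady=true and the file name tagged-targets.json are assumptions, not AWS defaults.

```shell
# Alternative "targets" section: pick one running instance that carries
# a tag marking it as safe to disrupt (hypothetical tag ChaosReady=true).
cat > tagged-targets.json <<'EOF'
{
  "instanceTarget": {
    "resourceType": "aws:ec2:instance",
    "resourceTags": { "ChaosReady": "true" },
    "filters": [
      { "path": "State.Name", "values": ["running"] }
    ],
    "selectionMode": "COUNT(1)"
  }
}
EOF
```

With COUNT(1), FIS picks a single matching instance at random, so the experiment never touches more than one resource even if many carry the tag.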
Step 3: Set Stop Conditions
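Stop conditions are CloudWatch alarms that halt the experiment automatically when they enter the ALARM state. The template in Step 2 references FISErrorRateAlarm, so the alarm should exist before you start the experiment:
# Create a CloudWatch alarm that will stop the experiment.
# ErrorRate in the MyApp namespace is a placeholder; use a metric that
# your workload actually publishes to CloudWatch.
aws cloudwatch put-metric-alarm \
--alarm-name FISErrorRateAlarm \
--alarm-description "Stop FIS experiment if error rate exceeds 5%" \
--metric-name ErrorRate \
--namespace MyApp \
--statistic Average \
--period 60 \
--threshold 5.0 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 1
# Expected output (no output on success)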
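If the stop-condition alarm is already in the ALARM state, the experiment is likely to be stopped as soon as it starts, so it can help to check the alarm first. The sketch below assumes the alarm name used in this tutorial; alarm_state and alarm_is_ok are hypothetical helper names.

```shell
# Hypothetical helper: prints the current state of a CloudWatch alarm
# (OK, ALARM, or INSUFFICIENT_DATA).
alarm_state() {
  aws cloudwatch describe-alarms \
    --alarm-names "$1" \
    --query 'MetricAlarms[0].StateValue' \
    --output text
}

# Hypothetical helper: succeeds only when the alarm is in the OK state.
alarm_is_ok() {
  [ "$(alarm_state "$1")" = "OK" ]
}

# Usage:
#   alarm_is_ok FISErrorRateAlarm || echo "Alarm is not OK; do not start the experiment"
```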
Step 4: Start the Experiment
Run the experiment from the console or CLI:
aws fis start-experiment \
--experiment-template-id ext-abc123def456
# Expected output:
# {
# "experiment": {
# "id": "exp-xyz789ghi012",
# "experimentTemplateId": "ext-abc123def456",
# "state": {
# "status": "running"
# }
# }
# }
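In a script, only the experiment ID is usually needed for the monitoring step. A minimal sketch using the CLI's --query option follows; start_experiment is a hypothetical helper name.

```shell
# Hypothetical helper: starts an experiment from a template ID and prints
# only the new experiment ID.
start_experiment() {
  aws fis start-experiment \
    --experiment-template-id "$1" \
    --query 'experiment.id' \
    --output text
}

# Usage:
#   EXPERIMENT_ID=$(start_experiment ext-abc123def456)
```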
Step 5: Monitor the Experiment
Track the experiment status in real time:
aws fis get-experiment \
--id exp-xyz789ghi012
# Expected output (abbreviated) once the experiment finishes:
# {
#   "experiment": {
#     "id": "exp-xyz789ghi012",
#     "state": {
#       "status": "completed"
#     },
#     "actions": {
#       "stopInstance": {
#         "actionId": "aws:ec2:stop-instances",
#         "state": {
#           "status": "completed"
#         }
#       }
#     }
#   }
# }
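Running get-experiment by hand works for a single check. For unattended runs, a polling loop is one option. The sketch below treats completed, stopped, failed, and cancelled as terminal statuses; wait_for_experiment and its 10-second default interval are assumptions.

```shell
# Hypothetical helper: polls an experiment until it reaches a terminal
# status, prints that status, and succeeds only if it is "completed".
# $1 = experiment ID, $2 = seconds between polls (default 10).
wait_for_experiment() {
  while :; do
    exp_status=$(aws fis get-experiment \
      --id "$1" \
      --query 'experiment.state.status' \
      --output text) || return 1
    case "$exp_status" in
      completed|stopped|failed|cancelled)
        echo "$exp_status"
        if [ "$exp_status" = "completed" ]; then return 0; else return 1; fi
        ;;
    esac
    sleep "${2:-10}"
  done
}

# Usage:
#   wait_for_experiment exp-xyz789ghi012 || echo "Experiment did not complete cleanly"
```

To halt a running experiment manually instead of waiting, use aws fis stop-experiment --id with the experiment ID.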
Learning Path
flowchart LR
    A[Gremlin Platform] --> B[AWS Fault Injection Service]
    B --> C[Azure Chaos Studio]
    C --> D[Latency Injection]
    D --> E[Fault Injection Proxy]
    style B fill:#f90,color:#fff
Common Errors
- Insufficient IAM permissions for the FIS role: The FIS role must have permissions for the actions it will perform. Check the IAM policy carefully.
- Missing or misconfigured stop conditions: Without stop conditions an experiment might run longer than intended. Always configure at least one CloudWatch alarm.
- Targeting production resources accidentally: Double-check resource ARNs before starting an experiment. Use tags to identify safe targets.
- Experiment fails because target resource is already stopped: FIS cannot stop an already stopped instance. Verify the target state before starting.
- Cross-region resource ARN mismatch: Ensure resource ARNs match the region where the experiment runs. ARNs are region-specific.
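Two of the errors above, a region mismatch and an already stopped target, can be caught before the experiment starts. The following is a sketch of such a pre-flight check; arn_region, instance_state, and preflight are hypothetical helper names.

```shell
# Hypothetical helper: prints the region, which is the fourth
# colon-separated field of an ARN.
arn_region() {
  echo "$1" | cut -d: -f4
}

# Hypothetical helper: prints an instance's state (for example "running").
# $1 = instance ID, $2 = region.
instance_state() {
  aws ec2 describe-instances \
    --instance-ids "$1" \
    --region "$2" \
    --query 'Reservations[0].Instances[0].State.Name' \
    --output text
}

# Hypothetical pre-flight check.
# $1 = target instance ARN, $2 = region the experiment runs in.
preflight() {
  if [ "$(arn_region "$1")" != "$2" ]; then
    echo "ARN region does not match $2" >&2
    return 1
  fi
  instance_id=${1##*/}   # text after the last "/" in the ARN
  if [ "$(instance_state "$instance_id" "$2")" != "running" ]; then
    echo "Instance $instance_id is not running" >&2
    return 1
  fi
}

# Usage:
#   preflight arn:aws:ec2:us-east-1:123456789012:instance/i-0abc123def456 us-east-1
```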
Practice Questions
- What IAM role configuration is required for AWS FIS experiments?
- How do stop conditions work in AWS FIS?
- What AWS resources can you target with FIS experiments?
- How do you create an experiment template using the AWS CLI?
- What happens when a stop condition alarm is triggered?
Challenge
Create an AWS FIS experiment that terminates one EC2 instance in an Auto Scaling group. Configure a CloudWatch alarm on the group's healthy instance count as a stop condition. Start the experiment and verify that the Auto Scaling group launches a replacement instance within five minutes.
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro