AWS Chaos Pipeline — Automated FIS Experiments with CI/CD

DodaTech Updated 2026-06-23 6 min read

This tutorial shows how to run AWS FIS experiments automatically as part of a CI/CD pipeline. It covers the key concepts, a working example for each step, and the most common mistakes to avoid.

Building an automated chaos engineering pipeline on AWS means connecting AWS Fault Injection Service (FIS) to your deployment pipeline, scheduling recurring experiments with EventBridge, and automating response actions with Lambda functions.

What You Will Learn

This tutorial teaches you how to build a fully automated chaos pipeline on AWS using FIS experiment templates, EventBridge schedules, Lambda-driven experiment triggers, and automated rollback when resilience checks fail.

Why It Matters

Manual experiments do not scale and are easy to skip when things get busy. An automated pipeline ensures every deployment goes through the same resilience checks. It catches regressions early and provides a consistent, auditable record of resilience testing for every release.

Real-World Use

DodaTech runs an automated FIS experiment every time a new version of the Durga Antivirus Pro scanning engine is deployed to the staging cluster. If the experiment fails, the pipeline automatically reverts to the previous version and pages the on-call engineer.

Prerequisites

Before starting you should understand:

  • AWS FIS basics from introductory tutorials
  • Chaos Engineering experiment design
  • AWS Lambda and EventBridge concepts
  • Basic CI/CD pipeline design

Step 1: Create an FIS Experiment Template for Automation

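Design an experiment template that selects its targets by tag, so the same template can be reused across environments by changing tag values. Save it as auto-fis-template.json; the account ID, alarm name, and role name in the ARNs are placeholders: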

{
  "description": "Automated EC2 instance termination test",
  "targets": {
    "instanceTarget": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": {
        "ChaosTarget": "auto-test",
        "Environment": "staging"
      },
      "filters": [
        {
          "path": "State.Name",
          "values": ["running"]
        }
      ],
      "selectionMode": "COUNT(1)"
    }
  },
  "actions": {
    "terminateInstance": {
      "actionId": "aws:ec2:terminate-instances",
      "parameters": {},
      "targets": {
        "Instances": "instanceTarget"
      }
    }
  },
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:FIS-Error-Rate-Alarm"
    }
  ],
  "roleArn": "arn:aws:iam::123456789012:role/FISExperimentRole"
}

Create the template from the saved file:

aws fis create-experiment-template \
  --cli-input-json file://auto-fis-template.json

# Expected output:
# {
#     "experimentTemplate": {
#         "id": "ext-auto-001",
#         "description": "Automated EC2 instance termination test"
#     }
# }
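
One way to parameterize the template per environment is to generate the JSON from a small helper instead of editing it by hand. The following is a minimal sketch, not part of the original tutorial: the function name build_fis_template and its arguments are hypothetical, and the structure simply mirrors the template shown above.

```python
import json


def build_fis_template(environment, alarm_arns, role_arn, count=1):
    """Return an FIS experiment template dict for one environment.

    The function and argument names are illustrative assumptions; the
    fields mirror the JSON template from Step 1.
    """
    return {
        "description": f"Automated EC2 instance termination test ({environment})",
        "targets": {
            "instanceTarget": {
                "resourceType": "aws:ec2:instance",
                # Targets are selected by tag, so no instance ARNs are hardcoded
                "resourceTags": {
                    "ChaosTarget": "auto-test",
                    "Environment": environment,
                },
                "filters": [{"path": "State.Name", "values": ["running"]}],
                "selectionMode": f"COUNT({count})",
            }
        },
        "actions": {
            "terminateInstance": {
                "actionId": "aws:ec2:terminate-instances",
                "parameters": {},
                "targets": {"Instances": "instanceTarget"},
            }
        },
        # One stop condition per guardrail alarm
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": arn} for arn in alarm_arns
        ],
        "roleArn": role_arn,
    }


if __name__ == "__main__":
    template = build_fis_template(
        environment="staging",
        alarm_arns=["arn:aws:cloudwatch:us-east-1:123456789012:alarm:FIS-Error-Rate-Alarm"],
        role_arn="arn:aws:iam::123456789012:role/FISExperimentRole",
    )
    # Write this output to auto-fis-template.json and pass it to the CLI
    print(json.dumps(template, indent=2))
```

Keeping the template in code makes it easy to review changes to the blast radius (the COUNT value and the tag filter) alongside the application code.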

Step 2: Trigger Experiments with EventBridge and Lambda

Create a Lambda function that starts experiments in response to deployment events:

#!/usr/bin/env python3
"""Lambda function to trigger FIS experiment after deployment."""
import boto3
import json
import os
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

fis = boto3.client("fis")

# ID of the experiment template created in Step 1. The stop-condition alarm
# is already part of the template, so it does not need to be passed here.
EXPERIMENT_TEMPLATE_ID = os.environ["FIS_TEMPLATE_ID"]

def lambda_handler(event, context):
    logger.info(f"Received event: {json.dumps(event)}")

    try:
        response = fis.start_experiment(
            experimentTemplateId=EXPERIMENT_TEMPLATE_ID,
            tags={
                "TriggeredBy": "DeploymentPipeline",
                "Environment": "staging"
            }
        )

        experiment_id = response["experiment"]["id"]
        logger.info(f"Started experiment: {experiment_id}")

        return {
            "statusCode": 200,
            "body": json.dumps({
                "experimentId": experiment_id,
                # Report the state FIS returned instead of assuming "running"
                "status": response["experiment"]["state"]["status"]
            })
        }

    except Exception as e:
        logger.error(f"Failed to start experiment: {str(e)}")
        return {
            "statusCode": 500,
            "body": json.dumps({"error": str(e)})
        }

# Expected CloudWatch log output:
# START RequestId: abc-123
# Received event: {"detail-type": "CodePipeline Pipeline Execution State Change"}
# Started experiment: exp-xyz-789
# END RequestId: abc-123
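
The handler above starts an experiment for every event it receives. If the EventBridge rule forwards all pipeline state changes, you may want to filter inside the function as well. The sketch below is an assumption about how that check could look: should_start_experiment is a hypothetical helper, and the field names follow the CodePipeline execution state change event format.

```python
def should_start_experiment(event, pipeline_name):
    """Return True only for a successful execution of the given pipeline.

    Hypothetical helper: assumes the event is a CodePipeline
    "Pipeline Execution State Change" event delivered by EventBridge.
    """
    detail = event.get("detail", {})
    return (
        event.get("source") == "aws.codepipeline"
        and event.get("detail-type") == "CodePipeline Pipeline Execution State Change"
        and detail.get("pipeline") == pipeline_name
        and detail.get("state") == "SUCCEEDED"
    )


if __name__ == "__main__":
    sample = {
        "source": "aws.codepipeline",
        "detail-type": "CodePipeline Pipeline Execution State Change",
        "detail": {"pipeline": "staging-deploy", "state": "SUCCEEDED"},
    }
    print(should_start_experiment(sample, "staging-deploy"))
```

Calling this at the top of lambda_handler and returning early when it is False keeps failed or unrelated pipeline runs from triggering a fault injection.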

Step 3: Schedule Weekly Experiments with EventBridge

Run experiments on a recurring schedule. The cron expression below fires every Sunday at 03:00 UTC:

# Create an EventBridge rule for weekly experiments
aws events put-rule \
  --name WeeklyChaosExperiment \
  --schedule-expression "cron(0 3 ? * SUN *)" \
  --state ENABLED

# Expected output:
# {
#     "RuleArn": "arn:aws:events:us-east-1:123456789012:rule/WeeklyChaosExperiment"
# }

# Register the Lambda function as a target
aws events put-targets \
  --rule WeeklyChaosExperiment \
  --targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:123456789012:function:RunChaosExperiment"

# Expected output:
# {
#     "FailedEntryCount": 0,
#     "FailedEntries": []
# }
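
Registering the target is not enough on its own: the Lambda function also needs a resource-based policy statement that lets EventBridge invoke it. A minimal sketch of that grant with boto3 follows. The helper names and the statement ID are hypothetical; add_permission is the Lambda API call that attaches the statement.

```python
def build_invoke_permission(function_name, rule_arn,
                            statement_id="AllowEventBridgeInvoke"):
    """Build the arguments for the Lambda add_permission call.

    Helper name and default statement ID are illustrative assumptions.
    """
    return {
        "FunctionName": function_name,
        "StatementId": statement_id,
        "Action": "lambda:InvokeFunction",
        "Principal": "events.amazonaws.com",
        # Restrict the grant to this one rule
        "SourceArn": rule_arn,
    }


def grant_invoke(lambda_client, function_name, rule_arn):
    """Attach the permission using a boto3 Lambda client passed in by the caller."""
    return lambda_client.add_permission(
        **build_invoke_permission(function_name, rule_arn)
    )
```

In a real script you would call grant_invoke(boto3.client("lambda"), "RunChaosExperiment", rule_arn), using the RuleArn returned by put-rule.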

Step 4: Build a CI/CD Pipeline Step

Integrate FIS experiments into a CodePipeline stage:

# buildspec-chaos.yaml
version: 0.2
phases:
  install:
    commands:
      - pip install boto3 pyyaml
  pre_build:
    commands:
      - echo "Checking experiment template exists..."
      # get-experiment-template exits non-zero if the template is missing, which fails the build early
      - aws fis get-experiment-template --id ext-auto-001
  build:
    commands:
      - echo "Starting chaos experiment..."
      - EXPERIMENT_ID=$(aws fis start-experiment --experiment-template-id ext-auto-001 --query 'experiment.id' --output text)
      - echo "Experiment started: $EXPERIMENT_ID"
      - echo "Waiting for experiment to complete..."
      - |
        while true; do
          STATUS=$(aws fis get-experiment --id $EXPERIMENT_ID --query 'experiment.state.status' --output text)
          echo "Status: $STATUS"
          if [ "$STATUS" = "completed" ]; then
            echo "Experiment completed successfully"
            break
          elif [ "$STATUS" = "failed" ] || [ "$STATUS" = "stopped" ]; then
            echo "Experiment failed or was stopped"
            exit 1
          fi
          sleep 10
        done
  post_build:
    commands:
      - echo "Chaos experiment validation complete"
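
The shell loop above polls forever if the experiment never reaches a terminal state. The same wait can be sketched in Python with an explicit deadline. This is an illustration under assumptions, not the tutorial's own code: wait_for_experiment and its parameters are hypothetical, the FIS client is passed in by the caller, and the terminal statuses mirror the ones checked in the buildspec.

```python
import time

# Statuses treated as terminal, mirroring the buildspec loop
SUCCESS_STATUSES = {"completed"}
FAILURE_STATUSES = {"failed", "stopped"}


def wait_for_experiment(fis_client, experiment_id, timeout_s=1800, poll_s=10,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll an FIS experiment until it finishes or the deadline passes.

    Returns True on success, False on failure or stop, and raises
    TimeoutError if no terminal status is seen within timeout_s seconds.
    """
    deadline = clock() + timeout_s
    while True:
        response = fis_client.get_experiment(id=experiment_id)
        status = response["experiment"]["state"]["status"]
        if status in SUCCESS_STATUSES:
            return True
        if status in FAILURE_STATUSES:
            return False
        if clock() >= deadline:
            raise TimeoutError(
                f"Experiment {experiment_id} still '{status}' after {timeout_s}s"
            )
        sleep(poll_s)
```

Passing sleep and clock as arguments keeps the function testable without real waiting. In a build step, exit non-zero when it returns False or raises.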

Step 5: Automate Rollback on Failure

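Add a Lambda function that reacts when a guardrail alarm fires. It pages the on-call engineer and stops the in-progress pipeline execution so the release is not promoted; reverting to the previous version is then handled by your deployment tooling: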

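#!/usr/bin/env python3
"""Lambda to handle FIS guardrail alarm notifications."""
import boto3
import logging
import os

logger = logging.getLogger()
logger.setLevel(logging.INFO)

codepipeline = boto3.client("codepipeline")
sns = boto3.client("sns")

SNS_TOPIC_ARN = os.environ["ONCALL_TOPIC_ARN"]
PIPELINE_NAME = os.environ["DEPLOYMENT_PIPELINE"]

def lambda_handler(event, context):
    # The event is a CloudWatch Alarm State Change delivered by EventBridge
    alarm_name = event["detail"]["alarmName"]
    new_state = event["detail"]["state"]["value"]

    if new_state != "ALARM":
        return {"status": "no_action"}

    logger.info(f"Guardrail breached: {alarm_name}")
    message = (
        f"Chaos experiment guardrail triggered.\n"
        f"Alarm: {alarm_name}\n"
        f"Pipeline: {PIPELINE_NAME}\n"
        f"Initiating automated rollback."
    )

    # Notify the on-call engineer
    sns.publish(
        TopicArn=SNS_TOPIC_ARN,
        Subject="Chaos Experiment Guardrail Breach",
        Message=message
    )
    logger.info(f"On-call engineer notified via SNS topic {SNS_TOPIC_ARN}")

    # The alarm event carries no pipeline execution ID, so look up the
    # executions that are still in progress and stop each of them.
    executions = codepipeline.list_pipeline_executions(
        pipelineName=PIPELINE_NAME
    )["pipelineExecutionSummaries"]

    stopped = []
    for execution in executions:
        if execution["status"] == "InProgress":
            codepipeline.stop_pipeline_execution(
                pipelineName=PIPELINE_NAME,
                pipelineExecutionId=execution["pipelineExecutionId"],
                abandon=True,
                reason=f"Guardrail alarm {alarm_name} fired"
            )
            stopped.append(execution["pipelineExecutionId"])

    logger.info(f"Stopped pipeline executions: {stopped}")
    return {"status": "rollback_initiated", "stoppedExecutions": stopped}

# Expected CloudWatch log output:
# Guardrail breached: FIS-Error-Rate-Alarm
# On-call engineer notified via SNS topic arn:...
# Stopped pipeline executions: [...]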
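
The handler reads event["detail"]["alarmName"] and event["detail"]["state"]["value"], which is the shape of a CloudWatch alarm state change event. To route only the guardrail alarm to this function, the EventBridge rule needs a matching event pattern. A minimal sketch, assuming a hypothetical helper name and the alarm name used in Step 1:

```python
import json


def build_alarm_event_pattern(alarm_names):
    """Return an EventBridge event pattern matching the given alarms in ALARM state.

    Helper name is an illustrative assumption.
    """
    return {
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        "detail": {
            "alarmName": list(alarm_names),
            # Only transitions into ALARM should trigger the handler
            "state": {"value": ["ALARM"]},
        },
    }


if __name__ == "__main__":
    pattern = build_alarm_event_pattern(["FIS-Error-Rate-Alarm"])
    # Pass this JSON string as the EventPattern of an EventBridge rule
    print(json.dumps(pattern, indent=2))
```

Filtering on the ALARM state in the rule means the function is not invoked for OK or INSUFFICIENT_DATA transitions, although the handler still checks the state defensively.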

Learning Path

flowchart LR
  A[AWS FIS Basics] --> B[AWS Chaos Pipeline]
  B --> C[Azure Chaos Pipeline]
  C --> D[Kubernetes Chaos Testing]
  D --> E[Chaos Observability]
  style B fill:#f90,color:#fff

Common Errors

  1. Missing IAM pass role permissions: The principal that creates the experiment template needs iam:PassRole permission for the FIS experiment role (the roleArn in the template). Without it the request fails with an authorization error.
  2. Experiment template references hardcoded resource ARNs: Use tags and selectionMode COUNT instead of hardcoded ARNs so the template works across environments.
  3. EventBridge rule without proper resource-based policy: The rule target (Lambda) must have a resource-based policy allowing EventBridge to invoke it.
  4. Pipeline timeout shorter than experiment duration: Ensure the timeout of the build step that runs the experiment (for example, the CodeBuild build timeout) exceeds the maximum experiment duration plus a buffer for startup and teardown.
  5. Not handling experiment already running errors: Use idempotency checks or concurrency limits to prevent overlapping automated experiments.
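
For the last item, one simple guard is to check whether an experiment from the same template is still active before starting another. The sketch below is an assumption about how that check could look: has_active_experiment is a hypothetical helper, the FIS client is supplied by the caller, and the set of active statuses is my own choice.

```python
# Statuses treated as "still active" (an assumption for this sketch)
ACTIVE_STATUSES = {"pending", "initiating", "running", "stopping"}


def has_active_experiment(fis_client, template_id):
    """Return True if any experiment from template_id has not finished yet."""
    token = None
    while True:
        # Follow pagination until every page has been inspected
        kwargs = {"nextToken": token} if token else {}
        page = fis_client.list_experiments(**kwargs)
        for experiment in page.get("experiments", []):
            same_template = experiment.get("experimentTemplateId") == template_id
            status = experiment.get("state", {}).get("status")
            if same_template and status in ACTIVE_STATUSES:
                return True
        token = page.get("nextToken")
        if not token:
            return False
```

A check followed by a start is not atomic, so two triggers firing at the same moment can still overlap. Where that matters, combine this with a lock such as a conditional write to a DynamoDB item.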

Practice Questions

  1. How do you parameterize FIS experiment templates for different environments?
  2. What AWS services can trigger automated FIS experiments?
  3. How does a CI/CD pipeline wait for a chaos experiment to complete?
  4. What happens when a guardrail alarm fires during an automated experiment?
  5. Why should you use resource tags instead of ARNs in experiment templates?

Challenge

Build a complete CI/CD pipeline with CodePipeline that deploys an application to an EC2 Auto Scaling group and then runs an FIS experiment that terminates one instance. Configure a CloudWatch alarm on the healthy instance count. If the alarm fires, the pipeline must stop and send an SNS notification. If the experiment passes, the pipeline promotes the build to production.

FAQ

What is an automated chaos pipeline?

An automated chaos pipeline runs chaos experiments as part of the CI/CD process, automatically triggering experiments after deployments and blocking or rolling back releases that fail resilience checks.

What AWS services are needed for an automated chaos pipeline?

AWS FIS for fault injection, EventBridge for scheduling and event triggers, Lambda for custom logic, CloudWatch for monitoring, and CodePipeline for CI/CD orchestration.

How do I prevent overlapping automated experiments?

Check for experiments that are still active (for example with the FIS ListExperiments API) before starting a new one, or use a DynamoDB-based state lock in the Lambda trigger function to prevent multiple experiments from running simultaneously.

What metrics should I monitor during automated experiments?

Monitor error rate, p99 latency, CPU utilization, memory pressure, and application-specific SLOs. Configure CloudWatch alarms on each metric as FIS stop conditions.

Can I run automated experiments in production?

Yes, with conservative blast radius settings, proper stop conditions, staged rollout (first in staging), and a proven track record of passing experiments in lower environments.
