Game Days — Running Chaos Drills with Your Team
This tutorial covers game days: what they are, how to plan and run one, and how to review the results afterward.
A game day is a scheduled, structured chaos engineering drill where your team simulates a real incident to practice response procedures, test runbooks, and build confidence. Unlike automated chaos experiments, game days focus on the human and process aspects of reliability.
What You Will Learn
This tutorial teaches you how to plan, run, and review game days that improve your team's incident response capabilities.
Why It Matters
When a real outage hits, there is no time to read the runbook for the first time. Game days build muscle memory. They reveal gaps in monitoring, communication channels, and escalation paths before a real incident occurs.
Real-World Use
DodaTech runs a quarterly game day where one team member plays the role of Chaos Monkey and injects faults while the on-call team responds. The incident commander role rotates with each game day so that every engineer gets experience leading a response.
Prerequisites
Before starting, you should be familiar with:
- Basic chaos engineering concepts and experiment design
- Incident response fundamentals (alerting, escalation, communication)
- How your team currently handles on-call rotations
- Kubernetes debugging commands
Step 1: Define Game Day Objectives
Every game day needs specific learning goals. Avoid generic goals like "practice incident response." Instead, set measurable objectives.
# game-day-objectives.yaml
objectives:
  - "Test the database failover runbook within 10 minutes"
  - "Verify that the on-call engineer escalates correctly when incident severity is P1"
  - "Confirm that Slack alerts reach the correct channel within 2 minutes of fault injection"
  - "Identify any gaps in the deployment rollback procedure"
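One way to keep objectives measurable is to attach a time limit to each and compare it with what was actually observed during the drill. The following is a minimal Python sketch of that idea. The `Objective` class, its field names, and the sample timings are assumptions made for illustration and do not belong to any particular tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Objective:
    """A game day objective with an optional time limit (hypothetical structure)."""
    description: str
    deadline_minutes: Optional[float] = None  # None means "no time limit"

    def is_met(self, observed_minutes: Optional[float]) -> bool:
        """Return True if the step happened, and happened within the deadline."""
        if observed_minutes is None:  # the step never happened during the drill
            return False
        if self.deadline_minutes is None:
            return True
        return observed_minutes <= self.deadline_minutes


objectives = [
    Objective("Database failover runbook completed", deadline_minutes=10),
    Objective("Slack alert reached the correct channel", deadline_minutes=2),
]

# Illustrative timings in minutes since fault injection, as a scribe might record them.
observed = [8.5, 3.0]

results = [obj.is_met(t) for obj, t in zip(objectives, observed)]
for obj, ok in zip(objectives, results):
    print(f"{'PASS' if ok else 'FAIL'}: {obj.description}")
```

An objective such as "identify any gaps in the rollback procedure" has no natural deadline, which is why the time limit is optional here.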
Step 2: Prepare the Scenario
Write a realistic scenario that matches the objectives. Include the injection plan, expected system behavior, and notes for the facilitator.
# scenario-db-failover.yaml
scenario:
  title: "Primary Database Unreachable"
  description: "The primary PostgreSQL instance becomes unresponsive due to a network partition"
  injection:
    - fault: network-partition
      target: postgres-primary
      duration: 15m
      intensity: complete packet loss
  expected_response:
    - "On-call receives alert within 2 minutes"
    - "Read replica promoted to primary within 10 minutes"
    - "Application switches to read-only mode gracefully"
  facilitator_notes:
    - "Do not tell the team which fault will be injected"
    - "If the team fails to promote the replica within 15 minutes call a timeout"
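The time limits in a scenario like this can be turned into a checkpoint list that the facilitator follows during the drill. The sketch below assumes durations are written as a whole number followed by `s`, `m`, or `h` (as in `15m`). The helper names `parse_minutes` and `build_schedule` are hypothetical, and the checkpoints are copied by hand from the scenario rather than parsed from the YAML file.

```python
import re


def parse_minutes(duration: str) -> float:
    """Convert strings such as '90s', '15m', or '1h' to minutes."""
    match = re.fullmatch(r"(\d+)([smh])", duration.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {duration!r}")
    value, unit = int(match.group(1)), match.group(2)
    if unit == "s":
        return value / 60
    if unit == "h":
        return value * 60.0
    return float(value)


# Checkpoints taken from the scenario: (label, offset after fault injection).
checkpoints = [
    ("Call a timeout if the replica is not promoted", "15m"),
    ("On-call should have received the alert", "2m"),
    ("Read replica should be promoted", "10m"),
]


def build_schedule(items):
    """Return (minutes, label) pairs sorted by time since injection."""
    return sorted((parse_minutes(offset), label) for label, offset in items)


schedule = build_schedule(checkpoints)
for minutes, label in schedule:
    print(f"T+{minutes:g}m  {label}")
```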
Step 3: Schedule and Communicate
Send a calendar invitation with the game day details at least one week in advance.
Game Day Notification Template:
What: Database Failover Game Day
When: 2026-06-28 14:00 UTC (duration 60 minutes)
Where: #incident-response Slack channel
Roles:
- Incident Commander: Alice
- Scribe: Bob
- Facilitator: Charlie
- Participants: Entire platform team
Expected outcomes: Tested failover runbook, identified gaps
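If you send these notifications regularly, a small script can fill in the template and check the one-week lead time. This is only a sketch: the function names and the send time are assumptions for illustration, while the event details are taken from the template above.

```python
from datetime import datetime, timedelta, timezone

MIN_LEAD_TIME = timedelta(days=7)  # "at least one week in advance"


def has_enough_notice(sent_at: datetime, starts_at: datetime) -> bool:
    """Return True if the invitation goes out at least a week before the drill."""
    return starts_at - sent_at >= MIN_LEAD_TIME


def render_notification(what, starts_at, duration_minutes, channel, roles):
    """Fill in the plain-text notification template."""
    lines = [
        f"What: {what}",
        f"When: {starts_at:%Y-%m-%d %H:%M} UTC (duration {duration_minutes} minutes)",
        f"Where: {channel}",
        "Roles:",
    ]
    lines += [f"- {role}: {name}" for role, name in roles.items()]
    return "\n".join(lines)


starts_at = datetime(2026, 6, 28, 14, 0, tzinfo=timezone.utc)
sent_at = datetime(2026, 6, 19, 9, 0, tzinfo=timezone.utc)  # hypothetical send time

message = render_notification(
    what="Database Failover Game Day",
    starts_at=starts_at,
    duration_minutes=60,
    channel="#incident-response Slack channel",
    roles={"Incident Commander": "Alice", "Scribe": "Bob", "Facilitator": "Charlie"},
)
print(message)
print("Enough notice:", has_enough_notice(sent_at, starts_at))
```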
Step 4: Execute the Game Day
The facilitator injects the fault while the team responds. The scribe documents everything.
# Facilitator injects the fault (hidden from participants)
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: game-day-db-partition
spec:
  action: partition
  mode: all
  selector:
    namespaces: ["production"]
    labelSelectors:
      app: postgres-primary
  direction: both
  duration: 15m
EOF
Expected output visible only to the facilitator:
networkchaos.chaos-mesh.org/game-day-db-partition created
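While the fault is active, the scribe's notes are most useful when each entry records how long after injection it happened, because those numbers are what you later compare against the scenario's expected response. The sketch below shows one possible shape for such a log. The `ScribeLog` class and the sample events are hypothetical, and entries are assumed to be recorded in chronological order.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class ScribeLog:
    """Timestamped notes, stored as minutes elapsed since fault injection."""
    injected_at: datetime
    entries: list = field(default_factory=list)

    def record(self, at: datetime, note: str) -> None:
        elapsed = (at - self.injected_at).total_seconds() / 60
        self.entries.append((elapsed, note))

    def first_time(self, keyword: str):
        """Minutes until the first note containing keyword, or None if absent."""
        for elapsed, note in self.entries:
            if keyword.lower() in note.lower():
                return elapsed
        return None


t0 = datetime(2026, 6, 28, 14, 5, tzinfo=timezone.utc)  # illustrative injection time
log = ScribeLog(injected_at=t0)
log.record(t0 + timedelta(minutes=1, seconds=30), "Alert fired in #incident-response")
log.record(t0 + timedelta(minutes=4), "Incident commander declared P1")
log.record(t0 + timedelta(minutes=9), "Read replica promoted to primary")

for elapsed, note in log.entries:
    print(f"T+{elapsed:.1f}m  {note}")
```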
Step 5: Run the Retrospective
Immediately after the game day ends, hold a blameless retrospective. Focus on process improvements, not individual performance.
# retrospective-template.yaml
retrospective:
  what_went_well:
    - "Runbook was found within 30 seconds"
    - "Incident commander declared severity correctly"
  what_could_improve:
    - "Database credentials were not accessible to the secondary on-call"
    - "Slack alert channel had notifications muted"
  action_items:
    - "Store database credentials in a shared password manager accessible to all on-call engineers"
    - "Verify Slack notification settings for incident channels weekly"
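Action items only help if someone follows up on them. One lightweight approach is to give each item an owner and a due date and check for overdue items before the next game day. The sketch below assumes that structure. The `ActionItem` class, the owners, and the dates are all made up for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    """A retrospective action item with an owner and due date (hypothetical)."""
    description: str
    owner: str
    due: date
    done: bool = False


def overdue(items, today):
    """Return the items that are still open after their due date."""
    return [item for item in items if not item.done and item.due < today]


items = [
    ActionItem(
        "Store database credentials in a shared password manager",
        owner="Alice",
        due=date(2026, 7, 5),
    ),
    ActionItem(
        "Verify Slack notification settings for incident channels",
        owner="Bob",
        due=date(2026, 7, 12),
        done=True,
    ),
]

late = overdue(items, today=date(2026, 7, 10))
for item in late:
    print(f"OVERDUE ({item.owner}): {item.description}")
```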
Learning Path
flowchart LR
    A[Designing Experiments] --> B[Game Days]
    B --> C[Chaos Mesh Platform]
    C --> D[LitmusChaos]
    D --> E[Automated Pipeline]
    style B fill:#f90,color:#fff
Common Errors
- Making game days a surprise: Game days are drills, not real incidents. Participants should know when the drill occurs but not which fault will be injected.
- Blaming individuals during retrospectives: Game days are about finding process gaps, not evaluating people. Keep retrospectives blameless.
- Running game days too infrequently: Once a quarter is the minimum. Monthly is better. Skills atrophy without practice.
- Not rotating the incident commander role: Everyone should practice leading. If the same person always takes command, you have a single point of failure.
- Skipping the scribe role: Without documentation, the lessons learned disappear. Always assign a scribe.
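To avoid the rotation problem, you can plan incident commander assignments ahead of time with a simple round-robin. The sketch below is one way to do that. The function name, the engineer list, and the quarter labels are assumptions for illustration.

```python
from itertools import cycle, islice


def rotate_commanders(engineers, game_days):
    """Pair each game day with the next engineer in round-robin order."""
    if not engineers:
        raise ValueError("at least one engineer is required")
    return list(zip(game_days, islice(cycle(engineers), len(game_days))))


engineers = ["Alice", "Bob", "Charlie"]
game_days = ["2026-Q1", "2026-Q2", "2026-Q3", "2026-Q4"]  # illustrative schedule

rotation = rotate_commanders(engineers, game_days)
for day, commander in rotation:
    print(f"{day}: incident commander is {commander}")
```

With three engineers and quarterly drills, each person leads only about once every nine months, which is one argument for running game days more often than the quarterly minimum.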
Practice Questions
- What is the difference between a game day and an automated chaos experiment?
- Why should game day scenarios be kept secret from participants?
- What roles should be assigned for a game day?
- How do you measure the success of a game day?
- What is a blameless retrospective, and why is it important?
Challenge
Plan a complete game day for a microservices application with five services. Write the scenario, assign roles, prepare the fault injection, define success metrics, and create a retrospective template. Run the game day with your team and document the findings.