Game Days — Running Chaos Drills with Your Team


DodaTech Updated 2026-06-21 5 min read

In this tutorial, you'll learn about Game Days. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.

A game day is a scheduled, structured chaos engineering drill in which your team simulates a real incident to practice response procedures, test runbooks, and build confidence. Unlike automated chaos experiments, game days focus on the human and process aspects of reliability.

What You Will Learn

This tutorial teaches you how to plan, execute, and review game days that improve your team's incident response capabilities.

Why It Matters

When a real outage hits, there is no time to read the runbook for the first time. Game days build muscle memory. They reveal gaps in monitoring, communication channels, and escalation paths before a real incident occurs.

Real-World Use

DodaTech runs a quarterly game day where one team member plays the role of Chaos Monkey and injects faults while the on-call team responds. The incident commander role rotates from one game day to the next so every engineer gets experience leading a response.

Prerequisites

Before starting, you should be familiar with:

Basic chaos engineering concepts and experiment design
Incident response fundamentals (alerting, escalation, communication)
How your team currently handles on-call rotations
Kubernetes debugging commands

Step 1: Define Game Day Objectives

Every game day needs specific learning goals. Avoid generic goals like "practice incident response." Instead, set measurable objectives.

# game-day-objectives.yaml
objectives:
  - "Test the database failover runbook within 10 minutes"
  - "Verify that the on-call engineer escalates correctly when incident severity is P1"
  - "Confirm that Slack alerts reach the correct channel within 2 minutes of fault injection"
  - "Identify any gaps in the deployment rollback procedure"
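
One way to keep objectives measurable is to store the time limit as a number next to the description, so the result can be checked after the drill. The sketch below is a minimal illustration in Python; the `TimedObjective` class and the observed values are hypothetical and only the 10-minute and 2-minute limits come from the objectives above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimedObjective:
    """A game day objective with an explicit time limit (hypothetical structure)."""
    description: str
    limit_minutes: float

    def met(self, observed_minutes: float) -> bool:
        # The objective is met when the observed time does not exceed the limit.
        return observed_minutes <= self.limit_minutes


# Limits taken from the objectives listed above.
failover = TimedObjective("Test the database failover runbook", limit_minutes=10)
alerting = TimedObjective("Slack alert reaches the correct channel", limit_minutes=2)

# Observed values are made up for illustration.
print(failover.met(8.5))  # True
print(alerting.met(3.0))  # False
```

An objective such as "identify any gaps in the rollback procedure" has no time limit, so it would need a different success criterion, for example a count of documented gaps.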

Step 2: Prepare the Scenario

Write a realistic scenario that matches the objectives. Include the injection plan, expected system behavior, and notes for the facilitator.

# scenario-db-failover.yaml
scenario:
  title: "Primary Database Unreachable"
  description: "The primary PostgreSQL instance becomes unresponsive due to a network partition"
  injection:
    - fault: network-partition
      target: postgres-primary
      duration: 15m
      intensity: complete packet loss
  expected_response:
    - "On-call receives alert within 2 minutes"
    - "Read replica promoted to primary within 10 minutes"
    - "Application switches to read-only mode gracefully"
  facilitator_notes:
    - "Do not tell the team which fault will be injected"
    - "If the team fails to promote the replica within 15 minutes, call a timeout"
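
Before the drill, it can help to confirm that the fault lasts at least as long as the response deadlines. If the partition healed before the deadline, the team could never be observed completing the step. The following is a rough Python sketch of that check; `parse_minutes` and `deadlines_fit` are hypothetical helpers, and only the `m` and `h` duration suffixes are handled.

```python
import re


def parse_minutes(text: str) -> int:
    """Convert a duration such as '15m' or '2h' to minutes (only m and h are handled)."""
    match = re.fullmatch(r"(\d+)([mh])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = int(match.group(1)), match.group(2)
    return value * 60 if unit == "h" else value


def deadlines_fit(injection_duration: str, deadlines_minutes: list) -> bool:
    """Return True when every response deadline falls inside the fault window."""
    window = parse_minutes(injection_duration)
    return all(deadline <= window for deadline in deadlines_minutes)


# Values from the scenario above: a 15m fault with 2- and 10-minute deadlines.
print(deadlines_fit("15m", [2, 10]))  # True
print(deadlines_fit("5m", [2, 10]))   # False
```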

Step 3: Schedule and Communicate

Send a calendar invitation with the game day details at least one week in advance.

Game Day Notification Template:

What: Database Failover Game Day
When: 2026-06-28 14:00 UTC (duration 60 minutes)
Where: #incident-response Slack channel
Roles:
  - Incident Commander: Alice
  - Scribe: Bob
  - Facilitator: Charlie
  - Participants: Entire platform team
Expected outcomes: Tested failover runbook, identified gaps
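
The one-week notice can be turned into a concrete send-by time. This small Python sketch assumes the start time from the template above; `latest_invite_time` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone


def latest_invite_time(start: datetime, lead: timedelta = timedelta(weeks=1)) -> datetime:
    """Return the latest moment the invitation can go out and still give `lead` notice."""
    return start - lead


# Start time from the notification template above.
game_day_start = datetime(2026, 6, 28, 14, 0, tzinfo=timezone.utc)
print(latest_invite_time(game_day_start).isoformat())  # 2026-06-21T14:00:00+00:00
```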

Step 4: Execute the Game Day

The facilitator injects the fault while the team responds. The scribe documents everything.

# Facilitator injects the fault (hidden from participants)
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: game-day-db-partition
spec:
  action: partition
  mode: all
  selector:
    namespaces: ["production"]
    labelSelectors:
      app: postgres-primary
  direction: both
  duration: 15m
EOF

Expected output visible only to the facilitator:

networkchaos.chaos-mesh.org/game-day-db-partition created
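
While the fault is active, the scribe needs timestamps for each notable event so that response times can be compared with the objectives afterwards. A minimal sketch of such a log in Python follows; the `Timeline` class, the event labels, and the example timestamps are all hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


class Timeline:
    """Timestamped log of game day events, kept by the scribe."""

    def __init__(self) -> None:
        self.events = []  # list of (timestamp, label) tuples in recording order

    def record(self, label: str, at: Optional[datetime] = None) -> None:
        # Default to the current UTC time when no timestamp is supplied.
        self.events.append((at or datetime.now(timezone.utc), label))

    def minutes_between(self, first: str, second: str) -> float:
        # Use the first occurrence of each label and return the gap in minutes.
        times = {}
        for at, label in self.events:
            times.setdefault(label, at)
        return (times[second] - times[first]).total_seconds() / 60


# Example timestamps are invented for illustration.
start = datetime(2026, 6, 28, 14, 0, tzinfo=timezone.utc)
log = Timeline()
log.record("fault injected", start)
log.record("alert received", start + timedelta(minutes=1, seconds=30))
log.record("replica promoted", start + timedelta(minutes=9))

print(log.minutes_between("fault injected", "alert received"))    # 1.5
print(log.minutes_between("fault injected", "replica promoted"))  # 9.0
```

In this invented run, both the 2-minute alerting target and the 10-minute promotion target from the scenario would have been met.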

Step 5: Run the Retrospective

Immediately after the game day ends, hold a blameless retrospective. Focus on process improvements, not individual performance.

# retrospective-template.yaml
retrospective:
  what_went_well:
    - "Runbook was found within 30 seconds"
    - "Incident commander declared severity correctly"
  what_could_improve:
    - "Database credentials were not accessible to the secondary on-call"
    - "Slack alert channel had notifications muted"
  action_items:
    - "Store database credentials in a shared password manager accessible to all on-call engineers"
    - "Verify Slack notification settings for incident channels weekly"
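
A quick check at the end of the retrospective is whether every improvement item has a follow-up. The Python sketch below assumes, purely for illustration, that `action_items[i]` addresses `what_could_improve[i]`; that positional pairing is not part of the template, and `unaddressed_improvements` is a hypothetical helper.

```python
def unaddressed_improvements(retro: dict) -> list:
    """Return improvement items that have no action item at the same position.

    Assumes action_items[i] addresses what_could_improve[i]; this pairing is
    an illustrative convention, not part of the template.
    """
    improvements = retro.get("what_could_improve", [])
    actions = retro.get("action_items", [])
    return improvements[len(actions):]


# Same improvement items as the template, with one action item left out.
retro = {
    "what_could_improve": [
        "Database credentials were not accessible to the secondary on-call",
        "Slack alert channel had notifications muted",
    ],
    "action_items": [
        "Store database credentials in a shared password manager accessible to all on-call engineers",
    ],
}

print(unaddressed_improvements(retro))
# ['Slack alert channel had notifications muted']
```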

Learning Path

flowchart LR
  A[Designing Experiments] --> B[Game Days]
  B --> C[Chaos Mesh Platform]
  C --> D[LitmusChaos]
  D --> E[Automated Pipeline]
  style B fill:#f90,color:#fff

Common Errors

Making game days a surprise: Game days are drills, not real incidents. Participants should know when the drill occurs but not the specific fault.
Blaming individuals during retrospectives: Game days are about finding process gaps, not evaluating people. Keep retrospectives blameless.
Running game days too infrequently: Once a quarter is the minimum. Monthly is better. Skills atrophy without practice.
Not rotating the incident commander role: Everyone should practice leading. If the same person always takes command, you have a single point of failure.
Skipping the scribe role: Without documentation, the lessons learned disappear. Always assign a scribe.
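
Rotating the incident commander role can be as simple as a round-robin over the team roster. Here is a small Python sketch; `commander_schedule` is a hypothetical helper, and the roster reuses the names from the notification template.

```python
from itertools import cycle, islice


def commander_schedule(engineers: list, game_days: int) -> list:
    """Assign incident commanders round-robin so nobody repeats until all have led."""
    if not engineers:
        raise ValueError("need at least one engineer")
    return list(islice(cycle(engineers), game_days))


print(commander_schedule(["Alice", "Bob", "Charlie"], 5))
# ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob']
```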

Practice Questions

What is the difference between a game day and an automated chaos experiment?
Why should game day scenarios be kept secret from participants?
What roles should be assigned for a game day?
How do you measure the success of a game day?
What is a blameless retrospective and why is it important?

Challenge

Plan a complete game day for a microservices application with five services. Write the scenario, assign roles, prepare the fault injection, define success metrics, and create a retrospective template. Run the game day with your team and document the findings.

FAQ

How long should a game day last?

60 to 90 minutes is ideal. More than 90 minutes causes fatigue. Less than 60 minutes does not leave enough time for a full incident lifecycle.

How often should we run game days?

Quarterly is the minimum recommended cadence. Monthly is better for teams that are actively building resilience.

Who should participate in game days?

All engineers who might be involved in incident response: developers, SREs, DevOps engineers, and database administrators.

What if the team fails the game day scenario?

That is the best outcome. It means you found a gap. Fix the gap and rerun the same scenario in the next game day to confirm improvement.

Should game days be run in production?

Start in staging. Graduate to production only when the team consistently passes staging scenarios with no issues.
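
"Consistently passes" can be made concrete as a streak of successful staging runs. The sketch below is one possible reading in Python; the function name `ready_for_production` and the default streak of three are assumptions, not a figure from this tutorial.

```python
def ready_for_production(staging_results: list, required_streak: int = 3) -> bool:
    """Return True when the most recent `required_streak` staging game days all passed.

    `staging_results` is ordered oldest to newest; True means the scenario passed.
    The default streak of 3 is an arbitrary illustrative threshold.
    """
    if required_streak <= 0:
        raise ValueError("required_streak must be positive")
    recent = staging_results[-required_streak:]
    return len(recent) == required_streak and all(recent)


print(ready_for_production([True, False, True, True, True]))  # True
print(ready_for_production([True, True, False]))              # False
print(ready_for_production([True, True]))                     # False
```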
