Chaos Engineering

Chaos engineering tutorials — chaos principles, steady-state hypothesis, blast radius, Chaos Mesh, Litmus, Gremlin, AWS Fault Injection, latency injection, fault proxies, resilience testing, and game days

91 Published

In this tutorial series, you will learn about chaos engineering. The tutorials cover key concepts, practical examples, and best practices for resilience testing.

Comprehensive chaos engineering tutorials covering everything from steady-state hypotheses and blast radius to advanced multi-fault experiments and real-world game days.

Additional Classic Tutorials

Advanced Chaos Experiments -- Multi-Fault & Orchestrated Testing
AWS Chaos Engineering -- Fault Injection Service for Cloud Workloads
AWS Chaos Pipeline -- Automated FIS Experiments with CI/CD
AWS Fault Injection Service -- Testing AWS Workloads
Azure Chaos Pipeline -- Automated Experiments with DevOps
Azure Chaos Studio -- Chaos Experiments on Azure
Azure Chaos Studio Guide -- Managed Fault Injection for Azure Resources
Blast Radius -- Minimizing Impact of Chaos Experiments
Chaos Engineering Overview -- Building Resilient Systems
Chaos Engineering Pipeline -- Automating Experiments in CI/CD
Designing Chaos Experiments -- From Idea to Execution
Chaos Mesh -- Kubernetes Chaos Engineering Platform
Chaos Mesh Advanced -- Workflows, Schedules & Custom Faults
Chaos Mesh on Kubernetes -- Practical Fault Injection Guide
Observability in Chaos Engineering -- Metrics, Traces & Logs
Chaos Engineering Principles -- Steady State & Hypothesis
Database Chaos Engineering -- PostgreSQL, MySQL & Redis Resilience
Database Chaos -- Connection Drops, Replication Lag & Corruption
Dependency Testing -- Testing External Service Failures
Designing Chaos Experiments -- Structured Fault Injection for Resilient Systems
Fault Injection Proxy -- Toxiproxy & Service Mesh Chaos
Game Days -- Running Chaos Drills with Your Team
Gremlin Advanced -- Scenarios, Containers & API Automation
Gremlin Platform -- Managed Chaos Engineering for Production Systems
Gremlin Platform -- Managed Chaos Engineering Service
Infrastructure Faults -- CPU, Memory, Disk & Node Failures
Kubernetes Chaos -- Pod Failures, DNS Issues & Resource Pressure
Kubernetes Chaos Testing -- Pod, Node & Cluster Resilience
Latency Injection -- Simulating Network & Service Delays
LitmusChaos Advanced -- Workflows, GitOps & Resilience Scores
LitmusChaos Guide -- Cloud-Native Chaos Engineering for Kubernetes
LitmusChaos -- Cloud-Native Chaos Engineering
Network Chaos Testing -- Latency, Packet Loss & Bandwidth Limits
Network Partitioning -- Simulating Split-Brain Scenarios
Resilience Testing -- Circuit Breakers, Retries & Timeouts
Steady State Hypothesis -- Defining Normal Behavior

Published Topics

Chaos Engineering Overview — Building Resilient Systems

Learn chaos engineering fundamentals: running controlled experiments on distributed systems to build confidence in production resilience and fault tolerance.

✓ Live

Chaos Engineering Principles — Steady State & Hypothesis

Master the core principles of chaos engineering: defining steady state, forming hypotheses about system behavior, and running controlled experiments to verify resilience.

✓ Live

Blast Radius — Minimizing Impact of Chaos Experiments

Learn blast radius concepts for chaos engineering: how to limit experiment scope, use safe guardrails, and expand gradually from staging to production environments.

✓ Live

Steady State Hypothesis — Defining Normal Behavior

Learn how to define steady state hypotheses for chaos experiments: selecting metrics, setting thresholds, and writing falsifiable predictions about system behavior under fault conditions.

✓ Live

Designing Chaos Experiments — From Idea to Execution

Learn how to design chaos experiments step by step: from identifying system weaknesses to writing hypotheses, selecting faults, executing safely, and analyzing results.

✓ Live

Game Days — Running Chaos Drills with Your Team

Learn how to plan and run game days: structured chaos engineering drills where teams practice incident response, test runbooks, and build muscle memory for real outages.

✓ Live

Chaos Mesh — Kubernetes Chaos Engineering Platform

Learn Chaos Mesh: an open-source Kubernetes chaos engineering platform for injecting faults into pods, networks, and systems with safe experimentation controls.

✓ Live

LitmusChaos — Cloud-Native Chaos Engineering

Learn LitmusChaos: a cloud-native chaos engineering platform for Kubernetes with workflow orchestration, GitOps integration, and automated resilience scoring.

✓ Live

Gremlin Platform — Managed Chaos Engineering Service

Learn Gremlin: a managed chaos engineering platform for simulating real-world failures with safe attack types, guardrails, and team collaboration features.

✓ Live

AWS Fault Injection Service — Testing AWS Workloads

Learn AWS Fault Injection Service (FIS): a managed chaos engineering service for testing AWS workloads with pre-built fault templates and safety controls.

✓ Live

Azure Chaos Studio — Chaos Experiments on Azure

Learn Azure Chaos Studio: a managed chaos engineering service for running faults on Azure resources with role-based access control and safety guardrails.

✓ Live

Latency Injection — Simulating Network & Service Delays

Learn latency injection techniques for chaos engineering: simulating network delays with tc, proxy-based methods, and platform tools to test timeout and retry behavior.

✓ Live

Fault Injection Proxy — Toxiproxy & Service Mesh Chaos

Learn fault injection proxy techniques using Toxiproxy and service mesh tools to simulate service failures, latency, and network degradation in application-layer testing.

✓ Live

Dependency Testing — Testing External Service Failures

Learn dependency testing techniques in chaos engineering: simulating external API failures, downstream service outages, and third-party service degradation scenarios.

✓ Live

Resilience Testing — Circuit Breakers, Retries & Timeouts

Learn resilience testing patterns: verifying circuit breakers, retry logic, and timeout configurations through chaos experiments to build robust distributed systems.

✓ Live

Database Chaos — Connection Drops, Replication Lag & Corruption

Learn database chaos engineering techniques: simulating connection drops, replication lag, data corruption, and connection pool exhaustion to test database resilience.

✓ Live

Network Partitioning — Simulating Split-Brain Scenarios

Learn network partitioning chaos engineering: simulating split-brain scenarios with network splits, asymmetric partitions, and partial connectivity loss between services.

✓ Live

Infrastructure Faults — CPU, Memory, Disk & Node Failures

Learn infrastructure chaos engineering: simulating CPU exhaustion, memory pressure, disk fill, IO throttling, and node failures to test infrastructure resilience.

✓ Live

Kubernetes Chaos — Pod Failures, DNS Issues & Resource Pressure

Learn Kubernetes chaos engineering: injecting pod failures, DNS resolution errors, node resource pressure, and container-level faults to test cluster resilience.

✓ Live

Chaos Engineering Pipeline — Automating Experiments in CI/CD

Learn how to build a chaos engineering pipeline: automating experiments in CI/CD with GitOps, gating deployments on resilience tests, and measuring reliability metrics.

✓ Live

Designing Chaos Experiments — Structured Fault Injection for Resilient Systems

Learn how to design chaos experiments for production systems: hypothesis formulation, fault selection, blast radius planning, execution workflows, and result analysis with real YAML and Python examples.

✓ Live

Chaos Mesh on Kubernetes — Practical Fault Injection Guide

Learn Chaos Mesh for Kubernetes chaos engineering: installation, fault types, pod-kill and network latency experiments, scheduled chaos, and dashboard monitoring with YAML and bash examples.

✓ Live

LitmusChaos Guide — Cloud-Native Chaos Engineering for Kubernetes

Learn LitmusChaos for Kubernetes chaos engineering: installation, ChaosHub experiments, workflow orchestration, GitOps integration, resilience scoring, and automated pipeline testing.

✓ Live

Gremlin Platform — Managed Chaos Engineering for Production Systems

Learn Gremlin chaos engineering platform: attack types, safe execution modes, API-driven experiments, and integrating Gremlin into CI/CD pipelines for production resilience testing.

✓ Live

AWS Chaos Engineering — Fault Injection Service for Cloud Workloads

Learn AWS Fault Injection Service (FIS): creating experiment templates, targeting EC2, ECS, EKS, and RDS, setting CloudWatch stop conditions, and automating chaos engineering on AWS.

✓ Live

Azure Chaos Studio Guide — Managed Fault Injection for Azure Resources

Learn Azure Chaos Studio: enabling targets, creating experiments with agent-based and agentless faults, RBAC configuration, Azure Monitor safety guards, and AKS chaos testing.

✓ Live

Advanced Chaos Experiments — Multi-Fault & Orchestrated Testing

Learn advanced chaos experiment design: multi-fault injection, orchestrated scenarios, failure chains, and automated hypothesis validation for production-grade resilience testing.

✓ Live

Chaos Mesh Advanced — Workflows, Schedules & Custom Faults

Learn advanced Chaos Mesh features: workflow orchestration, scheduled experiments, custom fault types, Chaos Dashboard, and multi-cluster chaos engineering.

✓ Live

LitmusChaos Advanced — Workflows, GitOps & Resilience Scores

Learn advanced LitmusChaos features: complex workflow orchestration, GitOps integration with ArgoCD, automated resilience scoring, custom probes, and chaos schedules.

✓ Live

Gremlin Advanced — Scenarios, Containers & API Automation

Learn advanced Gremlin features: multi-step scenarios, container attacks, API-driven automation, team management, and integration with monitoring systems.

✓ Live

AWS Chaos Pipeline — Automated FIS Experiments with CI/CD

Learn how to build an automated AWS chaos engineering pipeline using AWS FIS, EventBridge, Lambda, and CI/CD integration for continuous resilience validation.

✓ Live

Azure Chaos Pipeline — Automated Experiments with DevOps

Learn how to build an automated Azure chaos engineering pipeline using Azure Chaos Studio, DevOps integration, ARM templates, and automated safety guardrails.

✓ Live

Kubernetes Chaos Testing — Pod, Node & Cluster Resilience

Learn comprehensive Kubernetes chaos testing strategies: pod-level faults, node disruptions, cluster API failures, RBAC testing, and etcd quorum validation.

✓ Live

Database Chaos Engineering — PostgreSQL, MySQL & Redis Resilience

Learn database chaos engineering techniques for PostgreSQL, MySQL, and Redis: connection pool exhaustion, replication lag, failover testing, cache evictions, and data consistency validation.

✓ Live

Network Chaos Testing — Latency, Packet Loss & Bandwidth Limits

Learn network chaos testing techniques: latency injection, packet loss simulation, bandwidth throttling, DNS manipulation, and asymmetric network failures for distributed system resilience.

✓ Live

Observability in Chaos Engineering — Metrics, Traces & Logs

Learn how to observe chaos experiments with Prometheus metrics, distributed tracing, structured logging, and custom dashboards for experiment impact analysis.

✓ Live

Chaos Terminology — Complete Guide

Learn essential chaos engineering terminology, such as blast radius, steady-state hypotheses, and fault injection, to build a strong foundation for resilience testing.

✓ Live

Attack Surface — Complete Guide

Learn to identify and map system attack surfaces for chaos experiments, covering API endpoints, service dependencies, data flows, and infrastructure touch points.

✓ Live

Failure Modes — Complete Guide

Learn to classify and analyze system failure modes with fault trees and FMEA methodology to prioritize chaos experiments for maximum resilience impact.

✓ Live

Fault Injection — Complete Guide

Learn fault injection techniques, including code, infrastructure, and network faults, to simulate real-world failures and validate system resilience across layers.

✓ Live

Fault Domains — Complete Guide

Learn to identify and isolate fault domains in distributed systems to limit blast radius, prevent cascading failures, and design chaos experiment boundaries.

✓ Live

Chaos in Production — Complete Guide

Learn best practices for production chaos experiments, including guardrails, rollback procedures, and progressive rollout strategies for safe resilience testing.

✓ Live

Friday Afternoon Testing — Complete Guide

Learn the Friday afternoon testing methodology for chaos engineering: scheduling safe experiments before weekends to find issues without production impact.

✓ Live

Chaos Engineering in Regulated Industries

Learn to implement chaos engineering programs in regulated industries while maintaining SOC2, HIPAA, PCI-DSS, and other compliance framework requirements.

✓ Live

SOC2 and Chaos Engineering — Complete Guide

Learn to integrate chaos engineering with SOC2 compliance programs, using resilience testing evidence to meet availability and security control requirements.

✓ Live

FinOps and Chaos Engineering — Complete Guide

Learn to combine chaos engineering with FinOps practices to understand failure cost implications and optimize spending on resilience and disaster recovery.

✓ Live

Cost of Chaos — Complete Guide

Learn to analyze the financial impact of system failures with chaos engineering data to build business cases for resilience investments and capacity planning.

✓ Live

Resource Exhaustion Chaos — Complete Guide

Learn to simulate resource exhaustion scenarios with CPU, memory, disk, and connection pool saturation to validate system behavior under extreme load conditions.

✓ Live

Kernel Chaos Engineering — Complete Guide

Learn kernel-level chaos engineering using eBPF and system call fault injection to test application resilience against operating system-level failures.

✓ Live

System Call Fault Injection — Complete Guide

Learn to inject system call faults using ptrace, seccomp, and eBPF to simulate OS failures and validate application error handling at the kernel interface.

✓ Live

Filesystem Chaos — Complete Guide

Learn filesystem chaos engineering, including disk I/O failures, permission errors, inode exhaustion, and read-only mounts, to test application resilience.

✓ Live

I/O Stress Testing — Complete Guide

Learn I/O stress testing methods to simulate disk latency, throttled throughput, and IOPS exhaustion scenarios for validating behavior under storage pressure.

✓ Live

Time Chaos Engineering — Complete Guide

Learn time-based chaos engineering, including clock skew, time zone manipulation, and leap second simulation, to test time-dependent system behavior.

✓ Live

Clock Skew Simulation — Complete Guide

Learn to simulate clock skew in distributed systems to test time synchronization dependencies, authentication tokens, and distributed consensus mechanisms.

✓ Live

NTP Failure Simulation — Complete Guide

Learn to simulate NTP server failures and time sync loss to validate application behavior when system clocks drift across distributed infrastructure nodes.

✓ Live

TLS Chaos Engineering — Complete Guide

Learn TLS chaos engineering, including certificate expiration, revoked certificates, cipher mismatches, and protocol downgrade, to test secure communication.

✓ Live

mTLS Failure Injection — Complete Guide

Learn mTLS failure injection techniques to test service mesh and inter-service communication resilience when handshakes fail or certificates become invalid.

✓ Live

OAuth Chaos Engineering — Complete Guide

Learn OAuth chaos engineering by simulating provider failures and token expiration in Kubernetes to test authentication resilience and token refresh mechanisms.

✓ Live

Rate Limit Chaos — Complete Guide

Learn rate limit chaos engineering to test API gateway and service behavior under throttled conditions, ensuring proper backpressure and client retry handling.

✓ Live

Resource Quota Chaos — Complete Guide

Learn to simulate resource quota exhaustion in Kubernetes namespaces to test application behavior when CPU, memory, or storage quotas are reached and enforced.

✓ Live

Dependency Chaos — Complete Guide

Learn dependency chaos engineering to test application resilience when upstream services, databases, or external APIs become unavailable or return errors.

✓ Live

Downstream Failure Testing — Complete Guide

Learn to simulate downstream service failures to test consumer resilience, circuit breakers, and fallback mechanisms in microservice architectures.

✓ Live

Upstream Failure Testing — Complete Guide

Learn to test upstream service dependency failures with chaos engineering to validate graceful degradation, caching strategies, and error propagation controls.

✓ Live

Cascading Failure Prevention — Complete Guide

Learn to simulate and prevent cascading failures using chaos engineering, circuit breakers, bulkheads, and load shedding strategies for distributed systems.

✓ Live

Circuit Breaker Chaos — Complete Guide

Learn to validate circuit breaker implementations with chaos engineering by simulating failures and measuring transitions between the open, half-open, and closed states.

✓ Live

Retry Storm Prevention — Complete Guide

Learn retry storm prevention with chaos engineering techniques by simulating transient failures and measuring exponential backoff and jitter effectiveness.

✓ Live

Backpressure Chaos — Complete Guide

Learn backpressure chaos engineering to test system behavior under load, validating flow control mechanisms and preventing overload in reactive systems.

✓ Live

Message Queue Chaos — Complete Guide

Learn message queue chaos engineering, including broker failures, partition elections, consumer lag spikes, and message loss scenarios, for resilience testing.

✓ Live

Kafka Broker Failure — Complete Guide

Learn Kafka broker failure simulation to test producer retry logic, consumer rebalancing, partition availability, and end-to-end message delivery guarantees.

✓ Live

ZooKeeper Failure Simulation — Complete Guide

Learn ZooKeeper failure scenarios, including leader election, quorum loss, and session expiration, to test distributed coordination and recovery procedures.

✓ Live

etcd Failure Injection — Complete Guide

Learn etcd failure injection, including quorum loss, compaction storms, and network partitions, to test Kubernetes and distributed system control plane resilience.

✓ Live

Consul Failure Simulation — Complete Guide

Learn Consul failure simulation, including server outages, service catalog corruption, and health check failures, to test service mesh and discovery resilience.

✓ Live

Service Discovery Failure — Complete Guide

Learn service discovery failure scenarios, including DNS outages, registry corruption, and stale endpoints, to test application resilience in dynamic environments.

✓ Live

Config Store Failure — Complete Guide

Learn config store failure injection for etcd, Consul, and ZooKeeper to test application behavior when dynamic configuration becomes unavailable or corrupt.

✓ Live

Chaos Engineering Automation — Complete Guide

Learn to automate chaos engineering experiments using CI/CD pipelines, scheduled workflows, and GitOps for continuous resilience validation across releases.

✓ Live

Schedulable Chaos Experiments — Complete Guide

Learn to schedule chaos experiments as cron jobs and Kubernetes CronJobs for regular resilience testing without manual intervention or operational overhead.

✓ Live

Chaos Workflow Orchestration — Complete Guide

Learn chaos engineering workflow orchestration using Argo Workflows, Tekton, or custom pipelines to run multi-step experiments with validation gate checks.

✓ Live

Rollback Chaos Testing — Complete Guide

Learn to test rollback procedures with chaos engineering by injecting failures during deployments to validate automated rollback triggers and consistency.

✓ Live

Deployment Chaos Engineering — Complete Guide

Learn deployment chaos engineering to test canary releases, blue-green deployments, and rolling updates by injecting failures during transition phases and measuring impact.

✓ Live

Canary Chaos Engineering — Complete Guide

Learn canary chaos engineering to validate progressive delivery strategies by injecting failures into canary deployments and measuring observability signals.

✓ Live

Traffic Splitting Chaos — Complete Guide

Learn traffic splitting chaos to test service mesh and API gateway routing rules by injecting failures into specific traffic subsets and measuring impact.

✓ Live

Load Shedding Chaos — Complete Guide

Learn load shedding chaos engineering to validate graceful degradation and request prioritization when systems approach capacity limits under simulated stress.

✓ Live

Autoscaling Chaos Engineering — Complete Guide

Learn autoscaling chaos engineering to test HPA, VPA, and cluster autoscaler scaling behavior under simulated load spikes and infrastructure failure scenarios.

✓ Live

HPA Chaos Engineering — Complete Guide

Learn HPA chaos engineering to test scaling policies, metric collection, and pod scaling behavior under simulated CPU and memory pressure conditions.

✓ Live

VPA Chaos Engineering — Complete Guide

Learn Vertical Pod Autoscaler chaos engineering to test resource recommendation accuracy and pod restarts under simulated workload variation scenarios.

✓ Live

Cluster Autoscaler Chaos — Complete Guide

Learn cluster autoscaler chaos engineering to test node pool scaling behavior under simulated resource constraints and pending pod scenarios across providers.

✓ Live

Spot Instance Chaos — Complete Guide

Learn spot instance chaos engineering to test application resilience against preemption notifications and instance termination events in cloud environments.

✓ Live

Region Failure Simulation — Complete Guide

Learn region failure simulation techniques to test disaster recovery, cross-region failover, and data replication under complete regional outages.

✓ Live

Azure Availability Zone Failure — Complete Guide

Learn Azure availability zone failure simulation to test zone-redundant application resilience and data replication across isolated availability zones.

✓ Live

Multi-Region Chaos Engineering — Complete Guide

Learn multi-region chaos engineering to test global load balancing, cross-region replication, and failover strategies under regional degradation conditions.

✓ Live

Cross-Region Chaos Engineering — Complete Guide

Learn cross-region chaos experiments to validate active-active and active-passive architectures with realistic inter-region latency and failure injection.

✓ Live

All 91 topics in Chaos Engineering — Complete Guide are published.