Gremlin Advanced — Scenarios, Containers & API Automation
This tutorial covers Gremlin's advanced features through worked examples and best practices.
Gremlin's advanced capabilities include multi-step scenario orchestration, container-level attacks, API-driven automation, and team-based access control. Chaos Engineering teams use these features to build automated resilience testing into their platform operations.
What You Will Learn
This tutorial teaches you how to create complex Gremlin scenarios with conditional steps, run attacks inside Docker and Kubernetes containers, automate experiments using the Gremlin API, and manage team permissions for safe experiment execution.
Why It Matters
The Gremlin platform provides enterprise-grade controls for scaling Chaos Engineering across large organizations. Advanced features like scenarios, container attacks, and API automation enable teams to run hundreds of experiments per week without manual intervention while maintaining strict safety controls.
Real-World Use
DodaTech uses Gremlin scenarios to simulate a full "region degradation" event. The scenario runs 12 sequential attacks across three availability zones, testing whether the Durga Antivirus Pro update service can survive a partial region failure without serving stale virus definitions.
Prerequisites
Before starting, you should understand:
- Basic Gremlin agent installation and attack types
- Chaos Engineering experiment design
- Docker and Kubernetes container concepts
- REST API fundamentals
Step 1: Create Advanced Scenarios
Scenarios chain attacks into ordered steps. Each step can declare the steps it depends on, which lets you model a failure that worsens in stages:
# advanced-scenario.yaml
gremlin_scenario:
  name: "Multi-Stage Database Degradation"
  description: "Simulates progressive database failure over 4 minutes"
  steps:
    - name: increase-latency
      type: latency
      targets:
        - host: db-primary
      args:
        latency: 100
        port: 5432
        length: 60
    - name: add-cpu-pressure
      type: cpu
      targets:
        - host: db-primary
      args:
        cpuPercent: 50
        length: 120
      depends_on:
        - increase-latency
    - name: block-traffic
      type: blackhole
      targets:
        - host: db-replica
      args:
        port: 5432
        length: 60
      depends_on:
        - add-cpu-pressure
# Create the scenario from the steps file
gremlin scenario create \
  --name "Multi-Stage Database Degradation" \
  --description "Simulates progressive database failure" \
  --steps-file advanced-scenario.yaml
# Expected output:
# ✅ Scenario created: scenario-abc-123
# Steps:
# 1. increase-latency (60s)
# 2. add-cpu-pressure (120s) - wait for step 1
# 3. block-traffic (60s) - wait for step 2
# Total duration: 4 minutes
# Run the scenario
gremlin scenario run --scenario-id scenario-abc-123
# Expected output:
# ✅ Scenario 'Multi-Stage Database Degradation' is now running
# Step 1/3: increase-latency (running)
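The `depends_on` chain above determines both the run order and the total runtime. As a minimal sketch (not part of Gremlin's tooling), the following Python resolves the dependencies into an execution order and sums the step lengths, assuming the steps run strictly one after another. The `steps` list mirrors the scenario file above; `resolve_order` and `total_duration` are hypothetical helper names.

```python
from graphlib import TopologicalSorter

# Mirrors the steps in advanced-scenario.yaml (lengths in seconds).
steps = [
    {"name": "increase-latency", "length": 60, "depends_on": []},
    {"name": "add-cpu-pressure", "length": 120, "depends_on": ["increase-latency"]},
    {"name": "block-traffic", "length": 60, "depends_on": ["add-cpu-pressure"]},
]


def resolve_order(steps):
    """Return step names in an order that respects depends_on.

    Raises ValueError for an unknown dependency and graphlib.CycleError
    if the dependencies form a cycle.
    """
    known = {step["name"] for step in steps}
    graph = {}
    for step in steps:
        missing = set(step["depends_on"]) - known
        if missing:
            raise ValueError(
                f"{step['name']} depends on unknown steps: {sorted(missing)}"
            )
        # TopologicalSorter expects a mapping of node -> predecessors.
        graph[step["name"]] = set(step["depends_on"])
    return list(TopologicalSorter(graph).static_order())


def total_duration(steps):
    """Total runtime in seconds, assuming steps run sequentially."""
    return sum(step["length"] for step in steps)


if __name__ == "__main__":
    print(resolve_order(steps))
    print(total_duration(steps))
```

For the scenario above this gives the order increase-latency, add-cpu-pressure, block-traffic and a total of 240 seconds, which matches the 4-minute total in the CLI output.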
Step 2: Run Container Attacks
Gremlin can target individual containers directly, rather than the whole host they run on:
# List running containers on a target host
gremlin containers
# Expected output:
# CONTAINER ID   IMAGE                   STATUS    HOST
# a1b2c3d4e5f6   dodatech/web-service    running   app-server-1
# f6e5d4c3b2a1   dodatech/auth-service   running   app-server-1
# x7y8z9a0b1c2   postgres:16             running   db-server-1
# Kill a specific container
gremlin attack container \
  --container-id a1b2c3d4e5f6 \
  --length 30
# Expected output:
# ✅ Container attack created
# Container: a1b2c3d4e5f6 (dodatech/web-service)
# Action: kill
# Duration: 30 seconds
# Verify the container restarts (a low uptime in STATUS indicates a fresh start)
docker ps --filter id=a1b2c3d4e5f6
# Expected output:
# CONTAINER ID   IMAGE                  STATUS
# a1b2c3d4e5f6   dodatech/web-service   Up 2 seconds
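A single `docker ps` call can miss a slow restart, so it can help to poll instead. The sketch below reads the container's `State.StartedAt` timestamp with `docker inspect` and waits until it differs from the value recorded before the attack. The function names, timeout, and interval are illustrative assumptions, and the timestamp reader is injected so the loop can be exercised without Docker.

```python
import subprocess
import time


def read_started_at(container_id):
    """Return the container's State.StartedAt timestamp via `docker inspect`.

    Raises subprocess.CalledProcessError if the container does not exist.
    """
    result = subprocess.run(
        ["docker", "inspect", "--format", "{{.State.StartedAt}}", container_id],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def wait_for_restart(container_id, started_before, timeout=60.0, interval=2.0,
                     read=read_started_at, sleep=time.sleep):
    """Poll until StartedAt differs from started_before.

    Returns True if a restart was observed within the timeout, else False.
    """
    waited = 0.0
    while waited <= timeout:
        if read(container_id) != started_before:
            return True
        sleep(interval)
        waited += interval
    return False
```

In use, record `started_before = read_started_at("a1b2c3d4e5f6")` before launching the attack, then call `wait_for_restart("a1b2c3d4e5f6", started_before)` once the attack has been created.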
Step 3: Automate with the Gremlin API
Use the Gremlin API to programmatically create and run attacks:
#!/usr/bin/env python3
"""Automated chaos experiment using Gremlin API."""
import os
import time

import requests

API_BASE = "https://api.gremlin.com/v1"
API_KEY = os.environ["GREMLIN_API_KEY"]
TEAM_ID = os.environ["GREMLIN_TEAM_ID"]

headers = {
    "Authorization": f"Key {API_KEY}",
    "Content-Type": "application/json",
}


def run_cpu_attack(host, cpu_percent, duration_seconds):
    """Start a CPU attack on one host and return the parsed API response."""
    payload = {
        "target": {
            "type": "host",
            "exact": [host],
        },
        "attackType": "cpu",
        "args": {
            "cpuPercent": cpu_percent,
            "duration": duration_seconds,
        },
    }
    response = requests.post(
        f"{API_BASE}/attacks/new",
        headers=headers,
        # Assumed: the team is selected with the teamId query parameter.
        params={"teamId": TEAM_ID},
        json=payload,
        timeout=30,
    )
    response.raise_for_status()  # fail loudly instead of parsing an error body
    return response.json()


def run_latency_attack(host, latency_ms, port, duration_seconds):
    """Start a latency attack on one host and return the parsed API response."""
    payload = {
        "target": {
            "type": "host",
            "exact": [host],
        },
        "attackType": "latency",
        "args": {
            "latency": latency_ms,
            "port": port,
            "duration": duration_seconds,
        },
    }
    response = requests.post(
        f"{API_BASE}/attacks/new",
        headers=headers,
        params={"teamId": TEAM_ID},
        json=payload,
        timeout=30,
    )
    response.raise_for_status()
    return response.json()


# Execute a sequential experiment: wait for each attack to finish
# before starting the next one.
print("Starting CPU attack on app-server-1...")
result = run_cpu_attack("app-server-1", 80, 60)
print(f"Attack ID: {result.get('attackId')}")
time.sleep(60)

print("Starting latency attack on app-server-1...")
result = run_latency_attack("app-server-1", 300, 8080, 60)
print(f"Attack ID: {result.get('attackId')}")
time.sleep(60)

print("Experiments completed successfully.")
# Expected output:
# Starting CPU attack on app-server-1...
# Attack ID: cpu-attack-abc-001
# Starting latency attack on app-server-1...
# Attack ID: lat-attack-abc-002
# Experiments completed successfully.
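The script above waits a fixed 60 seconds after each attack, which assumes every attack ends exactly on schedule. A more robust pattern is to poll the attack until it reports a terminal state. The sketch below is written under stated assumptions: the status lookup is passed in as a function (for example, one that wraps a GET request for the attack ID), and the `stage` field and the terminal stage names are placeholders to check against the Gremlin API reference.

```python
import time

# Assumed terminal stage names; confirm them in the Gremlin API reference.
TERMINAL_STAGES = frozenset({"Successful", "Failed", "Halted"})


def wait_for_attack(attack_id, fetch_status, timeout=300.0, interval=5.0,
                    sleep=time.sleep):
    """Poll fetch_status(attack_id) until it returns a terminal stage.

    fetch_status is a caller-supplied function returning a dict with a
    "stage" key (a hypothetical response shape). Returns the final stage,
    or raises TimeoutError if none is reached within the timeout.
    """
    waited = 0.0
    while waited <= timeout:
        stage = fetch_status(attack_id).get("stage")
        if stage in TERMINAL_STAGES:
            return stage
        sleep(interval)
        waited += interval
    raise TimeoutError(f"attack {attack_id} still running after {timeout:.0f}s")
```

Returning the final stage, rather than a boolean, lets the caller distinguish an attack that ran to completion from one that was stopped early by a safety control.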
Step 4: Configure Team Permissions and Safety Controls
Set up role-based access for different teams:
# Create a team with restricted blast radius
gremlin team create \
  --name "backend-team" \
  --max-attack-targets 3 \
  --max-attack-duration 300 \
  --allowed-attacks "cpu,latency,process-kill"
# Expected output:
# ✅ Team 'backend-team' created
# Restrictions:
# - Max targets: 3
# - Max duration: 300 seconds
# - Allowed attacks: cpu, latency, process-kill
# Add a halt-on-alert integration
gremlin integration create \
  --type pagerduty \
  --name "Chaos Alerts" \
  --routing-key "chaos-engineering"
# Expected output:
# ✅ PagerDuty integration 'Chaos Alerts' created
# Halt-on-alert will stop all active attacks when a PagerDuty incident is triggered.
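Team restrictions reject an out-of-policy attack only after it has been submitted. As a complement, an automation script can check a planned attack against the same limits before calling the API. This is a minimal client-side sketch, not a Gremlin feature; `TeamLimits` and `check_attack` are hypothetical names, and the values mirror the `backend-team` example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TeamLimits:
    max_targets: int
    max_duration: int  # seconds
    allowed_attacks: frozenset


# Mirrors the backend-team restrictions shown above.
BACKEND_TEAM = TeamLimits(
    max_targets=3,
    max_duration=300,
    allowed_attacks=frozenset({"cpu", "latency", "process-kill"}),
)


def check_attack(limits, attack_type, targets, duration):
    """Return a list of policy violations; an empty list means the attack fits."""
    violations = []
    if attack_type not in limits.allowed_attacks:
        violations.append(f"attack type '{attack_type}' is not allowed")
    if len(targets) > limits.max_targets:
        violations.append(
            f"{len(targets)} targets exceeds the limit of {limits.max_targets}"
        )
    if duration > limits.max_duration:
        violations.append(
            f"duration {duration}s exceeds the limit of {limits.max_duration}s"
        )
    return violations
```

Collecting every violation in one pass, instead of stopping at the first, gives the scenario author a complete list of what to change.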
Learning Path
flowchart LR
    A[Gremlin Basics] --> B[Gremlin Advanced]
    B --> C[AWS Chaos Pipeline]
    C --> D[Azure Chaos Pipeline]
    D --> E[Chaos Observability]
    style B fill:#f90,color:#fff
Common Errors
- Insufficient recovery time between scenario steps: If step 1 causes a pod restart and step 2 starts before the pod is ready, the combined effect may be more severe than intended.
- Container attacks on ephemeral containers: An orchestrator such as Kubernetes may replace a killed container before the attack duration expires. The replacement has a new container ID, so an attack that targets the old ID no longer applies to it.
- API rate limits on automated experiments: The Gremlin API has rate limits. Batch experiments with delays to avoid hitting the limit.
- Team permission misconfiguration: If a team is restricted to 3 targets but a scenario targets 5, the scenario will fail. Align scenario design with team permissions.
- Not testing halt-on-alert integrations: Configure a test alert to verify that halt-on-alert actually stops running attacks. This is one integration you should test regularly.
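For the rate-limit case in the list above, one common approach is to retry with exponential backoff. The sketch below assumes the limit is signalled with HTTP status 429, the standard code for it; how the Gremlin API actually reports rate limiting should be confirmed in its reference. `post_with_backoff` is a hypothetical helper, and the request is passed in as a callable so it works with any HTTP client.

```python
import time


def post_with_backoff(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-429 response or attempts run out.

    send is a zero-argument callable returning an object with a
    status_code attribute (for example a requests.Response).
    The delay doubles after each rate-limited attempt: 1s, 2s, 4s, ...
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        response = send()
        if response.status_code != 429:
            return response
        if attempt < max_attempts:
            sleep(delay)
            delay *= 2
    return response  # still rate-limited after the final attempt
```

With the `requests` library, a call could look like `post_with_backoff(lambda: requests.post(url, headers=headers, json=payload))`, where `url`, `headers`, and `payload` are whatever the script already builds.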
Practice Questions
- How do Gremlin scenarios differ from single attacks?
- What is the difference between host-level and container-level attacks in Gremlin?
- How do you authenticate with the Gremlin API for automation?
- What team-level safety controls does Gremlin provide?
- How does halt-on-alert integration work?
Challenge
Build a Python script that uses the Gremlin API to run a weekly "deployment validation" experiment. The script should: authenticate with the API, run a sequence of three attacks (CPU 50 percent for 30s, latency 200ms for 60s, process kill for 10s) against a staging server, wait for all attacks to complete, check the experiment status, and report pass or fail based on whether any halts were triggered.