
Gremlin Advanced — Scenarios, Containers & API Automation

DodaTech. Updated 2026-06-23.

This tutorial covers Gremlin's advanced capabilities: multi-step scenario orchestration, container-level attacks, API-driven automation, and team-based access control. Chaos engineering teams use these features to build automated resilience testing into their platform operations.

What You Will Learn

This tutorial teaches you how to create complex Gremlin scenarios with conditional steps, run attacks inside Docker and Kubernetes containers, automate experiments using the Gremlin API, and manage team permissions for safe experiment execution.

Why It Matters

The Gremlin platform provides controls for scaling chaos engineering across large organizations. Scenarios, container attacks, and API automation let teams run hundreds of experiments per week without manual intervention while maintaining strict safety controls.

Real-World Use

DodaTech uses Gremlin scenarios to simulate a full "region degradation" event. The scenario runs 12 sequential attacks across three availability zones, testing whether the Durga Antivirus Pro update service can survive a partial region failure without serving stale virus definitions.

Prerequisites

Before starting you should understand:

  • Basic Gremlin agent installation and attack types
  • Chaos engineering experiment design
  • Docker and Kubernetes container concepts
  • REST API fundamentals

Step 1: Create Advanced Scenarios

A scenario chains attacks so that each step starts only after the step it depends on has finished. The definition below degrades a database in three stages:

# advanced-scenario.yaml
gremlin_scenario:
  name: "Multi-Stage Database Degradation"
  description: "Simulates progressive database failure over 4 minutes"
  steps:
    - name: increase-latency
      type: latency
      targets:
        - host: db-primary
      args:
        latency: 100
        port: 5432
      length: 60
    - name: add-cpu-pressure
      type: cpu
      targets:
        - host: db-primary
      args:
        cpuPercent: 50
      length: 120
      depends_on:
        - increase-latency
    - name: block-traffic
      type: blackhole
      targets:
        - host: db-replica
      args:
        port: 5432
      length: 60
      depends_on:
        - add-cpu-pressure

# Create the scenario from the YAML file above
gremlin scenario create \
  --name "Multi-Stage Database Degradation" \
  --description "Simulates progressive database failure" \
  --steps-file advanced-scenario.yaml

# Expected output:
# ✅ Scenario created: scenario-abc-123
# Steps:
#   1. increase-latency (60s)
#   2. add-cpu-pressure (120s) - wait for step 1
#   3. block-traffic (60s) - wait for step 2
# Total duration: 4 minutes

# Run the scenario
gremlin scenario run --scenario-id scenario-abc-123
# Expected output:
# ✅ Scenario 'Multi-Stage Database Degradation' is now running
# Step 1/3: increase-latency (running)
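
Before submitting a scenario, it can help to check locally that every depends_on entry names a real step and that the total run time is what you expect. The following is a minimal sketch in plain Python, not part of any Gremlin tooling: the STEPS dictionary and the plan_scenario function are hypothetical names that mirror the three steps defined above.

```python
from graphlib import TopologicalSorter

# Hypothetical in-memory copy of the steps from advanced-scenario.yaml.
STEPS = {
    "increase-latency": {"length": 60, "depends_on": []},
    "add-cpu-pressure": {"length": 120, "depends_on": ["increase-latency"]},
    "block-traffic": {"length": 60, "depends_on": ["add-cpu-pressure"]},
}


def plan_scenario(steps):
    """Return (step names in run order, finish time of the last step in seconds)."""
    for name, step in steps.items():
        for dep in step["depends_on"]:
            if dep not in steps:
                raise ValueError(f"step {name!r} depends on unknown step {dep!r}")

    # Map each step to the set of steps that must finish before it starts.
    graph = {name: set(step["depends_on"]) for name, step in steps.items()}
    # static_order() raises graphlib.CycleError if the dependencies form a loop.
    order = list(TopologicalSorter(graph).static_order())

    finish = {}
    for name in order:
        start = max((finish[dep] for dep in steps[name]["depends_on"]), default=0)
        finish[name] = start + steps[name]["length"]
    return order, max(finish.values(), default=0)


order, total_seconds = plan_scenario(STEPS)
print(order)
print(f"Total duration: {total_seconds // 60} minutes")
```

For the three steps above, the computed total is 240 seconds, which matches the 4 minutes reported when the scenario is created.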

Step 2: Run Container Attacks

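Gremlin can target individual containers instead of a whole host. The Gremlin agent still runs on the host (or as a DaemonSet on Kubernetes) and applies the attack to the selected container: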

# List running containers on a target host
gremlin containers
# Expected output:
# CONTAINER ID   IMAGE                    STATUS   HOST
# a1b2c3d4e5f6   dodatech/web-service     running  app-server-1
# f6e5d4c3b2a1   dodatech/auth-service    running  app-server-1
# c7d8e9a0b1c2   postgres:16              running  db-server-1

# Kill a specific container
gremlin attack container \
  --container-id a1b2c3d4e5f6 \
  --length 30

# Expected output:
# ✅ Container attack created
# Container: a1b2c3d4e5f6 (dodatech/web-service)
# Action: kill
# Duration: 30 seconds

# Verify the container restarts
docker ps --filter id=a1b2c3d4e5f6
# Expected output:
# CONTAINER ID   IMAGE                    STATUS              
# a1b2c3d4e5f6   dodatech/web-service     Up 2 seconds
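
A single manual check can miss a slow restart, so a recovery check is often written as a poll with a deadline. The sketch below is one possible shape for that, under stated assumptions: wait_until is a hypothetical helper, and fake_container_is_up stands in for a real probe such as parsing docker ps output or calling a health endpoint.

```python
import time


def wait_until(check, timeout_s=60.0, interval_s=2.0,
               clock=time.monotonic, sleep=time.sleep):
    """Call check() until it returns True or timeout_s elapses.

    Returns True if check() succeeded before the deadline, otherwise False.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Hypothetical probe: reports "up" on the third call to imitate a restart delay.
attempts = {"count": 0}


def fake_container_is_up():
    attempts["count"] += 1
    return attempts["count"] >= 3


recovered = wait_until(fake_container_is_up, timeout_s=10, interval_s=0.01)
print(f"recovered={recovered} after {attempts['count']} checks")
```

On Kubernetes a restarted container receives a new container ID, so a real probe there would probably match on pod name or label rather than on the ID used for the attack.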

Step 3: Automate with the Gremlin API

Use the Gremlin API to programmatically create and run attacks:

#!/usr/bin/env python3
"""Automated chaos experiment using Gremlin API."""
import requests
import os
import time

API_BASE = "https://api.gremlin.com/v1"
API_KEY = os.environ["GREMLIN_API_KEY"]
TEAM_ID = os.environ["GREMLIN_TEAM_ID"]

headers = {
    "Authorization": f"Key {API_KEY}",
    "Content-Type": "application/json"
}

def run_cpu_attack(host, cpu_percent, duration_seconds):
    payload = {
        "target": {
            "type": "host",
            "exact": [host]
        },
        "attackType": "cpu",
        "args": {
            "cpuPercent": cpu_percent,
            "duration": duration_seconds
        }
    }

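    response = requests.post(
        f"{API_BASE}/attacks/new",
        headers=headers,
        json=payload,
        timeout=30,  # avoid hanging forever on a stalled connection
    )
    response.raise_for_status()  # surface HTTP errors instead of parsing an error body
    return response.json()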

def run_latency_attack(host, latency_ms, port, duration_seconds):
    payload = {
        "target": {
            "type": "host",
            "exact": [host]
        },
        "attackType": "latency",
        "args": {
            "latency": latency_ms,
            "port": port,
            "duration": duration_seconds
        }
    }

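    response = requests.post(
        f"{API_BASE}/attacks/new",
        headers=headers,
        json=payload,
        timeout=30,  # avoid hanging forever on a stalled connection
    )
    response.raise_for_status()  # surface HTTP errors instead of parsing an error body
    return response.json()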

# Execute a sequential experiment
print("Starting CPU attack on app-server-1...")
result = run_cpu_attack("app-server-1", 80, 60)
print(f"Attack ID: {result.get('attackId')}")
time.sleep(60)

print("Starting latency attack on app-server-1...")
result = run_latency_attack("app-server-1", 300, 8080, 60)
print(f"Attack ID: {result.get('attackId')}")
time.sleep(60)

print("Experiment sequence finished.")

# Expected output:
# Starting CPU attack on app-server-1...
# Attack ID: cpu-attack-abc-001
# Starting latency attack on app-server-1...
# Attack ID: lat-attack-abc-002
# Experiment sequence finished.
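
A sequential experiment is easier to trust when a failed launch stops the remaining steps and the outcome of each step is recorded. The following sketch shows one way to structure that. Step, run_sequence, and fake_launch are hypothetical names; fake_launch stands in for the HTTP call so the control flow can be tried without an API key.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    attack_type: str
    args: dict
    duration_s: int


def run_sequence(steps, launch, sleep=time.sleep):
    """Launch steps one at a time and stop at the first launch error.

    launch(step) must return an attack ID or raise an exception.
    Returns a list of (step name, attack ID or None, error text or None).
    """
    results = []
    for step in steps:
        try:
            attack_id = launch(step)
        except Exception as exc:  # record the failure and skip the remaining steps
            results.append((step.name, None, str(exc)))
            break
        results.append((step.name, attack_id, None))
        sleep(step.duration_s)  # wait for the attack window to pass
    return results


def fake_launch(step):
    # Hypothetical stand-in for the API call; fails on latency to show the abort path.
    if step.attack_type == "latency":
        raise RuntimeError("simulated API error")
    return f"{step.attack_type}-attack-001"


plan = [
    Step("cpu", "cpu", {"cpuPercent": 80}, 60),
    Step("latency", "latency", {"latency": 300, "port": 8080}, 60),
    Step("cpu-again", "cpu", {"cpuPercent": 50}, 30),
]
results = run_sequence(plan, fake_launch, sleep=lambda seconds: None)
for name, attack_id, error in results:
    print(name, attack_id, error)
```

Passing launch and sleep as parameters keeps the runner testable: in real use, launch would wrap a function like run_cpu_attack and sleep would be left at its default.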

Step 4: Configure Team Permissions and Safety Controls

Set up role-based access for different teams:

# Create a team with restricted blast radius
gremlin team create \
  --name "backend-team" \
  --max-attack-targets 3 \
  --max-attack-duration 300 \
  --allowed-attacks "cpu,latency,process-kill"

# Expected output:
# ✅ Team 'backend-team' created
# Restrictions:
#   - Max targets: 3
#   - Max duration: 300 seconds
#   - Allowed attacks: cpu, latency, process-kill

# Add a halt-on-alert integration
gremlin integration create \
  --type pagerduty \
  --name "Chaos Alerts" \
  --routing-key "chaos-engineering"

# Expected output:
# ✅ PagerDuty integration 'Chaos Alerts' created
# Halt-on-alert will stop all active attacks when a PagerDuty incident is triggered.
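
Automation scripts can mirror the team restrictions locally so that an out-of-policy attack is caught before any request is sent. This is a minimal sketch assuming the three limits shown above (maximum targets, maximum duration, allowed attack types). TeamLimits and check_attack are hypothetical names, and the check does not replace enforcement on the Gremlin side.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TeamLimits:
    max_targets: int
    max_duration_s: int
    allowed_attacks: frozenset


def check_attack(limits, attack_type, targets, duration_s):
    """Return a list of violations; an empty list means the attack fits the limits."""
    problems = []
    if attack_type not in limits.allowed_attacks:
        problems.append(f"attack type {attack_type!r} is not allowed")
    if len(targets) > limits.max_targets:
        problems.append(
            f"{len(targets)} targets exceeds the limit of {limits.max_targets}"
        )
    if duration_s > limits.max_duration_s:
        problems.append(
            f"{duration_s}s exceeds the limit of {limits.max_duration_s}s"
        )
    return problems


# Values copied from the backend-team example above.
backend_team = TeamLimits(3, 300, frozenset({"cpu", "latency", "process-kill"}))

within_limits = check_attack(backend_team, "cpu", ["app-1", "app-2"], 120)
over_limits = check_attack(backend_team, "blackhole", ["a", "b", "c", "d", "e"], 600)
print(within_limits)
for problem in over_limits:
    print(problem)
```

Returning all violations at once, rather than stopping at the first, lets a scenario author fix target count, duration, and attack type in one pass.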

Learning Path

flowchart LR
  A[Gremlin Basics] --> B[Gremlin Advanced]
  B --> C[AWS Chaos Pipeline]
  C --> D[Azure Chaos Pipeline]
  D --> E[Chaos Observability]
  style B fill:#f90,color:#fff

Common Errors

  1. Scenario steps with insufficient duration for recovery: If step 1 causes a pod restart and step 2 starts before the pod is ready, the combined effect may be more severe than intended. Leave a recovery gap or a health check between such steps.
  2. Container attacks on ephemeral containers: An orchestrator such as Kubernetes may replace a container before the attack duration expires. The replacement gets a new container ID, so an attack pinned to the old ID no longer affects the running workload.
  3. API rate limits on automated experiments: The Gremlin API has rate limits. Space out batched requests, and back off and retry when a request is rejected for exceeding the limit.
  4. Team permission misconfiguration: If a team is restricted to 3 targets but a scenario targets 5, the scenario will fail. Align scenario design with team permissions.
  5. Not testing halt-on-alert integrations: Configure a test alert to verify that halt-on-alert actually stops running attacks. This is one integration you should test regularly.
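
For the rate-limit case in item 3, a common pattern is to retry with exponential backoff when the server answers with HTTP 429 (Too Many Requests). The sketch below assumes the API signals rate limiting with status 429, which is the conventional code but should be confirmed against the Gremlin API reference. call_with_backoff is a hypothetical helper, and the send callable stands in for the real request.

```python
import random
import time


def call_with_backoff(send, max_attempts=5, base_delay_s=1.0, sleep=time.sleep):
    """Call send() until it returns a non-429 status or attempts run out.

    send() must return a (status_code, body) tuple.
    Returns the last (status_code, body) received.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429 or attempt == max_attempts - 1:
            return status, body
        # Exponential backoff with jitter: 1s, 2s, 4s, ... plus up to 25% extra.
        delay = base_delay_s * (2 ** attempt)
        sleep(delay + random.uniform(0, delay * 0.25))


# Simulated server: rate-limited twice, then successful.
responses = iter([(429, ""), (429, ""), (200, "attack-id")])
delays = []
status, body = call_with_backoff(lambda: next(responses), sleep=delays.append)
print(status, body, len(delays))
```

With the requests library, send could be a small function that posts the payload and returns the response's status_code and text. The jitter spreads out retries when several scripts hit the limit at the same moment.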

Practice Questions

  1. How do Gremlin scenarios differ from single attacks?
  2. What is the difference between host-level and container-level attacks in Gremlin?
  3. How do you authenticate with the Gremlin API for automation?
  4. What team-level safety controls does Gremlin provide?
  5. How does halt-on-alert integration work?

Challenge

Build a Python script that uses the Gremlin API to run a weekly "deployment validation" experiment. The script should: authenticate with the API, run a sequence of three attacks (CPU 50 percent for 30s, latency 200ms for 60s, process kill for 10s) against a staging server, wait for all attacks to complete, check the experiment status, and report pass or fail based on whether any halts were triggered.

FAQ

What is a Gremlin scenario?

A scenario is a sequence of attack steps that run in order, where each step can depend on the previous step completing, enabling complex failure simulations.

Can Gremlin attack Docker containers?

Yes. Gremlin can attack individual containers by container ID, including stopping, killing, or injecting network faults into containers without affecting the host.

How do I automate chaos experiments with Gremlin?

Use the Gremlin REST API with API key authentication. The API supports creating, listing, and stopping attacks and scenarios programmatically.

What team permissions does Gremlin support?

Gremlin supports team-level restrictions on maximum targets, maximum attack duration, and allowed attack types, plus halt-on-alert integrations with monitoring tools.

Can Gremlin halt attacks automatically?

Yes. Gremlin integrates with PagerDuty, Datadog, and other monitoring platforms via halt-on-alert, which stops all active attacks when an alert fires.
