Gremlin Platform — Managed Chaos Engineering for Production Systems

DodaTech Updated 2026-06-23 5 min read

This tutorial introduces the Gremlin platform, covering its core concepts, worked examples, and practices for running chaos experiments safely.

Gremlin is a managed Chaos Engineering platform that provides Fault Injection for production and staging environments. It offers CPU, memory, disk, network, and state attacks with built-in safety controls, which makes Chaos Engineering accessible to teams without deep infrastructure expertise.

What You Will Learn

This tutorial teaches you how to use the Gremlin platform to run safe chaos experiments, create custom attack scenarios, use the Gremlin API for programmatic experiments, and integrate with CI/CD pipelines.

Why It Matters

Gremlin removes the operational overhead of running Chaos Engineering tools. It provides a web UI, CLI, and API with built-in safety features such as halt conditions, team permissions, and Blast Radius controls. Teams can start running experiments in minutes without managing infrastructure.

Real-World Use

DodaTech uses Gremlin to run weekly chaos experiments on the backend services powering DodaZIP. The Gremlin API is integrated into the deployment pipeline so every release automatically triggers a set of resilience checks before promotion to production.

Prerequisites

Before starting you should understand:

  • Chaos Engineering core concepts and experiment design
  • Basic Linux administration and command-line skills
  • Docker for local testing with Gremlin
  • CI/CD pipeline fundamentals

Step 1: Install the Gremlin Agent

The Gremlin agent runs on each target host and executes attacks. Install it using the Gremlin CLI or package manager.

# Install Gremlin client
curl -sSL https://cli.gremlin.com/install.sh | sudo bash

# Authenticate with your team credentials
gremlin login --team "your-team-id" --password "your-api-key"

# Verify installation
gremlin status

Expected output:

Gremlin Client: 3.2.1
API Endpoint: api.gremlin.com
Authenticated: yes
Team: dodatech
Agent: not installed on this host
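
A deployment script can check this output before it creates any attacks. The sketch below is a minimal parser, assuming the `Key: value` layout shown in the expected output above. The function names `parse_status` and `ready_for_attacks` are hypothetical, and real CLI output may differ between versions.

```python
def parse_status(text: str) -> dict[str, str]:
    """Parse 'Key: value' lines into a dict (layout assumed from the sample output)."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that contain no colon
            fields[key.strip()] = value.strip()
    return fields


def ready_for_attacks(fields: dict[str, str]) -> bool:
    """True only when the client is authenticated and an agent is present."""
    authenticated = fields.get("Authenticated", "").lower() == "yes"
    agent = fields.get("Agent", "not installed").lower()
    return authenticated and not agent.startswith("not installed")


sample = """Gremlin Client: 3.2.1
API Endpoint: api.gremlin.com
Authenticated: yes
Team: dodatech
Agent: not installed on this host"""

status = parse_status(sample)
print(status["Team"])             # dodatech
print(ready_for_attacks(status))  # False: no agent on this host yet
```

Failing the check early is cheaper than discovering mid-pipeline that an attack has no agent to run on.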

Step 2: Run a CPU Attack

Gremlin CPU attacks consume a specified percentage of capacity on a chosen number of CPU cores. This makes a low-risk first experiment because the load is bounded and ends when the attack stops.

# Run a CPU attack consuming 50% of one core for 60 seconds
gremlin attack cpu \
  --length 60 \
  --cores 1 \
  --percent 50

# Expected output:
# Attack: cpu-abc123def456
# Status: running (60s remaining)

# Monitor CPU usage on the target
top -b -n1 | head -10

# Expected output showing high CPU:
# PID   USER    %CPU  COMMAND
# 12345 root    50.0  gremlin-cpu-hog
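
To judge how much headroom an attack leaves, it can help to convert the per-core settings into a share of the whole host. The helper below is a small illustration and not part of Gremlin; `host_cpu_load` and the 4-core host are assumptions made for the example.

```python
def host_cpu_load(cores_attacked: int, percent: float, total_cores: int) -> float:
    """Return the share of total host CPU (0-100) the attack is expected to consume."""
    if not 0 < cores_attacked <= total_cores:
        raise ValueError("cores_attacked must be between 1 and total_cores")
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return cores_attacked * percent / total_cores


# The attack above (1 core at 50%) on a hypothetical 4-core host
print(host_cpu_load(1, 50, 4))  # 12.5
# A heavier attack (2 cores at 80%) on the same host
print(host_cpu_load(2, 80, 4))  # 40.0
```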

Step 3: Create a Network Latency Attack

Simulate network delays between services to test timeout and retry behavior.

# Write the attack definition as JSON so it matches the request's Content-Type
cat > gremlin-network-attack.json <<'EOF'
{
  "apiVersion": "v1",
  "kind": "attack",
  "target": {"type": "host", "labels": {"service": "payment-api"}},
  "attack": {
    "type": "network",
    "command": "latency",
    "parameters": {"latency": 500, "jitter": 50, "length": 120}
  }
}
EOF

# Apply the network attack using the Gremlin API
curl -s -X POST https://api.gremlin.com/v1/attacks \
  -H "Authorization: Bearer $GREMLIN_API_KEY" \
  -H "Content-Type: application/json" \
  -d @gremlin-network-attack.json

# Expected output:
# {"attackId":"net-xyz789","status":"running","duration":120}
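
What this attack exercises is the caller's timeout and retry logic. As a rough illustration, the sketch below simulates a dependency with injected latency and a caller that gives up after a per-call timeout and retries with exponential backoff. Everything here (`call_with_retries`, `make_slow_dependency`, the 0.2 s timeout) is hypothetical and runs locally without Gremlin.

```python
import time


def call_with_retries(call, timeout_s, max_attempts=3, backoff_s=0.01, sleep=time.sleep):
    """Invoke call(timeout_s); retry on TimeoutError with exponential backoff.

    Returns a (result, attempts_used) tuple, or re-raises after the last attempt.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call(timeout_s), attempt
        except TimeoutError:
            if attempt == max_attempts:
                raise
            sleep(backoff_s * 2 ** (attempt - 1))


def make_slow_dependency(latencies_s):
    """Build a fake dependency whose successive calls take the given latencies."""
    remaining = iter(latencies_s)

    def dependency(timeout_s):
        latency = next(remaining)
        if latency > timeout_s:
            # Simulates a client-side timeout firing before the reply arrives
            raise TimeoutError(f"no reply within {timeout_s}s")
        return "ok"

    return dependency


# Two calls hit 500 ms of injected latency, the third is back to 20 ms
dependency = make_slow_dependency([0.5, 0.5, 0.02])
result, attempts = call_with_retries(dependency, timeout_s=0.2)
print(result, attempts)  # ok 3
```

If the latency outlasts every retry, the caller sees the `TimeoutError`, which is the failure mode the experiment is meant to surface.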

Step 4: Set Halt Conditions

Halt conditions automatically stop attacks when defined metrics exceed thresholds.

# Create an attack with a halt condition
cat <<'EOF' | gremlin attack create
{
  "target": {"type": "host", "labels": {"app": "web"}},
  "attack": {
    "type": "cpu",
    "command": "cpu",
    "parameters": {"cores": 2, "percent": 80, "length": 300}
  },
  "haltConditions": [
    {
      "type": "metric",
      "metricName": "cpu_idle",
      "operator": "lt",
      "value": 10
    }
  ]
}
EOF

Expected output:

Attack: cpu-halt-001
Halt condition: cpu_idle < 10%
Auto-stop: enabled
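
Conceptually, a halt condition is a comparison between a live metric and a threshold. The sketch below evaluates the condition from the JSON above against sample readings. It illustrates the idea only and is not Gremlin's implementation; the operator names other than `lt` and the `should_halt` helper are assumptions.

```python
import operator

# Comparison operators; only "lt" appears in the example above, the rest are assumed
OPERATORS = {
    "lt": operator.lt,
    "lte": operator.le,
    "gt": operator.gt,
    "gte": operator.ge,
}


def should_halt(conditions, metrics):
    """Return True if any halt condition is met by the current metric readings."""
    for cond in conditions:
        reading = metrics.get(cond["metricName"])
        if reading is None:
            continue  # metric not reported; this sketch ignores it
        if OPERATORS[cond["operator"]](reading, cond["value"]):
            return True
    return False


halt_conditions = [
    {"type": "metric", "metricName": "cpu_idle", "operator": "lt", "value": 10}
]

print(should_halt(halt_conditions, {"cpu_idle": 35.0}))  # False
print(should_halt(halt_conditions, {"cpu_idle": 7.5}))   # True
```

A stricter variant would treat a missing metric as a reason to halt, on the basis that an experiment without monitoring data is running blind.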

Step 5: Integrate with CI/CD Using the API

Trigger chaos experiments programmatically as part of your deployment pipeline.

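import os
import sys

import requests

# Read the API key from the environment so it never lands in source control
api_key = os.environ["GREMLIN_API_KEY"]
headers = {"Authorization": f"Bearer {api_key}"}

# Inject 200 ms of latency into payment-service hosts for 60 seconds
attack = {
    "target": {"type": "host", "labels": {"app": "payment-service"}},
    "attack": {
        "type": "network",
        "command": "latency",
        "parameters": {"latency": 200, "length": 60}
    }
}

resp = requests.post(
    "https://api.gremlin.com/v1/attacks",
    json=attack,
    headers=headers,
    timeout=10,  # fail fast instead of hanging the pipeline
)

if resp.status_code == 201:
    attack_id = resp.json()["attackId"]
    duration = attack["attack"]["parameters"]["length"]
    print(f"Attack started: {attack_id}")
    print(f"Duration: {duration} seconds")
else:
    print(f"Failed to start attack: {resp.text}")
    sys.exit(1)  # non-zero exit fails the CI job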

Expected output:

Attack started: net-abc123
Duration: 60 seconds
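
A pipeline usually needs to wait for the attack to finish before deciding whether to promote the release. The sketch below polls a status callback until the attack leaves the `running` state or a deadline passes. The callback, the `finished` status value, and the `wait_for_attack` name are assumptions; in a real pipeline the callback would query the Gremlin API for the attack ID returned above.

```python
import time


def wait_for_attack(get_status, attack_id, timeout_s=120.0, poll_s=5.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll get_status(attack_id) until it stops returning 'running'.

    Returns the final status, or raises TimeoutError if the deadline passes.
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status(attack_id)
        if status != "running":
            return status
        if clock() >= deadline:
            raise TimeoutError(f"attack {attack_id} still running after {timeout_s}s")
        sleep(poll_s)


# Stand-in for an API call: reports 'running' twice, then 'finished'
responses = iter(["running", "running", "finished"])
final = wait_for_attack(lambda _id: next(responses), "net-abc123",
                        sleep=lambda s: None)
print(final)  # finished
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real delays. Set the deadline somewhat longer than the attack's own length so a normal run does not trip it.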

Learning Path

LitmusChaos → Gremlin (this tutorial) → AWS Fault Injection → Azure Chaos Studio → Fault Injection Proxy

Common Errors

  1. Running attacks without halt conditions on production hosts: Always configure at least one halt condition based on CPU, memory, or error rate metrics before running production experiments.
  2. Attacking hosts without verifying the Gremlin agent is running: The agent must be online and authenticated. Check gremlin status before creating attacks.
  3. Using overly broad target labels: Specific labels prevent accidental attacks on the wrong hosts. Use app: my-service not env: production.
  4. Setting attack durations too long without halt conditions: Long attacks without safety nets can cause cascading failures. Start with 30-60 second durations.
  5. Forgetting to clean up after API-based experiments: Track attack IDs and have a rollback script ready. Use gremlin attack stop <id> to terminate early.
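
Several of these mistakes can be caught before an attack is ever submitted. The following pre-flight check is one possible sketch: it inspects an attack definition shaped like the JSON used in Step 4 and reports missing halt conditions, broad labels, and long unguarded durations. The `preflight` function, the set of labels treated as too broad, and the 60-second limit are illustrative choices, not Gremlin rules.

```python
BROAD_LABEL_KEYS = {"env", "environment", "region"}  # assumed examples of broad labels
MAX_UNGUARDED_SECONDS = 60  # upper end of the 30-60 second starting range


def preflight(definition: dict) -> list[str]:
    """Return a list of problems found in an attack definition (empty if none)."""
    problems = []
    labels = definition.get("target", {}).get("labels", {})
    halts = definition.get("haltConditions", [])
    length = definition.get("attack", {}).get("parameters", {}).get("length", 0)

    if not labels:
        problems.append("no target labels: attack is not scoped")
    elif set(labels) <= BROAD_LABEL_KEYS:
        problems.append("target labels are too broad")
    if not halts:
        problems.append("no halt conditions configured")
        if length > MAX_UNGUARDED_SECONDS:
            problems.append(
                f"duration {length}s is long for an attack without halt conditions"
            )
    return problems


risky = {
    "target": {"type": "host", "labels": {"env": "production"}},
    "attack": {"type": "cpu", "command": "cpu",
               "parameters": {"cores": 2, "percent": 80, "length": 300}},
}
for problem in preflight(risky):
    print(problem)
```

Running a check like this as the first pipeline step means a risky definition stops the job before it reaches any host.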

Practice Questions

  1. What safety controls does the Gremlin platform provide for production experiments?
  2. How do you target a specific host or group of hosts in a Gremlin attack?
  3. What is the purpose of halt conditions and how do you configure them?
  4. How do you run a Gremlin attack programmatically using the API?
  5. What network attack types are available in the Gremlin platform?

Challenge

Create a Gremlin experiment that injects 300ms of latency into the payment-service host group for 90 seconds. Configure a halt condition that stops the attack if the CPU idle metric drops below 15 percent. Use the Gremlin API to start the attack, monitor it programmatically, and verify it stops correctly.

FAQ

What is the Gremlin Chaos Engineering platform?

Gremlin is a managed Chaos Engineering platform offering CPU, memory, disk, network, and state attacks with built-in safety controls, a web UI, CLI, and REST API.

How does Gremlin ensure experiment safety?

Gremlin provides halt conditions based on metric thresholds, team-based permissions, Blast Radius controls, and the ability to stop any attack immediately from the UI or CLI.

Can Gremlin run experiments on containers and Kubernetes?

Yes. Gremlin supports Docker containers, Kubernetes pods, and EC2 instances through its agent-based architecture. Agents run on each target host.

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
