Gremlin Platform — Managed Chaos Engineering for Production Systems
This tutorial walks through the Gremlin platform with hands-on examples, from installing the client to wiring experiments into a deployment pipeline.
Gremlin is a managed Chaos Engineering platform that provides a comprehensive set of Fault Injection capabilities for production and staging environments. It offers CPU, memory, disk, network, and stateful fault attacks with built-in safety controls that make Chaos Engineering accessible to teams without deep infrastructure expertise.
What You Will Learn
This tutorial teaches you how to use the Gremlin platform to run safe chaos experiments, create custom attack scenarios, use the Gremlin API for programmatic experiments, and integrate with CI/CD pipelines.
Why It Matters
Gremlin removes the operational overhead of running Chaos Engineering tools. It provides a web UI, CLI, and API with built-in safety features such as halt conditions, team permissions, and Blast Radius controls. Teams can start running experiments in minutes without managing infrastructure.
Real-World Use
DodaTech uses Gremlin to run weekly chaos experiments on the backend services powering DodaZIP. The Gremlin API is integrated into the deployment pipeline so every release automatically triggers a set of resilience checks before promotion to production.
Prerequisites
Before starting you should understand:
- Chaos Engineering core concepts and experiment design
- Basic Linux administration and command-line skills
- Docker for local testing with Gremlin
- CI/CD pipeline fundamentals
Step 1: Install the Gremlin Agent
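The Gremlin agent runs on each target host and executes attacks. The commands below install the Gremlin client with the install script, authenticate it with your team credentials, and check its status, which also reports whether an agent is present on the host.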
# Install Gremlin client
curl -sSL https://cli.gremlin.com/install.sh | sudo bash
# Authenticate with your team credentials
gremlin login --team "your-team-id" --password "your-api-key"
# Verify installation
gremlin status
Expected output:
Gremlin Client: 3.2.1
API Endpoint: api.gremlin.com
Authenticated: yes
Team: dodatech
Agent: not installed on this host
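Before scheduling attacks from a script, it can help to check this status output automatically. The following is a minimal sketch that parses text in the "Key: value" layout shown above. The layout is taken from this tutorial's sample output, not from a documented format, and parse_status and ready_for_attacks are hypothetical helper names.

```python
def parse_status(text: str) -> dict[str, str]:
    """Parse 'Key: value' lines into a dict (layout assumed from the sample output)."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # ignore lines that contain no colon
            fields[key.strip()] = value.strip()
    return fields


def ready_for_attacks(fields: dict[str, str]) -> bool:
    """Return True only if the client is authenticated and an agent is present."""
    authenticated = fields.get("Authenticated") == "yes"
    agent = fields.get("Agent", "not installed")
    return authenticated and not agent.startswith("not installed")


# Sample text copied from the expected output above.
SAMPLE = """\
Gremlin Client: 3.2.1
API Endpoint: api.gremlin.com
Authenticated: yes
Team: dodatech
Agent: not installed on this host
"""

status = parse_status(SAMPLE)
print(status["Team"])             # dodatech
print(ready_for_attacks(status))  # False: no agent on this host yet
```

In a real pipeline the text would come from running the status command, for example with subprocess.run(..., capture_output=True, text=True).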
Step 2: Run a CPU Attack
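Gremlin CPU attacks consume a specified percentage of one or more CPU cores for a fixed duration. A short, low-intensity CPU attack is a safe first experiment because its effect is easy to observe and ends when the attack does.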
# Run a CPU attack consuming 50% of one core for 60 seconds
gremlin attack cpu \
--length 60 \
--cores 1 \
--percent 50
# Expected output:
# Attack: cpu-abc123def456
# Status: running (60s remaining)
# Monitor CPU usage on the target
top -b -n1 | head -10
# Expected output showing the attack process at about 50% CPU:
# PID USER %CPU COMMAND
# 12345 root 50.0 gremlin-cpu-hog
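When the same attack is launched from a script, validating the parameters first guards against typos such as a percentage above 100. The sketch below builds the argument list for the command shown above, reusing only the flags that appear in this tutorial. The build_cpu_attack_command name and the 300-second cap are assumptions made for illustration.

```python
import shlex


def build_cpu_attack_command(length: int, cores: int, percent: int) -> list[str]:
    """Build the argument list for the CPU attack shown above, rejecting unsafe values."""
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    if not 1 <= length <= 300:  # the 300 s cap is an arbitrary guardrail for this sketch
        raise ValueError("length must be between 1 and 300 seconds")
    return [
        "gremlin", "attack", "cpu",
        "--length", str(length),
        "--cores", str(cores),
        "--percent", str(percent),
    ]


command = build_cpu_attack_command(length=60, cores=1, percent=50)
print(shlex.join(command))
# prints: gremlin attack cpu --length 60 --cores 1 --percent 50
```

Passing the list to subprocess.run(command, check=True) would execute it without going through a shell.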
Step 3: Create a Network Latency Attack
Simulate network delays between services to test timeout and retry behavior.
# gremlin-network-attack.yaml
apiVersion: v1
kind: attack
target:
  type: host
  labels:
    service: payment-api
attack:
  type: network
  command: latency
  parameters:
    latency: 500
    jitter: 50
    length: 120
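# Apply the network attack using the Gremlin API.
# The request declares a JSON body, so convert the YAML file to JSON first
# (this one-liner assumes PyYAML is installed).
python3 -c 'import json, sys, yaml; json.dump(yaml.safe_load(sys.stdin), sys.stdout)' \
< gremlin-network-attack.yaml |
curl -s -X POST https://api.gremlin.com/v1/attacks \
-H "Authorization: Bearer $(gremlin api-key)" \
-H "Content-Type: application/json" \
--data-binary @-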
# Expected output:
# {"attackId":"net-xyz789","status":"running","duration":120}
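The request body can also be built directly in code, without keeping a separate definition file. The sketch below produces a JSON document with the same fields as the attack definition above. The field names come from this tutorial, the millisecond and second units are assumed from the surrounding text, and build_latency_attack is a hypothetical helper.

```python
import json


def build_latency_attack(service: str, latency_ms: int, jitter_ms: int, length_s: int) -> dict:
    """Return a payload mirroring the latency attack definition used in this step."""
    if latency_ms <= 0 or jitter_ms < 0 or length_s <= 0:
        raise ValueError("latency and length must be positive; jitter must not be negative")
    return {
        "apiVersion": "v1",
        "kind": "attack",
        "target": {"type": "host", "labels": {"service": service}},
        "attack": {
            "type": "network",
            "command": "latency",
            "parameters": {
                "latency": latency_ms,  # assumed to be milliseconds
                "jitter": jitter_ms,    # assumed to be milliseconds
                "length": length_s,     # assumed to be seconds
            },
        },
    }


payload = build_latency_attack("payment-api", latency_ms=500, jitter_ms=50, length_s=120)
body = json.dumps(payload, indent=2)
print(body)
```

The resulting string is what a client would send as the request body with a JSON content type.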
Step 4: Set Halt Conditions
Halt conditions automatically stop attacks when defined metrics exceed thresholds.
# Create an attack with a halt condition
cat <<'EOF' | gremlin attack create
{
  "target": {"type": "host", "labels": {"app": "web"}},
  "attack": {
    "type": "cpu",
    "command": "cpu",
    "parameters": {"cores": 2, "percent": 80, "length": 300}
  },
  "haltConditions": [
    {
      "type": "metric",
      "metricName": "cpu_idle",
      "operator": "lt",
      "value": 10
    }
  ]
}
EOF
Expected output:
Attack: cpu-halt-001
Halt condition: cpu_idle < 10%
Auto-stop: enabled
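To make the semantics of a halt condition concrete, the sketch below evaluates conditions in the shape used above against a sample of metrics. It illustrates the comparison logic only and is not a description of how Gremlin evaluates conditions internally. Only the "lt" operator appears in this tutorial, so the other operator names, and the should_halt helper itself, are assumptions.

```python
import operator

# Only "lt" appears in the tutorial; the other operator names are assumptions.
OPERATORS = {
    "lt": operator.lt,
    "lte": operator.le,
    "gt": operator.gt,
    "gte": operator.ge,
}


def should_halt(conditions: list[dict], metrics: dict[str, float]) -> bool:
    """Return True if any halt condition is met by the current metric sample."""
    for condition in conditions:
        name = condition["metricName"]
        if name not in metrics:
            return True  # fail safe: a missing metric halts the attack
        compare = OPERATORS[condition["operator"]]
        if compare(metrics[name], condition["value"]):
            return True
    return False


halt_conditions = [
    {"type": "metric", "metricName": "cpu_idle", "operator": "lt", "value": 10}
]
print(should_halt(halt_conditions, {"cpu_idle": 35.0}))  # False: idle CPU is above 10%
print(should_halt(halt_conditions, {"cpu_idle": 4.0}))   # True: idle CPU fell below 10%
```

Treating a missing metric as a reason to halt is a deliberate choice in this sketch: if monitoring goes dark during an experiment, stopping is the safer default.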
Step 5: Integrate with CI/CD Using the API
Trigger chaos experiments programmatically as part of your deployment pipeline.
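import os
import sys

import requests

# Read the API key from the environment so it never lands in source control.
api_key = os.environ["GREMLIN_API_KEY"]
headers = {"Authorization": f"Bearer {api_key}"}

# Inject 200 ms of latency into payment-service hosts for 60 seconds.
attack = {
    "target": {"type": "host", "labels": {"app": "payment-service"}},
    "attack": {
        "type": "network",
        "command": "latency",
        "parameters": {"latency": 200, "length": 60},
    },
}

resp = requests.post(
    "https://api.gremlin.com/v1/attacks",
    json=attack,
    headers=headers,
    timeout=10,  # fail fast instead of hanging the pipeline
)

if resp.status_code == 201:
    attack_id = resp.json()["attackId"]
    print(f"Attack started: {attack_id}")
    print(f"Duration: {attack['attack']['parameters']['length']} seconds")
else:
    print(f"Failed to start attack: {resp.text}")
    sys.exit(1)  # a non-zero exit code blocks promotion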
Expected output:
Attack started: net-abc123
Duration: 60 seconds
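A pipeline usually needs to wait until the attack has ended before it evaluates the resilience checks. The sketch below polls for the attack status until it is no longer "running" or a timeout expires. The "running" value comes from the sample responses in this tutorial; "finished" and the fetch_status callable are hypothetical, since the status endpoint is not shown here. Injecting the fetch and sleep functions keeps the logic testable without network access.

```python
import time
from typing import Callable


def wait_for_attack(
    fetch_status: Callable[[str], str],
    attack_id: str,
    timeout_s: float = 120.0,
    poll_interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the attack leaves the 'running' state; raise TimeoutError otherwise."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status(attack_id)
        if status != "running":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"attack {attack_id} still running after {timeout_s}s")
        sleep(poll_interval_s)


# Demo with a fake status source standing in for an API call.
responses = iter(["running", "running", "finished"])
final = wait_for_attack(lambda _id: next(responses), "net-abc123", sleep=lambda _s: None)
print(final)  # finished
```

In a deployment job, a TimeoutError would typically be treated as a failed check so the release is not promoted while an attack is still active.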
Learning Path
flowchart LR
    A[LitmusChaos] --> B[Gremlin]
    B --> C[AWS Fault Injection]
    C --> D[Azure Chaos Studio]
    D --> E[Fault Injection Proxy]
    style B fill:#f90,color:#fff
Common Errors
- Running attacks without halt conditions on production hosts: Always configure at least one halt condition based on CPU, memory, or error rate metrics before running production experiments.
- Attacking hosts without verifying the Gremlin agent is running: The agent must be online and authenticated. Check `gremlin status` before creating attacks.
- Using overly broad target labels: Specific labels prevent accidental attacks on the wrong hosts. Use `app: my-service`, not `env: production`.
- Setting attack durations too long without halt conditions: Long attacks without safety nets can cause cascading failures. Start with 30-60 second durations.
- Forgetting to clean up after API-based experiments: Track attack IDs and have a rollback script ready. Use `gremlin attack stop <id>` to terminate early.
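The cleanup advice in the last item can be sketched as a context manager that records every attack ID and stops each one on exit, even if the script fails partway through. This is one possible pattern, not a Gremlin feature. The stop_attack callable is hypothetical and would wrap whatever stop mechanism you use, such as the stop command mentioned above.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def tracked_attacks(stop_attack: Callable[[str], None]) -> Iterator[list[str]]:
    """Yield a list for recording attack IDs; stop every recorded attack on exit."""
    started: list[str] = []
    try:
        yield started
    finally:
        # Stop in reverse order of creation; keep going if one stop call fails.
        for attack_id in reversed(started):
            try:
                stop_attack(attack_id)
            except Exception as exc:
                print(f"could not stop {attack_id}: {exc}")


stopped = []
try:
    with tracked_attacks(stopped.append) as attacks:
        attacks.append("cpu-abc123")
        attacks.append("net-xyz789")
        raise RuntimeError("pipeline step failed")
except RuntimeError:
    pass
print(stopped)  # ['net-xyz789', 'cpu-abc123']
```

Because the cleanup runs in a finally block, both attacks are stopped even though the simulated pipeline step raised an error.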
Practice Questions
- What safety controls does the Gremlin platform provide for production experiments?
- How do you target a specific host or group of hosts in a Gremlin attack?
- What is the purpose of halt conditions and how do you configure them?
- How do you run a Gremlin attack programmatically using the API?
- What network attack types are available in the Gremlin platform?
Challenge
Create a Gremlin experiment that injects 300ms of latency into the payment-service host group for 90 seconds. Configure a halt condition that stops the attack if the CPU idle metric drops below 15 percent. Use the Gremlin API to start the attack, monitor it programmatically, and verify it stops correctly.
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.