Gremlin Platform — Managed Chaos Engineering Service
This tutorial introduces the Gremlin platform through its key concepts, practical examples, and best practices.
Gremlin is a managed Chaos Engineering platform that provides safe, controlled failure injection for both Kubernetes and traditional infrastructure. Gremlin offers a web console, CLI, and API for running experiments with built-in safety controls.
What You Will Learn
This tutorial teaches you how to use the Gremlin platform to inject faults, create scenarios, and run experiments with automated guardrails and team collaboration.
Why It Matters
Gremlin abstracts away the complexity of building your own Chaos Engineering infrastructure. It provides a curated set of attack types — from CPU exhaustion to blackhole networking — with safety controls that prevent experiments from going out of control.
Real-World Use
DodaTech uses Gremlin for chaos experiments on legacy infrastructure that does not run on Kubernetes. The Gremlin agent supports Linux and Windows hosts, which makes it possible to run chaos experiments on the bare-metal servers that host Durga Antivirus Pro scanning nodes.
Prerequisites
Before starting you should understand:
- Chaos Engineering concepts (steady state, hypothesis, blast radius)
- How to install and configure software agents on Linux
- Basic networking concepts (latency, packet loss, bandwidth)
Step 1: Install the Gremlin Agent
Sign up for a Gremlin account and install the agent on your target infrastructure:
# Install Gremlin agent on Ubuntu/Debian
curl -fsSL https://www.gremlin.com/install/ubuntu.sh | sudo bash
sudo systemctl start gremlind
sudo systemctl enable gremlind
# Verify agent is running
sudo gremlin status
# Expected output:
# Gremlin daemon is running
# Client ID: abc123-def456
# Team ID: team-789
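Before starting any experiment, it can help to turn the status check into a scripted preflight. The following is a minimal sketch, assuming the agent runs as the systemd unit gremlind shown above; service_state and preflight are hypothetical helper names, not part of the Gremlin CLI.

```shell
# service_state prints the systemd state of a unit, or "unknown" when
# systemctl is missing or gives no answer (for example inside a container).
service_state() {
  state="$(systemctl is-active "$1" 2>/dev/null)" || true
  echo "${state:-unknown}"
}

# preflight succeeds only when the reported state is exactly "active".
preflight() {
  if [ "$1" = "active" ]; then
    echo "ok: agent is running"
  else
    echo "abort: agent state is '$1'"
    return 1
  fi
}

# Usage (not executed here):
#   preflight "$(service_state gremlind)" || exit 1
```

Keeping the decision in preflight separate from the systemd query makes the check easy to exercise on a machine where the agent is not installed.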
Step 2: Authenticate and Create a Team
Configure the agent with your team credentials:
# Authenticate with Gremlin
gremlin login --team admin-team --password ********
# Expected output:
# ✅ Successfully authenticated
# Welcome to Gremlin
# List available attack types
gremlin help attacks
# Expected output:
# CPU - Consume CPU resources
# Memory - Consume memory resources
# Blackhole - Drop all network traffic
# Latency - Add network latency
# Packet Loss - Drop network packets
# DNS - Block or modify DNS responses
# Process Kill - Terminate a specific process
# Shutdown - Shutdown or reboot the host
Step 3: Run a CPU Attack
The simplest attack consumes CPU resources on a target host:
# Saturate 1 CPU core for 60 seconds
gremlin attack cpu \
--length 60 \
--cores 1
# Expected output:
# ✅ CPU attack created
# Attack ID: cpu-attack-001
# Type: CPU
# Targets: 1 host
# Duration: 60 seconds
Monitor the effect on the target:
# On the target host check CPU usage
top -bn1 | head -5
# Expected output on a single-core host (on a multi-core host, user CPU rises by roughly one core's share):
# %Cpu(s): 100.0 us, 0.0 sy, 0.0 ni
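A single top snapshot can be noisy, so one alternative is to compute CPU utilization over a fixed interval from /proc/stat. This is a sketch for Linux hosts only; cpu_busy_percent and sample_cpu are hypothetical helper names, and the 5-second default interval is an arbitrary choice.

```shell
# cpu_busy_percent takes two aggregate "cpu ..." lines from /proc/stat and
# prints the integer percentage of non-idle time between the two samples.
cpu_busy_percent() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    {
      total = 0
      for (i = 2; i <= 9; i++) total += $i   # user through steal
      idle = $5 + $6                          # idle plus iowait
      if (NR == 1) { total0 = total; idle0 = idle; next }
      dt = total - total0
      busy = dt - (idle - idle0)
      if (dt > 0) printf "%d\n", 100 * busy / dt
      else print 0
    }'
}

# sample_cpu reads /proc/stat twice, N seconds apart (default 5),
# and prints the busy percentage across all cores for that window.
sample_cpu() {
  before=$(head -n 1 /proc/stat)
  sleep "${1:-5}"
  after=$(head -n 1 /proc/stat)
  cpu_busy_percent "$before" "$after"
}

# Usage (not executed here):
#   sample_cpu 10
```

Running sample_cpu once before the attack and once during it gives a before/after pair to record alongside the experiment hypothesis.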
Step 4: Run a Network Latency Attack
Simulate network delays between services:
# Add 200ms latency to port 8080 for 120 seconds
gremlin attack latency \
--length 120 \
--target 1 \
--port 8080 \
--latency 200
# Expected output:
# ✅ Latency attack created
# Attack ID: lat-attack-002
# Type: Latency
# Latency: 200ms
# Duration: 120 seconds
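Verify the latency is applied. Because the attack is scoped to port 8080, measure a TCP connection to that port; ping uses ICMP, which has no port and is typically not delayed by a port-scoped attack:
# From another host, measure the TCP connect time to the service on port 8080
curl -o /dev/null -s -w '%{time_connect}\n' http://target-host:8080/
# Expected output: connect time in seconds, roughly 0.2 higher than before the attack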
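To make this verification repeatable, one option is to record a baseline before the attack and compare it with a measurement taken while the attack runs. The sketch below assumes the service answers HTTP on port 8080 and that curl is installed; measure_connect_ms and latency_delta_ok are hypothetical helper names, and the 25 ms tolerance in the usage note is an arbitrary choice.

```shell
# measure_connect_ms prints the TCP connect time for a URL in whole milliseconds.
# It uses curl's time_connect write-out variable, which is reported in seconds.
measure_connect_ms() {
  curl -o /dev/null -s -w '%{time_connect}\n' "$1" | awk '{ printf "%d\n", $1 * 1000 }'
}

# latency_delta_ok BASELINE_MS OBSERVED_MS EXPECTED_MS TOLERANCE_MS
# succeeds when (observed - baseline) is within the tolerance of the expected delay.
latency_delta_ok() {
  delta=$(( $2 - $1 ))
  diff=$(( delta - $3 ))
  if [ "$diff" -lt 0 ]; then
    diff=$(( -diff ))
  fi
  [ "$diff" -le "$4" ]
}

# Usage (not executed here):
#   base=$(measure_connect_ms http://target-host:8080/)    # before the attack
#   during=$(measure_connect_ms http://target-host:8080/)  # while it runs
#   latency_delta_ok "$base" "$during" 200 25 && echo "latency confirmed"
```

Connect time covers a single TCP handshake, so the added delay should be close to the injected value; total request time would include further round trips and grow by more.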
Step 5: Create a Scenario with Multiple Attacks
Scenarios chain multiple attacks together for complex failure simulations:
# Create a scenario that simulates a degraded database
gremlin scenario create \
--name "Degraded Database" \
--description "Simulates a database experiencing resource pressure"
# Then add steps:
gremlin scenario step add \
--scenario-id "scenario-001" \
--attack cpu --length 120 --target db-host
gremlin scenario step add \
--scenario-id "scenario-001" \
--attack latency --length 120 --target db-host --port 5432 --latency 100
gremlin scenario run --scenario-id "scenario-001"
# Expected output:
# ✅ Scenario 'Degraded Database' is now running
Learning Path
flowchart LR
  A[LitmusChaos] --> B[Gremlin Platform]
  B --> C[AWS Fault Injection]
  C --> D[Azure Chaos Studio]
  D --> E[Latency Injection]
  style B fill:#f90,color:#fff
Common Errors
- Running attacks without the Gremlin daemon running: The gremlind daemon must be active on the target host. Check with sudo gremlin status.
- Not setting a duration on attacks: Without a length parameter attacks run until manually stopped. Always set a duration.
- Using incorrect port numbers for latency attacks: When targeting specific ports ensure the service is actually listening on that port.
- Overlapping attacks on the same host: Multiple concurrent attacks on a single host can produce unpredictable results.
- Forgetting to halt the attack when done: Use gremlin halt <attack-id> to stop an attack early.
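The missing-duration mistake above can be caught by a small wrapper that refuses to build an attack command without an explicit length. This is a dry-run sketch: has_length and safe_attack are hypothetical helpers, and safe_attack only prints the command it would run instead of invoking the Gremlin CLI.

```shell
# has_length succeeds when the arguments contain --length followed by a
# positive integer.
has_length() {
  while [ "$#" -gt 0 ]; do
    if [ "$1" = "--length" ] && [ "$#" -ge 2 ]; then
      case "$2" in
        ''|*[!0-9]*) return 1 ;;        # empty or non-numeric value
        *) [ "$2" -gt 0 ]; return ;;    # numeric: must be greater than zero
      esac
    fi
    shift
  done
  return 1
}

# safe_attack prints the command line only when a length is present;
# otherwise it reports the problem on stderr and fails.
safe_attack() {
  if has_length "$@"; then
    echo "run: gremlin attack $*"
  else
    echo "refused: no valid --length given" >&2
    return 1
  fi
}

# Usage (not executed here):
#   safe_attack cpu --length 60
```

Replacing the echo with a real invocation is left out on purpose, so the guard can be tried on a machine without the agent installed.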
Practice Questions
- What attack types does Gremlin support for network chaos?
- How do you install the Gremlin agent on a Linux host?
- What is a Gremlin scenario and how does it differ from a single attack?
- How do you verify that a latency attack is active?
- How do you stop a running Gremlin attack manually?
Challenge
Create a Gremlin scenario that simulates a three-stage failure: first consume 50 percent CPU on the application server, then after 60 seconds add 300ms latency to the database connection, and finally after another 60 seconds kill a process on a cache server. Run the scenario and document the system behavior at each stage.