Anomaly Detection in Metrics: Statistical and ML-Based Methods
In this tutorial, you'll learn how to detect anomalies in metrics using statistical and ML-based methods. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.
What You Will Learn
This tutorial teaches you how to detect anomalies in time-series metrics using statistical techniques (Z-score, moving averages, seasonal decomposition) and machine learning approaches (isolation forests, autoencoders), along with practical implementation using Prometheus and Grafana.
Why It Matters
Static thresholds cannot catch every problem. A gradual memory leak looks normal against a fixed threshold until the server crashes. Anomaly detection learns the normal pattern of your metrics and alerts you when behavior deviates, catching issues that static rules miss entirely.
Real-World Use
The DodaTech infrastructure team uses anomaly detection on disk I/O metrics. When a background backup process started running during business hours instead of midnight, the anomaly detection flagged the unusual daytime I/O pattern within 15 minutes -- before it could degrade production database performance.
Anomaly detection identifies data points that deviate significantly from the expected pattern. In time-series monitoring, anomalies can be: point anomalies (single spike), contextual anomalies (normal value but wrong time), or collective anomalies (a sequence of values that is unusual as a group).
Prerequisites
- A Prometheus instance with at least 30 days of metrics (see Prometheus Introduction)
- Python 3.8+ with numpy, scipy, pandas, statsmodels, and scikit-learn (all are imported in the steps below)
- A Grafana instance for visualization (see Grafana Dashboards)
- Basic statistics knowledge (mean, standard deviation)
Step-by-Step Tutorial
Step 1: Statistical Anomaly Detection with Z-Score
The Z-score measures how many standard deviations a data point is from the mean.
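import numpy as np
from scipy import stats
def detect_anomalies_zscore(data, threshold=3):
    """Return the indices whose absolute Z-score exceeds the threshold."""
    z_scores = np.abs(stats.zscore(data))
    return np.where(z_scores > threshold)[0]
# Example with CPU metric
cpu_usage = [45, 47, 46, 48, 44, 85, 46, 45, 47]  # 85 is the anomaly
# With only 9 samples, |z| cannot exceed sqrt(9 - 1), about 2.83, so the default
# threshold of 3 would never fire here. Use a lower threshold for this small demo.
anomalies = detect_anomalies_zscore(cpu_usage, threshold=2.5)
print(f"Anomalous indices: {anomalies}")  # [5]
Expected output: Index 5 (value 85) is flagged as anomalous; its Z-score is about 2.82.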
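To make the calculation concrete, here is a minimal sketch that computes the same Z-scores by hand with NumPy. The helper name `zscores` is made up for illustration. It uses the population standard deviation (`ddof=0`), which is also the default of `scipy.stats.zscore`, and prints `sqrt(n - 1)`, the largest absolute Z-score any single point can reach in a sample of `n` values under that definition.

```python
import numpy as np

def zscores(values):
    """Z-score of each value: (x - mean) / population standard deviation."""
    x = np.asarray(values, dtype=float)
    return (x - x.mean()) / x.std()  # np.std defaults to ddof=0

sample = [45, 47, 46, 48, 44, 85, 46, 45, 47]
z = zscores(sample)
ceiling = np.sqrt(len(sample) - 1)  # upper bound on |z| for n samples

print(np.round(z, 2))
print(f"largest |z| = {np.abs(z).max():.2f}, ceiling = {ceiling:.2f}")
```

With short windows the ceiling is low, so it is worth checking that the chosen threshold is actually reachable for the number of samples being scored.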
Step 2: Moving Average with Standard Deviation Bands
import pandas as pd
import numpy as np
def detect_with_moving_avg(series, window=10, sigma=3):
    # shift(1) compares each point with the previous `window` points only.
    # A point included in its own 10-sample window can never lie 3 sample
    # standard deviations from that window's mean, so nothing would be flagged.
    baseline = series.shift(1).rolling(window=window)
    rolling_mean = baseline.mean()
    rolling_std = baseline.std()
    upper = rolling_mean + (sigma * rolling_std)
    lower = rolling_mean - (sigma * rolling_std)
    anomalies = series[(series > upper) | (series < lower)]
    return anomalies
# Load metrics from CSV
df = pd.read_csv("metrics.csv")
anomalies = detect_with_moving_avg(df["cpu_usage"])
print(f"Found {len(anomalies)} anomalies")
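If you do not have a metrics.csv to hand, the idea can be tried on synthetic data. The sketch below is an illustration under stated assumptions, not the tutorial's dataset: the series, the injected spike at position 150, and the helper name `rolling_band_anomalies` are all made up. It compares each point against the mean and standard deviation of the preceding `window` points.

```python
import numpy as np
import pandas as pd

def rolling_band_anomalies(series, window=10, sigma=3):
    """Flag points outside mean +/- sigma*std of the preceding `window` points."""
    prior = series.shift(1).rolling(window=window)  # exclude the current point
    mean, std = prior.mean(), prior.std()
    outside = (series > mean + sigma * std) | (series < mean - sigma * std)
    return series[outside]

# Synthetic CPU-like series: noise around 50 with one injected spike
rng = np.random.default_rng(0)
cpu = pd.Series(50 + rng.normal(0, 2, 200))
cpu.iloc[150] = 90.0

found = rolling_band_anomalies(cpu)
print(found)
```

A ten-sample window gives a noisy estimate of the standard deviation, so the band may also flag a few ordinary points. A longer window trades responsiveness for fewer false alarms.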
Step 3: Seasonal Decomposition for Recurring Patterns
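import numpy as np
from statsmodels.tsa.seasonal import seasonal_decompose
def seasonal_anomaly_detection(series, period=1440):  # 1440 = daily (minute data)
    # Needs at least two full periods of data; returns a boolean mask of anomalous points
    result = seasonal_decompose(series, model="additive", period=period)
    residual = result.resid
    # The residual has NaN at the edges, hence nanstd
    anomalies = np.abs(residual) > 3 * np.nanstd(residual)
    return anomalies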
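The sketch below illustrates why removing the seasonal pattern matters, using NumPy only. It is a simplified stand-in, not `seasonal_decompose` itself: it ignores trend and estimates the seasonal profile as the median at each position in the cycle. The function name, the synthetic hourly data, and the injected value are assumptions made for illustration.

```python
import numpy as np

def seasonal_residual_anomalies(values, period, sigma=3.0):
    """Flag points that deviate from the typical value at their position in the cycle."""
    x = np.asarray(values, dtype=float)
    phase = np.arange(len(x)) % period
    # Median at each position in the cycle = expected seasonal profile
    profile = np.array([np.median(x[phase == p]) for p in range(period)])
    residual = x - profile[phase]
    return np.abs(residual) > sigma * residual.std()

# Hypothetical hourly data: 14 days of a daily sine pattern plus small noise
rng = np.random.default_rng(0)
hours = np.arange(24 * 14)
load = 50 + 20 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 1, hours.size)
spike_at = 3 * 24 + 18
load[spike_at] = 65.0  # ordinary in the morning, far above the ~30 expected at 18:00

mask = seasonal_residual_anomalies(load, period=24)
global_z = abs(load[spike_at] - load.mean()) / load.std()
print(np.flatnonzero(mask), round(float(global_z), 2))
```

The injected value stays well inside the overall range of the series, so a global Z-score does not reach 3, while the residual against the seasonal profile does. This is the contextual case: a normal value at the wrong time.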
Step 4: PromQL for Anomaly Detection
# Simple Z-score approximation in PromQL
(
node_memory_MemAvailable_bytes
-
avg_over_time(node_memory_MemAvailable_bytes[24h])
)
/
stddev_over_time(node_memory_MemAvailable_bytes[24h])
Expected output: Z-score values. Values above 3 or below -3 indicate anomalies.
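When the same check is needed for several metrics, the expression can be generated rather than copied. The sketch below builds the Z-score expression as a string, along with the URL for Prometheus's instant-query HTTP endpoint (`/api/v1/query`). The helper names and the base URL `http://prometheus:9090` are illustrative assumptions. The code only constructs strings and does not contact a server.

```python
from urllib.parse import urlencode

def zscore_expr(metric, window="24h"):
    """PromQL Z-score approximation for one metric over a lookback window."""
    return (f"({metric} - avg_over_time({metric}[{window}]))"
            f" / stddev_over_time({metric}[{window}])")

def instant_query_url(base_url, expr):
    """URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"

expr = zscore_expr("node_memory_MemAvailable_bytes")
url = instant_query_url("http://prometheus:9090", expr)
print(expr)
print(url)
```

Note that `stddev_over_time` returns the population standard deviation and that the window includes the current sample, so a short window caps how large the ratio can get.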
Step 5: Use Grafana Machine Learning (Grafana ML)
Grafana ML is a plugin that provides anomaly detection.
- Install the Grafana ML plugin:
grafana-cli plugins install grafana-ml-app
- Enable the plugin in grafana.ini:
[plugins]
enable_alpha = true
- In Grafana, go to ML > Anomaly Detection
- Select a time series query
- Set sensitivity (higher = more anomalies detected)
- Train the model (requires 7+ days of data)
- View anomaly timeline
Step 6: Implement Isolation Forest for Multivariate Detection
from sklearn.ensemble import IsolationForest
def isolation_forest_detection(metrics_df, contamination=0.01):
    model = IsolationForest(contamination=contamination, random_state=42)
    predictions = model.fit_predict(metrics_df)  # -1 = anomaly, 1 = normal
    anomalies = metrics_df[predictions == -1]
    return anomalies
# Multiple metrics as features
features = df[["cpu_usage", "memory_usage", "disk_io", "network_in"]]
anomalies = isolation_forest_detection(features)
print(f"Found {len(anomalies)} multivariate anomalies")
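To try the model without real metrics, the sketch below runs it on a synthetic frame with one injected outlier row. The data, the row position 250, and the variable names are assumptions for illustration. It also uses `decision_function`, which returns a per-row score (lower means more anomalous) that can serve as a rough severity ranking.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic metrics: four independent, roughly normal features
rng = np.random.default_rng(42)
n = 500
frame = pd.DataFrame({
    "cpu_usage": rng.normal(40, 5, n),
    "memory_usage": rng.normal(60, 4, n),
    "disk_io": rng.normal(100, 10, n),
    "network_in": rng.normal(20, 3, n),
})
frame.loc[250] = [95.0, 97.0, 400.0, 80.0]  # injected outlier row

model = IsolationForest(contamination=0.01, random_state=42)
labels = model.fit_predict(frame)        # -1 = anomaly, 1 = normal
scores = model.decision_function(frame)  # lower = more anomalous

flagged = frame[labels == -1].assign(score=scores[labels == -1])
print(flagged.sort_values("score"))
```

Because `contamination=0.01` tells the model to expect about 1% anomalies, it flags roughly five rows here even though only one was injected. The score column helps separate the obvious outlier from borderline rows.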
Step 7: Deploy the Prometheus Anomaly Detector
docker run -d --name anomaly-detector \
-p 8080:8080 \
-e PROMETHEUS_URL=http://prometheus:9090 \
-e METRICS_QUERY='node_memory_MemAvailable_bytes' \
-e DETECTION_METHOD='zscore' \
-e ZSCORE_THRESHOLD=3 \
-e WINDOW_SIZE=24h \
bitnami/prometheus-anomaly-detector:latest
Step 8: Visualize Anomalies in Grafana
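# Query for anomalies (values where Z-score > 3 or < -3)
(
(node_memory_MemAvailable_bytes - avg_over_time(node_memory_MemAvailable_bytes[24h]))
/
stddev_over_time(node_memory_MemAvailable_bytes[24h])
) > 3
or
(
(node_memory_MemAvailable_bytes - avg_over_time(node_memory_MemAvailable_bytes[24h]))
/
stddev_over_time(node_memory_MemAvailable_bytes[24h])
) < -3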
In the Grafana panel, set:
- Threshold: Show regions between 3 and infinity as critical
- Threshold: Show regions between -infinity and -3 as critical
- Override color for the threshold region to red
Learning Path
flowchart LR
A[Metric Data] --> B[Statistical Methods]
A --> C[Machine Learning]
B --> D[Z-Score]
B --> E[Moving Average]
B --> F[Seasonal Decomposition]
C --> G[Isolation Forest]
C --> H[Autoencoder]
D --> I[Anomaly Alerts]
E --> I
F --> I
G --> I
H --> I
style A fill:#4a90d9,color:#fff
style I fill:#e67e22,color:#fff
Common Errors
Z-score misses gradual trends -- A memory leak causes a slow increase over days. A Z-score computed over the whole history treats that drift as normal because the mean and standard deviation grow along with it. Use a moving-average Z-score for sudden deviations, and watch the trend component from seasonal decomposition to catch the drift itself.
Seasonal decomposition fails with irregular periods -- The period parameter must equal the number of samples in one seasonal cycle (for example, 1440 for a daily cycle in per-minute data). Use autocorrelation to find the dominant period.
Isolation Forest flags normal patterns as anomalies -- The contamination parameter is too high. Start with 0.01 and adjust based on results.
High false positive rate from static thresholds -- The anomaly detection does not account for seasonality. Use contextual anomaly detection that considers time of day and day of week.
Grafana ML requires too much data -- The plugin needs at least 7 days of historical data for training. Seed it with a longer data range initially.
Anomaly detector consumes too much CPU -- The window_size is too large or the evaluation runs too often. Evaluate every 15 minutes instead of every minute.
Prometheus anomaly detector returns no results -- The Prometheus URL or query is misconfigured. Test the query directly in Prometheus first.
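The advice above to use autocorrelation to find the dominant period can be sketched as follows, using NumPy only. The function name and the synthetic hourly series are assumptions for illustration. The approach, one of several reasonable ones, skips the initial decay near lag 0 by waiting for the autocorrelation to first go negative, then takes the highest peak after that.

```python
import numpy as np

def dominant_period(values, max_lag=None):
    """Estimate the seasonal period (in samples) from the autocorrelation peak."""
    x = np.asarray(values, dtype=float)
    x = x - x.mean()
    n = len(x)
    max_lag = max_lag or n // 2
    # Autocorrelation for lags 0..max_lag, normalised so that lag 0 equals 1
    acf = np.array([np.dot(x[:n - lag], x[lag:]) for lag in range(max_lag + 1)])
    acf = acf / acf[0]
    negative = np.flatnonzero(acf < 0)
    if negative.size == 0:
        return None  # no clear cycle within max_lag
    start = negative[0]  # skip the initial decay around lag 0
    return int(start + np.argmax(acf[start:]))

# Hypothetical hourly series with a daily cycle
rng = np.random.default_rng(1)
t = np.arange(24 * 14)
series = 50 + 20 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 1, t.size)
print(dominant_period(series))
```

For this synthetic hourly series the estimate is 24, which is the value to pass as `period`. Real metrics with several overlapping cycles (daily and weekly) may need the search restricted with `max_lag`.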
Practice Questions
What is a Z-score and how is it used for anomaly detection? Answer: A Z-score measures how many standard deviations a point is from the mean. Points whose absolute Z-score exceeds 3 are typically considered anomalous.
Why does moving-average anomaly detection work better for gradual changes? Answer: The moving average adapts to the current level, so gradual trends do not trigger false alarms -- only rapid deviations from the recent trend are flagged.
What is the advantage of multivariate anomaly detection over univariate? Answer: Multivariate methods detect anomalies that involve multiple metrics simultaneously, like high CPU combined with high disk I/O, which might appear normal individually.
How does seasonal decomposition help detect contextual anomalies? Answer: It separates data into trend, seasonal, and residual components. Anomalies in the residual component are contextual -- they deviate from the expected seasonal pattern.
What is the purpose of the contamination parameter in Isolation Forest? Answer: It sets the expected proportion of anomalies in the data, controlling the threshold for flagging points as anomalous.
Challenge
Collect 30 days of metrics from a web application (CPU, memory, request latency, error rate, and request volume). Implement three anomaly detection methods: Z-score (point anomalies), moving average (trend deviations), and seasonal decomposition (contextual anomalies in hourly patterns). Train an Isolation Forest model on all five metrics and compare its findings with the statistical methods. Build a Grafana dashboard with: a time-series panel showing raw data with anomaly overlays in red, a table listing detected anomalies with severity scores, and a heatmap showing anomaly frequency by hour of day. Set up an alert that fires when the Isolation Forest detects an anomaly in at least two metrics simultaneously.
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro