Anomaly Detection in Metrics: Statistical and ML-Based Methods

DodaTech Updated 2026-06-23 7 min read

This tutorial covers anomaly detection in metrics with statistical and ML-based methods: the key concepts, practical examples, and best practices for applying them.

What You Will Learn

This tutorial teaches you how to detect anomalies in time-series metrics using statistical techniques (Z-score, moving averages, seasonal decomposition) and Machine Learning approaches (isolation forests, autoencoders), along with practical implementation using Prometheus and Grafana.

Why It Matters

Static thresholds cannot catch every problem. A gradual memory leak looks normal against a fixed threshold until the server crashes. Anomaly detection learns the normal pattern of your metrics and alerts you when behavior deviates, catching issues that static rules miss entirely.

Real-World Use

The DodaTech infrastructure team uses anomaly detection on disk I/O metrics. When a background backup process started running during business hours instead of midnight, the anomaly detection flagged the unusual daytime I/O pattern within 15 minutes -- before it could degrade production database performance.

Anomaly detection identifies data points that deviate significantly from the expected pattern. In time-series monitoring, anomalies can be: point anomalies (single spike), contextual anomalies (normal value but wrong time), or collective anomalies (a sequence of values that is unusual as a group).

Prerequisites

A Prometheus instance with at least 30 days of metrics (see Prometheus Introduction)
Python 3.8+ with numpy, scipy, pandas, statsmodels, and scikit-learn
A Grafana instance for visualization
Basic statistics knowledge (mean, standard deviation)

Step-by-Step Tutorial

Step 1: Statistical Anomaly Detection with Z-Score

The Z-score measures how many standard deviations a data point is from the mean.

import numpy as np
from scipy import stats

def detect_anomalies_zscore(data, threshold=3):
    z_scores = np.abs(stats.zscore(data))
    return np.where(z_scores > threshold)[0]

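# Example with CPU metric
# With n samples, the largest possible population Z-score is sqrt(n - 1),
# so the sample must hold more than 10 points for any value to exceed 3.
cpu_usage = [45, 47, 46, 48, 44, 85, 46, 45, 47, 46, 44, 48]  # 85 is the anomaly
anomalies = detect_anomalies_zscore(cpu_usage)
print(f"Anomalous indices: {anomalies}")  # [5]

Expected output: Index 5 (value 85) is flagged as anomalous, with a Z-score of about 3.29.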
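
One caveat with short samples: the outlier inflates the mean and standard deviation it is measured against, so on a nine-point sample the plain Z-score tops out at about 2.83. A common workaround, sketched here as an alternative rather than as part of the method above, is the modified Z-score built on the median and the median absolute deviation (MAD). The function name is hypothetical, and the 0.6745 scaling constant and 3.5 cutoff follow the usual convention for this score; they are assumptions, not values from this tutorial.

```python
import numpy as np

def detect_anomalies_modified_zscore(data, threshold=3.5):
    """Flag points using the median/MAD-based modified Z-score."""
    values = np.asarray(data, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    if mad == 0:
        # At least half the points equal the median, so the score is undefined.
        return np.array([], dtype=int)
    modified_z = 0.6745 * (values - median) / mad
    return np.where(np.abs(modified_z) > threshold)[0]

cpu_usage = [45, 47, 46, 48, 44, 85, 46, 45, 47]
print(detect_anomalies_modified_zscore(cpu_usage))  # [5]
```

Because the median and MAD barely move when one value is extreme, the spike stands out even in a sample this small.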

Step 2: Moving Average with Standard Deviation Bands

import pandas as pd
import numpy as np

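def detect_with_moving_avg(series, window=10, sigma=3):
    # Build the band from the previous `window` points only (shift by one).
    # If the current point were included, it would inflate its own band and
    # a 10-point window could never exceed 3 sigma.
    rolling_mean = series.rolling(window=window).mean().shift(1)
    rolling_std = series.rolling(window=window).std().shift(1)
    upper = rolling_mean + (sigma * rolling_std)
    lower = rolling_mean - (sigma * rolling_std)
    anomalies = series[(series > upper) | (series < lower)]
    return anomalies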

# Load metrics from CSV
df = pd.read_csv("metrics.csv")
anomalies = detect_with_moving_avg(df["cpu_usage"])
print(f"Found {len(anomalies)} anomalies")
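
To see how a rolling band behaves without a metrics.csv at hand, here is a minimal sketch on synthetic data. It assumes a variant of the band detector that computes the mean and standard deviation from the preceding window only; the function name and the synthetic series (slow drift, small jitter, one spike) are made up for illustration.

```python
import pandas as pd

def rolling_band_anomalies(series, window=10, sigma=3):
    """Flag points outside mean +/- sigma*std of the preceding `window` points."""
    baseline = series.rolling(window=window)
    mean = baseline.mean().shift(1)   # shift(1): exclude the point being tested
    std = baseline.std().shift(1)
    outside = (series > mean + sigma * std) | (series < mean - sigma * std)
    return series[outside]

# Synthetic CPU series: slow upward drift, +/-1 jitter, one spike at index 60.
values = [50 + 0.1 * i + (1 if i % 2 else -1) for i in range(120)]
values[60] += 40
cpu = pd.Series(values)

flagged = rolling_band_anomalies(cpu)
print(flagged.index.tolist())  # [60]
```

The drift never trips the band because the window mean follows it; only the abrupt jump does. The first `window` points are never flagged, since their band is still undefined.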

Step 3: Seasonal Decomposition for Recurring Patterns

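import numpy as np
from statsmodels.tsa.seasonal import seasonal_decompose

def seasonal_anomaly_detection(series, period=1440):  # 1440 = daily (minute data)
    # Needs at least two full periods of data (2880 points at the default).
    result = seasonal_decompose(series, model="additive", period=period)
    # The residual is what remains after removing trend and seasonality;
    # it is NaN at both ends of the series, hence nanstd.
    residual = result.resid
    anomalies = np.abs(residual) > 3 * np.nanstd(residual)
    return anomalies  # boolean mask, True where the residual is anomalous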
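
The idea behind residual-based detection can be illustrated with NumPy alone. The sketch below swaps in a simpler technique than statsmodels' decomposition: it takes the median of each slot in the cycle (for example, each hour of the day) as the baseline and flags large residuals against it. The function name and the synthetic hourly series are assumptions made for illustration.

```python
import numpy as np

def slot_baseline_anomalies(values, period, sigma=3):
    """Compare each point with the median of the same slot in every cycle."""
    x = np.asarray(values, dtype=float)
    cycles = len(x) // period
    if cycles < 2:
        raise ValueError("need at least two full periods")
    x = x[: cycles * period]                  # drop any incomplete final cycle
    grid = x.reshape(cycles, period)          # one row per cycle
    baseline = np.median(grid, axis=0)        # typical value per slot
    residual = (grid - baseline).ravel()
    return np.abs(residual) > sigma * np.std(residual)

# 14 days of hourly data with a clean daily cycle (trough of 30 at 18:00).
hours = np.arange(14 * 24)
load = 50 + 20 * np.sin(2 * np.pi * (hours % 24) / 24)
load[10 * 24 + 18] = 65   # an ordinary daytime value, but at the wrong hour

mask = slot_baseline_anomalies(load, period=24)
print(np.flatnonzero(mask))                      # [258]
print(abs(65 - load.mean()) / load.std() > 3)    # False: a global Z-score misses it
```

The value 65 is unremarkable for the series as a whole, which is why the global Z-score ignores it; it only stands out once it is compared with what 18:00 usually looks like.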

Step 4: PromQL for Anomaly Detection

# Simple Z-score approximation in PromQL
(
  node_memory_MemAvailable_bytes
  -
  avg_over_time(node_memory_MemAvailable_bytes[24h])
)
/
stddev_over_time(node_memory_MemAvailable_bytes[24h])

Expected output: Z-score values. Values above 3 or below -3 indicate anomalies.

Step 5: Use Grafana Machine Learning (Grafana ML)

Grafana ML is a plugin that provides anomaly detection.

Install the Grafana ML plugin:

grafana-cli plugins install grafana-ml-app

Enable the plugin in grafana.ini:

[plugins]
enable_alpha = true

In Grafana, go to ML > Anomaly Detection
Select a time series query
Set sensitivity (higher = more anomalies detected)
Train the model (requires 7+ days of data)
View anomaly timeline

Step 6: Implement Isolation Forest for Multivariate Detection

from sklearn.ensemble import IsolationForest

def isolation_forest_detection(metrics_df, contamination=0.01):
    model = IsolationForest(contamination=contamination, random_state=42)
    predictions = model.fit_predict(metrics_df)
    anomalies = metrics_df[predictions == -1]
    return anomalies

# Multiple metrics as features
features = df[["cpu_usage", "memory_usage", "disk_io", "network_in"]]
anomalies = isolation_forest_detection(features)
print(f"Found {len(anomalies)} multivariate anomalies")
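
Beyond the yes/no label from fit_predict, scikit-learn's IsolationForest exposes score_samples, which can serve as a rough severity ranking. The following is a sketch on made-up data: the 300 "normal" rows and the single extreme row are synthetic, and the column names simply mirror the ones used above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
columns = ["cpu_usage", "memory_usage", "disk_io", "network_in"]

# Hypothetical data: 300 ordinary samples plus one row that is extreme on every metric.
normal = rng.normal(loc=[50, 60, 100, 200], scale=[5, 5, 10, 20], size=(300, 4))
features = pd.DataFrame(np.vstack([normal, [98, 97, 400, 900]]), columns=columns)

model = IsolationForest(contamination=0.01, random_state=42).fit(features)

# score_samples: lower means more abnormal; predict: -1 marks an anomaly.
features_scored = features.assign(
    score=model.score_samples(features),
    is_anomaly=model.predict(features) == -1,
)
worst_first = features_scored[features_scored["is_anomaly"]].sort_values("score")
print(worst_first.head())
```

Sorting by score puts the most isolated rows first, which is one way to prioritize what to look at when the model flags several points at once.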

Step 7: Deploy the Prometheus Anomaly Detector

docker run -d --name anomaly-detector \
  -p 8080:8080 \
  -e PROMETHEUS_URL=http://prometheus:9090 \
  -e METRICS_QUERY='node_memory_MemAvailable_bytes' \
  -e DETECTION_METHOD='zscore' \
  -e ZSCORE_THRESHOLD=3 \
  -e WINDOW_SIZE=24h \
  bitnami/prometheus-anomaly-detector:latest

Step 8: Visualize Anomalies in Grafana

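# Query for anomalies (values where Z-score > 3 or < -3)
(
  (node_memory_MemAvailable_bytes - avg_over_time(node_memory_MemAvailable_bytes[24h]))
  /
  stddev_over_time(node_memory_MemAvailable_bytes[24h])
) > 3
or
(
  (node_memory_MemAvailable_bytes - avg_over_time(node_memory_MemAvailable_bytes[24h]))
  /
  stddev_over_time(node_memory_MemAvailable_bytes[24h])
) < -3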

In the Grafana panel, set:

Threshold: Show regions between 3 and infinity as critical
Threshold: Show regions between -infinity and -3 as critical
Override color for the threshold region to red

Learning Path

flowchart LR
    A[Metric Data] --> B[Statistical Methods]
    A --> C[Machine Learning]
    B --> D[Z-Score]
    B --> E[Moving Average]
    B --> F[Seasonal Decomposition]
    C --> G[Isolation Forest]
    C --> H[Autoencoder]
    D --> I[Anomaly Alerts]
    E --> I
    F --> I
    G --> I
    H --> I
    style A fill:#4a90d9,color:#fff
    style I fill:#e67e22,color:#fff

Common Errors

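Z-score misses gradual trends -- A memory leak causes a slow increase over days. A Z-score computed over a trailing window sees it as normal because the window mean rises along with the metric. Compare each point against a longer or fixed baseline, or watch the trend component from seasonal decomposition instead.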
Seasonal decomposition fails with irregular periods -- The period parameter must match the data frequency. Use autocorrelation to find the dominant period.
Isolation Forest flags normal patterns as anomalies -- The contamination parameter is too high. Start with 0.01 and adjust based on results.
High false positive rate from a single global threshold -- The detector does not account for seasonality, so regular daily or weekly peaks look like deviations. Use contextual anomaly detection that considers time of day and day of week.
Grafana ML requires too much data -- The plugin needs at least 7 days of historical data for training. Seed it with a longer data range initially.
Anomaly detector consumes too much CPU -- WINDOW_SIZE is too large or evaluations run too frequently. Shorten the window, or evaluate every 15 minutes instead of every minute.
Prometheus anomaly detector returns no results -- The Prometheus URL or query is misconfigured. Test the query directly in Prometheus first.
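
For the irregular-period case above, the dominant period can be estimated from the autocorrelation of the series. The following NumPy-only sketch is one way to do it, under stated assumptions: the function name is hypothetical, the series is evenly sampled, and the heuristic of skipping lags until the autocorrelation first turns negative is a simplification that suits clean cyclic data.

```python
import numpy as np

def dominant_period(values, max_lag=None):
    """Return the lag of the strongest autocorrelation peak after the initial decay."""
    x = np.asarray(values, dtype=float)
    x = x - x.mean()
    n = len(x)
    acf = np.correlate(x, x, mode="full")[n - 1:]   # lags 0 .. n-1
    if acf[0] == 0:
        return None                                 # constant series: no cycle
    acf = acf / acf[0]                              # normalize so lag 0 == 1
    max_lag = max_lag or n // 2
    below_zero = np.flatnonzero(acf[:max_lag] < 0)
    if below_zero.size == 0:
        return None                                 # no cycle visible in this range
    start = below_zero[0]                           # skip the trivial peak near lag 0
    return int(start + np.argmax(acf[start:max_lag]))

# 14 days of hourly samples with a daily cycle.
hours = np.arange(14 * 24)
load = 50 + 20 * np.sin(2 * np.pi * hours / 24)
print(dominant_period(load))  # 24
```

The returned lag is in samples, so it can be passed directly as the period argument of a seasonal decomposition on the same series.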

Practice Questions

What is a Z-score and how is it used for anomaly detection? Answer: A Z-score measures how many standard deviations a point is from the mean. Values with an absolute Z-score above 3 (above 3 or below -3) are typically considered anomalous.
Why does moving-average anomaly detection work better for gradual changes? Answer: The moving average adapts to the current level, so gradual trends do not trigger false alarms -- only rapid deviations from the recent trend are flagged.
What is the advantage of multivariate anomaly detection over univariate? Answer: Multivariate methods detect anomalies that involve multiple metrics simultaneously, like high CPU combined with high disk I/O, which might appear normal individually.
How does seasonal decomposition help detect contextual anomalies? Answer: It separates data into trend, seasonal, and residual components. Anomalies in the residual component are contextual -- they deviate from the expected seasonal pattern.
What is the purpose of the contamination parameter in Isolation Forest? Answer: It sets the expected proportion of anomalies in the data, controlling the threshold for flagging points as anomalous.

Challenge

Collect 30 days of metrics from a web application (CPU, memory, request latency, error rate, and request volume). Implement three anomaly detection methods: Z-score (point anomalies), moving average (trend deviations), and seasonal decomposition (contextual anomalies in hourly patterns). Train an Isolation Forest model on all five metrics and compare its findings with the statistical methods. Build a Grafana dashboard with: a time-series panel showing raw data with anomaly overlays in red, a table listing detected anomalies with severity scores, and a heatmap showing anomaly frequency by hour of day. Set up an alert that fires when the Isolation Forest detects an anomaly in at least two metrics simultaneously.

FAQ

Can anomaly detection replace static threshold alerts?

No, anomaly detection complements static thresholds. Use static thresholds for known failure conditions (disk full) and anomaly detection for unknown failure patterns.

How much historical data do I need for anomaly detection?

As a rule of thumb, statistical methods need at least 24 hours of data and ML methods (Isolation Forest) at least 7 days. Seasonal decomposition needs at least 2 full periods (e.g., 14 days for a 7-day season).

Does anomaly detection work for low-volume metrics?

It is harder because the variance is high and patterns are less clear. Aggregate low-volume metrics into larger time windows.
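
A minimal sketch of that aggregation with pandas, assuming a hypothetical per-minute error counter and a 15-minute bucket size chosen only for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical per-minute error counts for one hour: mostly zero, a few single errors.
index = pd.date_range("2024-01-01 00:00", periods=60, freq="min")
errors = pd.Series(np.zeros(60), index=index)
errors.iloc[[3, 20, 21, 22, 50]] = 1

# Summing into 15-minute buckets turns a sparse 0/1 signal into counts.
per_quarter_hour = errors.resample("15min").sum()
print(per_quarter_hour.tolist())  # [1.0, 3.0, 0.0, 1.0]
```

The burst of three errors is invisible minute by minute but stands out once the counts are bucketed.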

What is the difference between supervised and unsupervised anomaly detection?

Supervised methods require labeled normal/anomaly data. Unsupervised methods (used here) assume anomalies are rare and learn the normal pattern without labels.

Can I run anomaly detection in Prometheus directly?

Prometheus does not have built-in anomaly detection. Use Grafana ML, the Prometheus Anomaly Detector sidecar, or external tools that query Prometheus.
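
As a sketch of the external-tool route, the snippet below turns a range-query result into a pandas Series that any of the Python detectors can consume. The payload is a hand-written sample in the shape Prometheus returns from /api/v1/query_range (a matrix result with string-encoded values); the instance label, the numbers, and the function name are made up, and fetching the JSON over HTTP is left out.

```python
import json
import pandas as pd

# Hypothetical sample shaped like a Prometheus /api/v1/query_range response.
SAMPLE_RESPONSE = """
{"status": "success",
 "data": {"resultType": "matrix",
          "result": [{"metric": {"__name__": "node_memory_MemAvailable_bytes",
                                 "instance": "host-1:9100"},
                      "values": [[1700000000, "4100000000"],
                                 [1700000060, "4090000000"],
                                 [1700000120, "1200000000"]]}]}}
"""

def matrix_to_series(payload, result_index=0):
    """Convert one series of a matrix result into a time-indexed pandas Series."""
    if payload.get("status") != "success":
        raise ValueError("query did not succeed")
    data = payload["data"]
    if data["resultType"] != "matrix":
        raise ValueError("expected a range-query (matrix) result")
    values = data["result"][result_index]["values"]
    timestamps = pd.to_datetime([float(ts) for ts, _ in values], unit="s", utc=True)
    # Sample values arrive as strings and must be converted to floats.
    return pd.Series([float(v) for _, v in values], index=timestamps)

memory = matrix_to_series(json.loads(SAMPLE_RESPONSE))
print(memory)
```

From here the series can be handed to a Z-score, rolling-band, or seasonal detector like the ones in the steps above.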
