PromQL — Prometheus Query Language Complete Guide

DodaTech Updated 2026-06-24 5 min read

In this tutorial, you'll learn about PromQL. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.

PromQL (Prometheus Query Language) is a functional query language for Prometheus that lets you select, aggregate, transform, and analyze time series data in real time for dashboards and alerting.

Why It Matters

Monitoring data is useless if you cannot ask meaningful questions of it. PromQL queries power every Prometheus-backed Grafana dashboard, every alerting rule, and every ad-hoc investigation. DodaTech's SRE team spends 40% of incident response time writing and refining PromQL queries to understand system behavior. Mastering PromQL cuts this time to minutes.

Real-World Use

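When DodaZIP's API latency spiked, the SRE team used a single PromQL query to identify the specific pod, namespace, and database query causing the slowdown: histogram_quantile(0.99, sum by (le, namespace, pod, db_operation) (rate(http_request_duration_seconds_bucket{service="api"}[5m]))). The grouping has to happen inside histogram_quantile and must keep the le label, because the function works on the per-bucket rates.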

flowchart LR
    A[Metrics] --> B[PromQL Query]
    B --> C[Instant Vector]
    B --> D[Range Vector]
    B --> E[Scalar]
    B --> F[String]
    C --> G[Alerting Rules]
    D --> H[Recording Rules]
    C --> I[Grafana Dashboard]
    H --> C
    style B fill:#E6522C,color:#fff

ℹ️ Info

Prerequisites: A basic understanding of the Prometheus metrics model and Grafana dashboards.

Core Data Types

PromQL operates on four data types:

Type	Description	Example
Instant Vector	Set of time series at a single timestamp	`node_cpu_seconds_total`
Range Vector	Set of time series over a time range	`node_cpu_seconds_total[5m]`
Scalar	A single numeric value	`1024`
String	A string value (rarely used)	`"production"`

Metric Selectors

# Simple selector — returns all series with this metric name
node_memory_MemTotal_bytes

# Selector with label matcher
node_cpu_seconds_total{mode="idle", instance="web-01:9100"}

# Supported label matchers:
# =   Equal
# !=  Not equal
# =~  Regex match
# !~  Regex not match
node_cpu_seconds_total{instance=~"web-.*", mode!="idle"}

# Select by metric name pattern
{__name__=~"node_.*_total", instance="web-01:9100"}
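
One detail worth knowing about the regex matchers above: Prometheus anchors them, so instance=~"web-.*" must match the whole label value rather than a substring. The following is a minimal Python sketch of that matching logic, assuming series are plain label dictionaries. The matches helper and the sample series are hypothetical illustrations, not part of any Prometheus API, and Python's re module stands in for the RE2 syntax that Prometheus uses.

```python
import re

# Hypothetical series, each represented as a dict of labels.
SERIES = [
    {"__name__": "node_cpu_seconds_total", "instance": "web-01:9100", "mode": "idle"},
    {"__name__": "node_cpu_seconds_total", "instance": "web-01:9100", "mode": "user"},
    {"__name__": "node_cpu_seconds_total", "instance": "db-web-01:9100", "mode": "user"},
]


def matches(labels, matchers):
    """Return True if the labels satisfy every (name, op, value) matcher."""
    for name, op, value in matchers:
        actual = labels.get(name, "")  # a missing label behaves like an empty string
        if op == "=":
            ok = actual == value
        elif op == "!=":
            ok = actual != value
        elif op == "=~":
            ok = re.fullmatch(value, actual) is not None  # fully anchored match
        elif op == "!~":
            ok = re.fullmatch(value, actual) is None
        else:
            raise ValueError(f"unknown matcher {op!r}")
        if not ok:
            return False
    return True


# Equivalent of node_cpu_seconds_total{instance=~"web-.*", mode!="idle"}
selector = [
    ("__name__", "=", "node_cpu_seconds_total"),
    ("instance", "=~", "web-.*"),
    ("mode", "!=", "idle"),
]
selected = [s for s in SERIES if matches(s, selector)]
print(selected)
```

Only the web-01 series with mode="user" is selected. The db-web-01 instance contains "web-" but does not start with it, so the anchored pattern rejects it.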

Range Vectors

# Rate over 5 minutes
rate(node_cpu_seconds_total{mode="idle"}[5m])

# Increase over 1 hour
increase(node_network_receive_bytes_total[1h])

# Average over 30 minutes
avg_over_time(node_memory_MemFree_bytes[30m])

# Maximum over 10 minutes
max_over_time(http_request_duration_seconds_max[10m])

# Quantile over 5 minutes
quantile_over_time(0.99, http_request_duration_seconds[5m])

Functions

# Rate of increase per second
rate(node_cpu_seconds_total[5m])

# Absolute increase over interval
increase(node_network_receive_bytes_total[1h])

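# Instant rate: per-second rate from the last two samples in the window (reacts quickly, best for graphing volatile counters)
irate(node_cpu_seconds_total[5m])

# Derivative: per-second slope from a linear regression (use on gauges, not counters)
deriv(node_memory_MemFree_bytes[1h])

# Linear forecast: predicted free bytes 3600 seconds (1 hour) from now, based on the last 6 hours
predict_linear(node_filesystem_free_bytes[6h], 3600)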

# Count of series (count, topk, bottomk, and quantile are aggregation operators rather than functions)
count(node_cpu_seconds_total)

# Top k by value
topk(5, node_memory_MemFree_bytes)

# Bottom k
bottomk(3, rate(http_requests_total[5m]))

# Quantile aggregation across series (for histogram buckets, use histogram_quantile instead)
quantile(0.99, rate(http_requests_total[5m]))
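
To make the difference between rate(), increase(), and irate() concrete, here is a simplified Python sketch of the arithmetic. It assumes samples are (timestamp in seconds, counter value) pairs and deliberately skips the extrapolation to the window boundaries that Prometheus performs, so real results will differ slightly. The function names are hypothetical.

```python
def increase_simple(samples):
    """Total counter increase across the samples, treating any drop as a reset."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so the new value is the increase.
        total += cur - prev if cur >= prev else cur
    return total


def rate_simple(samples):
    """Average per-second increase between the first and last sample."""
    duration = samples[-1][0] - samples[0][0]
    return increase_simple(samples) / duration


def irate_simple(samples):
    """Per-second increase using only the last two samples."""
    (t1, v1), (t2, v2) = samples[-2], samples[-1]
    delta = v2 - v1 if v2 >= v1 else v2
    return delta / (t2 - t1)


# One minute of a counter scraped every 15 seconds, with a reset before t=45.
samples = [(0, 100.0), (15, 130.0), (30, 160.0), (45, 10.0), (60, 70.0)]

print(increase_simple(samples))  # 30 + 30 + 10 + 60
print(rate_simple(samples))      # increase divided by 60 seconds
print(irate_simple(samples))     # (70 - 10) / 15
```

The reset between t=30 and t=45 does not produce a negative increase, which is why these functions are safe on counters that restart. irate reacts only to the final interval, so it is much noisier than rate over the same window.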

Aggregation Operators

# Sum by label
sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))

# Average by namespace
avg by (namespace) (rate(container_cpu_usage_seconds_total[5m]))

# Max across all instances
max(rate(node_network_receive_bytes_total[5m]))

# Min by service
min by (service) (http_request_duration_seconds_count)

# Count of instances per service
count by (service) (up)

# Standard deviation
stddev by (namespace) (container_memory_working_set_bytes)

# Variance
stdvar by (instance) (rate(node_cpu_seconds_total[5m]))
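
Conceptually, every "by" aggregation groups series on the listed labels, drops all other labels, and combines the values in each group. A minimal Python sketch of sum by, assuming an instant vector is a list of (labels, value) pairs; the sum_by helper and the sample values are hypothetical:

```python
from collections import defaultdict


def sum_by(vector, by):
    """Group an instant vector on the given label names and sum each group."""
    groups = defaultdict(float)
    for labels, value in vector:
        # Only the "by" labels survive; everything else is aggregated away.
        key = tuple((name, labels.get(name, "")) for name in by)
        groups[key] += value
    return dict(groups)


cpu_rates = [
    ({"instance": "web-01", "mode": "user"}, 0.25),
    ({"instance": "web-01", "mode": "system"}, 0.125),
    ({"instance": "web-02", "mode": "user"}, 0.5),
]

# Equivalent in spirit to: sum by (instance) (...)
per_instance = sum_by(cpu_rates, by=["instance"])
print(per_instance)
```

The result has one entry per instance, and the mode label is gone. Swapping the addition for max, min, or a count gives the other aggregation operators.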

Operators

# Arithmetic operators
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

rate(node_cpu_seconds_total{mode="idle"}[5m]) * 100

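# Comparison operators (filter mode: keep only the series for which the comparison is true)
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes > 0.2

# bool modifier: return 1 (true) or 0 (false) for every series instead of filtering
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes > bool 0.2

# Logical/set operators
# Union (or): CPU rates of web instances, plus free-memory series that have no matching label set on the left
rate(node_cpu_seconds_total{instance=~"web.*"}[5m]) or node_memory_MemFree_bytes{instance=~"web.*"}

# Intersection (and): left-side series that have a partner on the right, compared on the instance label only
node_memory_MemTotal_bytes{instance=~"web.*"} and on(instance) node_cpu_seconds_total{instance=~"web.*"}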
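
The difference between a plain comparison and the bool modifier is easy to miss: without bool, a comparison filters series and keeps their values; with bool, it returns 1 or 0 for every series. A small Python sketch of both behaviors, assuming an instant vector is a list of (labels, value) pairs (the greater_than helper and the sample ratios are hypothetical):

```python
def greater_than(vector, threshold, bool_modifier=False):
    """Sketch of `vector > threshold` and `vector > bool threshold`."""
    if bool_modifier:
        # Every input series survives; its value becomes 1.0 or 0.0.
        return [(labels, 1.0 if value > threshold else 0.0) for labels, value in vector]
    # Filter mode: matching series keep their original value, the rest are dropped.
    return [(labels, value) for labels, value in vector if value > threshold]


available_ratio = [
    ({"instance": "web-01"}, 0.35),
    ({"instance": "web-02"}, 0.10),
]

filtered = greater_than(available_ratio, 0.2)
flagged = greater_than(available_ratio, 0.2, bool_modifier=True)
print(filtered)
print(flagged)
```

Filter mode is what alerting rules rely on: an alert fires for each series that survives the comparison. The bool form is more useful inside arithmetic, where a missing series would otherwise silently drop out of the result.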

Vector Matching

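# One-to-one matching: ignore the status label so each 500 series pairs with the total for the same remaining labels
rate(http_requests_total{status="500"}[5m])
/ ignoring(status)
sum without (status) (rate(http_requests_total[5m]))

# Many-to-one using group_left: many per-series rates on the left, one total per namespace on the right
rate(http_requests_total[5m])
/
on(namespace) group_left sum by (namespace) (rate(http_requests_total[5m]))

# Many-to-one again: the "many" side (per instance) is the left operand, so group_left applies.
# group_right is the mirror image, used when the "many" side is the right operand.
sum by (instance, cluster) (rate(node_cpu_seconds_total[5m]))
/
on(cluster) group_left
count by (cluster) (node_cpu_seconds_total)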
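
Many-to-one matching can be pictured as a lookup: index the "one" side by the on() labels, then pair every series on the "many" side with its entry. The Python sketch below illustrates division with on(cluster) group_left under that assumption. The divide_many_to_one helper and the sample values are hypothetical, and the data avoids zero denominators, which PromQL would turn into +Inf or NaN rather than an error.

```python
def divide_many_to_one(many, one, on):
    """Sketch of `many / on(<on>) group_left one` for instant vectors.

    Each vector is a list of (labels_dict, value) pairs.
    """
    lookup = {}
    for labels, value in one:
        key = tuple(labels.get(name, "") for name in on)
        if key in lookup:
            # PromQL rejects the query when the "one" side is not unique per key.
            raise ValueError(f"duplicate series on the 'one' side for {key}")
        lookup[key] = value

    result = []
    for labels, value in many:
        key = tuple(labels.get(name, "") for name in on)
        if key in lookup:  # series without a partner are dropped
            result.append((labels, value / lookup[key]))
    return result


per_instance = [
    ({"instance": "web-01", "cluster": "eu"}, 2.0),
    ({"instance": "web-02", "cluster": "eu"}, 6.0),
    ({"instance": "web-03", "cluster": "us"}, 3.0),
]
per_cluster = [({"cluster": "eu"}, 8.0), ({"cluster": "us"}, 3.0)]

shares = divide_many_to_one(per_instance, per_cluster, on=["cluster"])
for labels, value in shares:
    print(labels["instance"], value)
```

The result keeps the labels of the "many" side, which is why each output series still carries its instance label. If the "one" side had two series for the same cluster, the match would be ambiguous, and that is the situation in which Prometheus reports a matching error.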

Histogram Queries

# 99th percentile request duration
histogram_quantile(0.99,
  sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
)

# 50th and 90th percentiles (run as two separate queries)
histogram_quantile(0.50,
  rate(http_request_duration_seconds_bucket[5m])
)
histogram_quantile(0.90,
  rate(http_request_duration_seconds_bucket[5m])
)

# Average from histogram (requires _sum and _count)
rate(http_request_duration_seconds_sum[5m])
/
rate(http_request_duration_seconds_count[5m])
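
To see why histogram_quantile returns an estimate rather than an exact value, it helps to walk through the interpolation by hand. The Python sketch below follows the general approach for classic histograms: find the bucket that contains the target rank, then interpolate linearly inside it. It assumes buckets are (upper bound, cumulative count) pairs ending in +Inf, supports only 0 < q <= 1, and omits several edge cases that Prometheus handles. The function is a hypothetical illustration, not the Prometheus implementation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    if not 0 < q <= 1:
        raise ValueError("this sketch only supports 0 < q <= 1")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if len(buckets) < 2 or buckets[-1][0] != math.inf or total == 0:
        return math.nan

    rank = q * total  # how many observations lie at or below the quantile
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break

    if upper == math.inf:
        # The quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]

    # The lowest bucket is assumed to start at zero.
    lower, below = (0.0, 0.0) if i == 0 else buckets[i - 1]
    # Assume observations are spread evenly between the bucket bounds.
    return lower + (upper - lower) * (rank - below) / (count - below)


# 100 requests: 50 took <= 0.1s, 90 took <= 0.5s, 99 took <= 1s.
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]

print(histogram_quantile(0.50, buckets))
print(histogram_quantile(0.95, buckets))
print(histogram_quantile(0.999, buckets))
```

The 0.95 estimate lands between 0.5 and 1.0 purely because of the linear interpolation. The true value could be anywhere in that bucket, so bucket boundaries should be chosen close to the latencies you care about.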

Subqueries

# Maximum 5m idle-CPU rate over the last hour, evaluated at 5-minute resolution
max_over_time(
  rate(node_cpu_seconds_total{mode="idle"}[5m])
[1h:5m]
)

# 99th percentile, over the last 6 hours, of the p99 request duration sampled every 15 minutes
quantile_over_time(0.99,
  histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{service="api"}[5m])))
[6h:15m]
)

# Trend (per-second slope) of the idle CPU rate over the last day
deriv(
  rate(node_cpu_seconds_total{mode="idle"}[5m])
[24h:5m]
)
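
A subquery such as expr[1h:5m] evaluates the inner expression repeatedly and hands the resulting points to the outer function as a range vector. The Python sketch below lists the evaluation timestamps under the assumption that they are aligned to multiples of the step in absolute time, not to the query's own evaluation time. The helper name is hypothetical, and the exact handling of samples that fall precisely on the window boundary is left out because it is a detail that may vary between Prometheus versions.

```python
def subquery_eval_times(eval_time, range_seconds, step_seconds):
    """Timestamps at which the inner expression of `expr[range:step]` runs.

    Assumes evaluation times are multiples of the step and lie inside the window.
    """
    window_start = eval_time - range_seconds
    first = (window_start // step_seconds) * step_seconds
    if first < window_start:
        first += step_seconds  # move up to the first aligned timestamp in the window
    return list(range(first, eval_time + 1, step_seconds))


# Equivalent in spirit to [1h:5m], evaluated at t=3700 seconds.
times = subquery_eval_times(eval_time=3700, range_seconds=3600, step_seconds=300)
print(len(times), times[0], times[-1])
```

Each of these timestamps triggers a full evaluation of the inner expression, so a long range with a small step can be expensive. That cost is the main reason frequently used subqueries are often replaced by recording rules.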

Common Use Patterns

# CPU utilization percentage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory utilization percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk space usage percentage
(1 - (node_filesystem_avail_bytes{fstype!=""} /
      node_filesystem_size_bytes{fstype!=""})) * 100

# Error rate (percentage of 5xx responses)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100

# Request rate per service
sum by (service) (rate(http_requests_total[5m]))

# Pod restarts in last hour
changes(kube_pod_container_status_restarts_total[1h])

# Per-second disk I/O
rate(node_disk_read_bytes_total[5m]) + rate(node_disk_written_bytes_total[5m])

# Network throughput in Mbps
rate(node_network_receive_bytes_total[5m]) * 8 / 1000000

# Container memory usage as percentage of limit
container_memory_working_set_bytes / container_spec_memory_limit_bytes * 100

Common Query Mistakes

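Mixing instant and range vectors incorrectly: metric[5m] on its own is a range vector, while rate(metric[5m]) returns an instant vector. Functions like rate, increase, and avg_over_time require range vector input, and a bare range vector cannot be graphed or used in arithmetic.
Forgetting label matching in binary expressions: metric_a / metric_b only pairs series whose label sets are identical. Use on() or ignoring() to restrict the labels that are compared, and group_left or group_right when one side has several series per match.
Using rate() on gauges: rate() and increase() are only meaningful on counters (monotonically increasing values that may reset to zero). For gauges, use avg_over_time, max_over_time, delta, or deriv.
Not handling stale or missing series: an instant selector returns the most recent sample within the lookback window (5 minutes by default) unless the series has been marked stale, and after that the query returns nothing for the series rather than zero. PromQL has no default() function; use "or vector(0)" to supply a fallback value, absent() to alert on a missing series, or unless to exclude series.
Ignoring bucket bounds in histogram queries: histogram_quantile needs the le label (the inclusive upper bound of each bucket). Aggregating le away, for example with sum by (service) instead of sum by (le, service), breaks the calculation, and coarse bucket boundaries produce imprecise percentiles.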
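
The staleness item above can be sketched as a lookup with a time limit: an instant query takes the newest sample that is no older than the lookback window and returns nothing if there is none. The Python below is a simplified illustration that assumes the default 5-minute lookback, ignores staleness markers, and uses hypothetical helper names. The fallback function mimics the effect of appending "or vector(0)" to a query.

```python
LOOKBACK_SECONDS = 300  # default lookback window of 5 minutes


def instant_value(samples, eval_time, lookback=LOOKBACK_SECONDS):
    """Return the newest sample value within the lookback window, or None."""
    candidates = [(t, v) for t, v in samples if eval_time - lookback < t <= eval_time]
    if not candidates:
        return None  # the series is simply absent from the result
    return max(candidates)[1]  # tuple comparison picks the latest timestamp


def with_fallback(value, default=0.0):
    """Mimic `<expr> or vector(0)`: substitute a default when nothing is returned."""
    return default if value is None else value


# A target that reported every 15 seconds and then stopped after t=30.
samples = [(0, 1.0), (15, 1.0), (30, 1.0)]

print(instant_value(samples, eval_time=200))                 # last sample is still in the window
print(instant_value(samples, eval_time=400))                 # window has moved past the last sample
print(with_fallback(instant_value(samples, eval_time=400)))  # fallback value instead of nothing
```

The middle case is the one that causes trouble in dashboards and alerts: the query does not return zero, it returns no series at all, so comparisons against a threshold never fire unless a fallback or an absent() check is in place.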

Practice Questions

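What is the difference between rate() and increase()? Answer: rate() calculates the per-second average rate of increase over a range. increase() calculates the total increase over the range, which equals rate() multiplied by the range length in seconds. rate() is preferred for alerting because it normalizes to per-second values regardless of the window size.
How does histogram_quantile work? Answer: histogram_quantile estimates a quantile (for example 0.99 for the 99th percentile) from a histogram's cumulative bucket counters. It finds the bucket that contains the target rank and interpolates linearly between that bucket's bounds, so accuracy depends on how finely the buckets are spaced.
When would you use group_left or group_right? Answer: These modifiers allow many-to-one or one-to-many vector matching when several series on one side match a single series on the other (e.g., many instances per cluster). group_left marks the left operand as the "many" side, and group_right marks the right operand.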
What is the purpose of recording rules with PromQL? Answer: Recording rules pre-compute expensive PromQL queries into new time series, reducing dashboard load times and alert evaluation latency.

Challenge

Write PromQL queries for a complete application monitoring dashboard: request rate (rps) by endpoint and HTTP status code, 99th/95th/50th percentile latency, error rate percentage, active requests count, database query duration by operation type, Garbage Collection pauses, and throughput per second. Create recording rules for the most expensive queries and set up alerting rules for high error rate and latency SLO violations.

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
