PromQL: Prometheus Query Language Complete Guide
In this tutorial, you'll learn about PromQL. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.
PromQL (Prometheus Query Language) is a functional query language for Prometheus that lets you select, aggregate, transform, and analyze time series data in real time for dashboards and alerting.
Why It Matters
Monitoring data is useless if you cannot ask meaningful questions of it. PromQL queries power every Grafana dashboard, alert, and ad hoc investigation built on Prometheus. DodaTech's SRE team spends 40% of its incident response time writing and refining PromQL queries to understand system behavior. Mastering PromQL cuts that time to minutes.
Real-World Use
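When DodaZIP's API latency spiked, the SRE team used a single PromQL query to identify the specific pod, namespace, and database operation causing the slowdown: histogram_quantile(0.99, sum by (le, namespace, pod, db_operation) (rate(http_request_duration_seconds_bucket{service="api"}[5m]))). Because histogram_quantile is a function rather than an aggregation operator, the grouping has to happen in a sum by (...) inside it, and the le label must be kept.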
flowchart LR
A[Metrics] --> B[PromQL Query]
B --> C[Instant Vector]
B --> D[Range Vector]
B --> E[Scalar]
B --> F[String]
C --> G[Alerting Rules]
D --> H[Recording Rules]
C --> I[Grafana Dashboard]
H --> C
style B fill:#E6522C,color:#fff
Prerequisites: Basic understanding of Prometheus metrics model and Grafana dashboards.
Core Data Types
PromQL operates on four data types:
| Type | Description | Example |
|---|---|---|
| Instant Vector | Set of time series at a single timestamp | node_cpu_seconds_total |
| Range Vector | Set of time series over a time range | node_cpu_seconds_total[5m] |
| Scalar | A single numeric value | 1024 |
| String | A string value (rarely used) | "production" |
Metric Selectors
# Simple selector: returns all series with this metric name
node_memory_MemTotal_bytes
# Selector with label matcher
node_cpu_seconds_total{mode="idle", instance="web-01:9100"}
# Supported label matchers:
# = Equal
# != Not equal
# =~ Regex match
# !~ Regex not match
node_cpu_seconds_total{instance=~"web-.*", mode!="idle"}
# Select by metric name pattern
{__name__=~"node_.*_total", instance="web-01:9100"}
Range Vectors
# Rate over 5 minutes
rate(node_cpu_seconds_total{mode="idle"}[5m])
# Increase over 1 hour
increase(node_network_receive_bytes_total[1h])
# Average over 30 minutes
avg_over_time(node_memory_MemFree_bytes[30m])
# Maximum over 10 minutes
max_over_time(http_request_duration_seconds_max[10m])
# Quantile over 5 minutes
quantile_over_time(0.99, http_request_duration_seconds[5m])
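To make the range-vector functions above more concrete, here is a minimal Python sketch of what rate(), increase(), and avg_over_time() compute from the samples inside one window. It is a simplification under stated assumptions: the function names (simple_increase, simple_rate, simple_avg_over_time) and the sample data are hypothetical, and Prometheus additionally extrapolates rate() and increase() towards the window boundaries, which this sketch leaves out.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, value)


def simple_increase(samples: List[Sample]) -> float:
    """Total counter increase across the window, treating any drop as a reset."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A counter that goes down is assumed to have restarted from zero,
        # so the whole new value counts as increase.
        total += cur if cur < prev else cur - prev
    return total


def simple_rate(samples: List[Sample]) -> float:
    """Per-second average increase between the first and last sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples in the window")
    elapsed = samples[-1][0] - samples[0][0]
    return simple_increase(samples) / elapsed


def simple_avg_over_time(samples: List[Sample]) -> float:
    """Plain mean of the sample values, the kind of function suited to gauges."""
    if not samples:
        raise ValueError("need at least one sample in the window")
    return sum(value for _, value in samples) / len(samples)


# Hypothetical one-minute window scraped every 15 s; the counter resets at t=45.
window = [(0.0, 100.0), (15.0, 130.0), (30.0, 160.0), (45.0, 10.0), (60.0, 40.0)]
print(simple_increase(window))       # 30 + 30 + 10 (after reset) + 30
print(simple_rate(window))           # increase divided by 60 seconds
print(simple_avg_over_time(window))  # mean of the five raw values
```

The reset handling is the reason rate() and increase() belong on counters only: applied to a gauge, every normal decrease would be misread as a restart.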
Functions
# Rate of increase per second
rate(node_cpu_seconds_total[5m])
# Absolute increase over interval
increase(node_network_receive_bytes_total[1h])
# irate: instant rate from the last two samples in the window, useful for spiky, fast-moving counters
irate(node_cpu_seconds_total[5m])
# Derivative: per-second slope from a linear regression (use on gauges only)
deriv(node_filesystem_free_bytes[1h])
# Linear forecast: predicted value 3600 seconds (1 hour) from now
predict_linear(node_filesystem_free_bytes[6h], 3600)
# Count of series
count(node_cpu_seconds_total)
# Top k by value
topk(5, node_memory_MemFree_bytes)
# Bottom k
bottomk(3, rate(http_requests_total[5m]))
# Quantile across series (an aggregation operator; for histogram latency use histogram_quantile instead)
quantile(0.99, rate(http_requests_total[5m]))
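The regression-based functions deriv() and predict_linear() can be sketched as an ordinary least-squares fit over the samples in the window. This is an illustrative sketch, not Prometheus source code: linear_fit, simple_deriv, simple_predict_linear, and the sample values are all hypothetical names and data chosen for the example.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, gauge value)


def linear_fit(samples: List[Sample], eval_time: float) -> Tuple[float, float]:
    """Least-squares line through the samples.

    Returns (slope per second, fitted value at eval_time).
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    # Shift timestamps so eval_time is the origin; the intercept is then
    # the fitted value at evaluation time.
    xs = [t - eval_time for t, _ in samples]
    ys = [v for _, v in samples]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    if var == 0:
        raise ValueError("samples must have distinct timestamps")
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept


def simple_deriv(samples: List[Sample], eval_time: float) -> float:
    """Per-second slope of the fitted line."""
    return linear_fit(samples, eval_time)[0]


def simple_predict_linear(samples: List[Sample], eval_time: float, seconds_ahead: float) -> float:
    """Fitted value at evaluation time plus slope times the look-ahead."""
    slope, value_now = linear_fit(samples, eval_time)
    return value_now + slope * seconds_ahead


# Hypothetical free-space gauge that loses 10 bytes per second.
free_bytes = [(0.0, 100000.0), (60.0, 99400.0), (120.0, 98800.0)]
print(simple_deriv(free_bytes, eval_time=120.0))
print(simple_predict_linear(free_bytes, eval_time=120.0, seconds_ahead=3600.0))
```

Because the fit is a straight line, the forecast is only as good as the assumption that the recent trend continues, which is why a longer window such as 6h is a common choice for disk-full alerts.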
Aggregation Operators
# Sum by label
sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
# Average by namespace
avg by (namespace) (rate(container_cpu_usage_seconds_total[5m]))
# Max across all instances
max(rate(node_network_receive_bytes_total[5m]))
# Min by service
min by (service) (http_request_duration_seconds_count)
# Count of instances per service
count by (service) (up)
# Standard deviation
stddev by (namespace) (container_memory_working_set_bytes)
# Variance
stdvar by (instance) (rate(node_cpu_seconds_total[5m]))
Operators
# Arithmetic operators
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
rate(node_cpu_seconds_total{mode="idle"}[5m]) * 100
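# Comparison operators (filter: only series for which the comparison holds are returned)
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes > 0.2
# Boolean mode: the bool modifier returns 0 or 1 for every series instead of filtering
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes > bool 0.2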
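# Logical/set operators
# Union (or): CPU rate series of web instances, plus any memory series whose label set is not already on the left
rate(node_cpu_seconds_total{instance=~"web.*"}[5m]) or node_memory_MemFree_bytes{instance=~"web.*"}
# Intersection (and): memory series for instances that also report CPU metrics
node_memory_MemTotal_bytes{instance=~"web.*"} and on(instance) node_cpu_seconds_total{instance=~"web.*"}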
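The two behaviours of a comparison operator are easy to confuse, so here is a small Python sketch of the difference. It models an instant vector as a dictionary keyed by a single instance label; the function name greater_than and the sample values are hypothetical.

```python
from typing import Dict

InstantVector = Dict[str, float]  # instance label -> sample value


def greater_than(vector: InstantVector, threshold: float, bool_mode: bool = False) -> InstantVector:
    """Sketch of `vector > threshold` and `vector > bool threshold`."""
    if bool_mode:
        # With the bool modifier every series survives and its value
        # becomes 1 (comparison true) or 0 (comparison false).
        return {inst: 1.0 if value > threshold else 0.0 for inst, value in vector.items()}
    # Without bool the comparison acts as a filter: matching series keep
    # their original value, all others are dropped from the result.
    return {inst: value for inst, value in vector.items() if value > threshold}


# Hypothetical ratio of available to total memory per instance.
available_ratio = {"web-01:9100": 0.35, "web-02:9100": 0.12}
print(greater_than(available_ratio, 0.2))
print(greater_than(available_ratio, 0.2, bool_mode=True))
```

The filtering form is what alerting rules rely on: an alert fires for exactly the series that remain in the result.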
Vector Matching
# One-to-one matching (label sets must be identical once status is ignored)
rate(http_requests_total{status="500"}[5m])
/
ignoring(status) sum without (status) (rate(http_requests_total[5m]))
# Many-to-one using group_left: each series as a fraction of its namespace total
rate(http_requests_total[5m])
/
on(namespace) group_left sum by (namespace) (rate(http_requests_total[5m]))
# One-to-many using group_right (the "many" side is the right-hand operand)
1 / count by (cluster) (node_cpu_seconds_total)
*
on(cluster) group_right
sum by (instance, cluster) (rate(node_cpu_seconds_total[5m]))
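Vector matching is essentially a join on a chosen set of labels. The following Python sketch shows the idea for a many-to-one division with on(...) group_left. It is a simplified model under assumptions: the helper names (labels, match_key, divide_group_left) and the data are hypothetical, metric names are not modeled, and extra labels copied from the "one" side are not supported.

```python
from typing import Dict, Iterable, Tuple

Labels = Tuple[Tuple[str, str], ...]  # sorted (label name, label value) pairs
InstantVector = Dict[Labels, float]


def labels(**kv: str) -> Labels:
    """Build a hashable, order-independent label set."""
    return tuple(sorted(kv.items()))


def match_key(series: Labels, on: Iterable[str]) -> Labels:
    """Keep only the labels named in on(...); this is the join key."""
    wanted = set(on)
    return tuple((k, v) for k, v in series if k in wanted)


def divide_group_left(many: InstantVector, one: InstantVector, on: Iterable[str]) -> InstantVector:
    """Sketch of `many / on(<labels>) group_left one`."""
    on = tuple(on)
    one_side: Dict[Labels, float] = {}
    for series, value in one.items():
        key = match_key(series, on)
        if key in one_side:
            # The "one" side must be unique per join key.
            raise ValueError(f"multiple matches on the 'one' side for {dict(key)}")
        one_side[key] = value
    result: InstantVector = {}
    for series, value in many.items():
        key = match_key(series, on)
        if key in one_side:  # series without a partner are dropped
            # Note: PromQL yields Inf/NaN on division by zero; Python would raise.
            result[series] = value / one_side[key]
    return result


# Hypothetical per-pod request rates and one per-namespace total.
per_pod = {
    labels(namespace="shop", pod="a"): 30.0,
    labels(namespace="shop", pod="b"): 10.0,
    labels(namespace="auth", pod="c"): 5.0,
}
per_namespace = {labels(namespace="shop"): 40.0}
print(divide_group_left(per_pod, per_namespace, on=["namespace"]))
```

The result keeps the label sets of the "many" side, and the auth pod disappears because its namespace has no partner on the right. group_right is the mirror image, with the "many" side as the right-hand operand.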
Histogram Queries
# 99th percentile request duration
histogram_quantile(0.99,
sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
)
# 50th and 90th percentiles
histogram_quantile(0.50,
rate(http_request_duration_seconds_bucket[5m])
)
histogram_quantile(0.90,
rate(http_request_duration_seconds_bucket[5m])
)
# Average from histogram (requires _sum and _count)
rate(http_request_duration_seconds_sum[5m])
/
rate(http_request_duration_seconds_count[5m])
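To show why histogram_quantile returns an estimate rather than an exact value, here is a Python sketch of the linear interpolation it performs on classic cumulative buckets. The function name simple_histogram_quantile and the bucket data are hypothetical; the sketch only covers 0 < q <= 1 and assumes the cumulative counts never decrease.

```python
import math
from typing import List, Tuple

Bucket = Tuple[float, float]  # (upper bound "le", cumulative count or rate)


def simple_histogram_quantile(q: float, buckets: List[Bucket]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets."""
    if not 0 < q <= 1:
        raise ValueError("this sketch only covers 0 < q <= 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        return math.nan  # needs a +Inf bucket plus at least one finite bucket
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations at all
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # The rank falls into the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    upper, count = buckets[idx]
    if idx == 0:
        if upper <= 0:
            return upper
        lower, below = 0.0, 0.0  # the first bucket is assumed to start at zero
    else:
        lower, below = buckets[idx - 1]
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Hypothetical request-duration buckets in seconds.
duration_buckets = [(0.1, 50.0), (0.5, 90.0), (1.0, 98.0), (math.inf, 100.0)]
print(simple_histogram_quantile(0.50, duration_buckets))
print(simple_histogram_quantile(0.95, duration_buckets))
print(simple_histogram_quantile(0.99, duration_buckets))
```

With these buckets the 95th percentile lands inside the 0.5 to 1.0 bucket, while the 99th percentile falls into the +Inf bucket and is capped at 1.0. That cap is the practical reason to choose bucket bounds that extend past the latencies you care about.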
Subqueries
# Maximum 5-minute idle-CPU rate over the last hour, evaluated at 5-minute steps
max_over_time(
rate(node_cpu_seconds_total{mode="idle"}[5m])
[1h:5m]
)
# 99th percentile, over the last 6 hours, of the 5-minute p99 request duration (sampled every 15 minutes)
quantile_over_time(0.99,
histogram_quantile(0.99,
sum by (le) (rate(http_request_duration_seconds_bucket{service="api"}[5m])))
[6h:15m]
)
# Trend (per-second slope) of the idle CPU rate over the last day
deriv(
rate(node_cpu_seconds_total{mode="idle"}[5m])
[24h:5m]
)
Common Use Patterns
# CPU utilization percentage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory utilization percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Disk space usage percentage
(1 - (node_filesystem_avail_bytes{fstype!=""} /
node_filesystem_size_bytes{fstype!=""})) * 100
# Error rate (percentage of 5xx responses)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100
# Request rate per service
sum by (service) (rate(http_requests_total[5m]))
# Pod restarts in last hour
changes(kube_pod_container_status_restarts_total[1h])
# Per-second disk I/O
rate(node_disk_read_bytes_total[5m]) + rate(node_disk_written_bytes_total[5m])
# Network throughput in Mbps
rate(node_network_receive_bytes_total[5m]) * 8 / 1000000
# Container memory usage as percentage of limit
container_memory_working_set_bytes / container_spec_memory_limit_bytes * 100
Common Query Mistakes
Mixing instant and range vectors incorrectly: rate(metric[5m]) returns an instant vector, but metric[5m] alone returns a range vector. Functions like rate, increase, and avg_over_time require range vector input.
Forgetting label matching in binary expressions: metric_a / metric_b requires matching label sets. Use on() or ignoring(), together with group_left or group_right, when the label sets differ.
Using rate on gauges: rate() and increase() are only meaningful on counters (monotonically increasing values). For gauges, use avg_over_time, max_over_time, delta, or deriv.
Not handling stale or missing metrics: An instant vector selector returns the most recent sample within the lookback window (5 minutes by default), so a series that stops receiving samples without being marked stale can keep returning its last value for up to 5 minutes. Use absent() to alert on a missing series, and append "or vector(0)" to an aggregation to supply a fallback value.
Ignoring bucket bounds in histogram queries: histogram_quantile requires the le label (less-than-or-equal bucket bounds). Aggregating le away, or missing or incorrect le values, produces wrong percentiles.
Practice Questions
What is the difference between rate() and increase()?
Answer: rate() calculates the per-second average rate of increase over a range. increase() calculates the total increase over the range. rate() is preferred for alerting because it normalizes to per-second values regardless of the window length.
How does histogram_quantile work?
Answer: histogram_quantile estimates a quantile (for example, 0.99 for the 99th percentile) from a histogram's cumulative bucket counters. It assumes observations are spread evenly within each bucket and interpolates linearly between the bucket boundaries, so accuracy depends on the bucket layout.
When would you use group_left or group_right?
Answer: These modifiers allow many-to-one or one-to-many vector matching when the left and right operands have different label cardinalities (e.g., many instances per cluster).
What is the purpose of recording rules with PromQL?
Answer: Recording rules pre-compute expensive PromQL queries into new time series, reducing dashboard load times and alert evaluation latency.
Challenge
Write PromQL queries for a complete application monitoring dashboard: request rate (rps) by endpoint and HTTP status code, 99th/95th/50th percentile latency, error rate percentage, active requests count, database query duration by operation type, garbage collection pauses, and throughput per second. Create recording rules for the most expensive queries and set up alerting rules for high error rate and latency SLO violations.
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.