Prometheus: Configuration, Service Discovery & Alerting Guide
This tutorial covers installing Prometheus, configuring scrape jobs and service discovery, writing recording and alerting rules, and routing alerts with Alertmanager, along with common configuration mistakes to avoid.
Prometheus is an open-source monitoring and alerting toolkit that collects metrics from configured targets at regular intervals, evaluates rule expressions, and triggers alerts when conditions are met.
What You'll Learn
Why It Matters
Traditional monitoring tools such as Nagios and Zabbix rely largely on agents and static, hand-maintained configuration, which makes them hard to manage in dynamic container environments. Prometheus instead pulls (scrapes) metrics from its targets and uses service discovery to find them, so targets are picked up and dropped automatically as they scale up and down. DodaTech monitors 1,200+ targets across 5 Kubernetes clusters using Prometheus with Consul and Kubernetes service discovery.
Real-World Use
DodaZIP's platform team runs Prometheus in high-availability mode with Thanos for long-term storage. Each Kubernetes cluster has a dedicated Prometheus that scrapes nodes, pods, services, and custom application metrics. Alertmanager routes alerts to PagerDuty for on-call engineers and Slack for team visibility.
flowchart TD
A[Prometheus Server] --> B[Service Discovery]
B --> C[Kubernetes API]
B --> D[Consul]
B --> E[File SD]
B --> F[EC2 / GCE]
A --> G[Scrape Targets]
G --> H[Node Exporters]
G --> I[Pods & Services]
G --> J[Custom App Metrics]
A --> K[Recording Rules]
A --> L[Alerting Rules]
L --> M[Alertmanager]
M --> N[PagerDuty]
M --> O[Slack]
M --> P[Email]
style A fill:#E6522C,color:#fff
style M fill:#E6522C,color:#fff
Prerequisites: Basic understanding of Linux administration and YAML syntax.
Installation
# Download Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.53.0/prometheus-2.53.0.linux-amd64.tar.gz
# Extract and install
tar xzf prometheus-2.53.0.linux-amd64.tar.gz
sudo mv prometheus-2.53.0.linux-amd64 /opt/prometheus
# Create prometheus user
sudo useradd --no-create-home --shell /bin/false prometheus
sudo chown -R prometheus:prometheus /opt/prometheus
# Verify
/opt/prometheus/prometheus --version
# Expected output (illustrative; revision, build user, and build date vary by build):
# prometheus, version 2.53.0 (branch: HEAD, revision: abc123)
# build user: root@abc123
# build date: 20260601
# go version: go1.22.4
# platform: linux/amd64
Basic Configuration
# /opt/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s
  external_labels:
    cluster: production-us-east
    environment: production
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager-01:9093
            - alertmanager-02:9093
      scheme: http
      timeout: 5s
rule_files:
  - /opt/prometheus/rules/recording-rules.yml
  - /opt/prometheus/rules/alerting-rules.yml
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets:
          - localhost:9090
        labels:
          service: monitoring
          component: prometheus
  - job_name: node
    consul_sd_configs:
      - server: consul.dodatech.com:8500
        services:
          - node_exporter
    relabel_configs:
      - source_labels: [__meta_consul_service_id]
        target_label: instance
      - source_labels: [__meta_consul_dc]
        target_label: datacenter
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: '(container_.*|etcd_.*)'
        action: drop
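Consul discovery returns every instance of the listed services. If only some of them should be scraped, a keep rule on the discovered tags is one way to filter. The fragment below is a sketch that assumes the node_exporter instances are registered in Consul with a tag named production; that tag is hypothetical and does not come from the configuration above.

```yaml
# Fragment: an extra rule for the relabel_configs list of the node job.
# Assumption: instances carry a Consul tag called "production" (hypothetical).
relabel_configs:
  # __meta_consul_tags holds the tags joined by the tag separator (a comma
  # by default), with a separator at both ends, e.g. ",production,linux,".
  - source_labels: [__meta_consul_tags]
    regex: '.*,production,.*'
    action: keep
```

Because the tag list is wrapped in separators, the regex does not need to care whether the tag is first, last, or in the middle.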
Service Discovery
# Kubernetes service discovery
scrape_configs:
  - job_name: kubernetes-nodes
    # Scrapes are rewritten below to go through the API server proxy, so the
    # scrape itself needs HTTPS and the service account credentials.
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    kubernetes_sd_configs:
      - role: node
        api_server: https://kubernetes.default.svc
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      # Copy Kubernetes node labels onto the target.
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      # Send every scrape to the API server instead of the node itself.
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true".
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Replace the discovered port with the one from prometheus.io/port.
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: kubernetes_pod_name
  # EC2 service discovery (AWS)
  # The keys below are AWS documentation placeholders. Prefer an IAM role or
  # the AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY environment variables over
  # hardcoding credentials in this file.
  - job_name: aws-ec2
    ec2_sd_configs:
      - region: us-east-1
        access_key: AKIAIOSFODNN7EXAMPLE
        secret_key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
        port: 9100
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Name]
        target_label: instance_name
      - source_labels: [__meta_ec2_tag_Environment]
        target_label: environment
      - source_labels: [__meta_ec2_availability_zone]
        target_label: availability_zone
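File-based service discovery appears in the architecture diagram but has no example in this section. A minimal sketch follows; the job name, directory, and host names are assumptions made for illustration.

```yaml
# prometheus.yml (fragment): watch target files on disk.
scrape_configs:
  - job_name: file-sd-example            # hypothetical job name
    file_sd_configs:
      - files:
          - /opt/prometheus/targets/*.yml   # assumed directory
        refresh_interval: 1m

# /opt/prometheus/targets/web.yml would be a separate file, for example:
# - targets:
#     - web-01.example.internal:9100     # hypothetical host
#     - web-02.example.internal:9100     # hypothetical host
#   labels:
#     environment: production
```

Prometheus re-reads the matched files when they change, so another tool can add or remove targets by rewriting them without a Prometheus restart.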
Recording Rules
# /opt/prometheus/rules/recording-rules.yml
groups:
  - name: compute_resources
    interval: 30s
    rules:
      - record: node:cpu_utilization_avg:rate5m
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
      - record: node:memory_utilization:ratio
        expr: 1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
      - record: node:disk_utilization:ratio
        expr: 1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)
      - record: namespace:pod_cpu_usage:sum
        expr: sum by (namespace) (rate(container_cpu_usage_seconds_total[5m]))
      - record: namespace:pod_memory_usage:sum
        expr: sum by (namespace) (container_memory_working_set_bytes)
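Once a recording rule exists, other rules and dashboards can query the recorded series by name instead of repeating the full expression. The sketch below shows an alerting rule built on node:memory_utilization:ratio; the group name, alert name, and 0.9 threshold are illustrative assumptions.

```yaml
# Fragment for an alerting rule file; names and threshold are illustrative.
groups:
  - name: recorded_series_alerts
    rules:
      - alert: NodeMemoryUtilizationHigh
        # Reads the series written by the node:memory_utilization:ratio rule.
        expr: node:memory_utilization:ratio > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Memory utilization is high"
          description: "Memory utilization on {{ $labels.instance }} is {{ $value | humanizePercentage }}"
```

The recorded value is a 0-1 ratio, which is the input humanizePercentage expects.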
Alerting Rules
# /opt/prometheus/rules/alerting-rules.yml
groups:
  - name: host_alerts
    interval: 30s
    rules:
      - alert: HostHighCpuLoad
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Host CPU load is high"
          # The expression already yields 0-100, so print it directly;
          # humanizePercentage expects a 0-1 ratio.
          description: 'CPU utilization on {{ $labels.instance }} is {{ printf "%.1f" $value }}%'
      - alert: HostOutOfMemory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Host out of memory"
          description: "Available memory on {{ $labels.instance }} is {{ $value | humanizePercentage }}"
      - alert: HostDiskRunningOut
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Host disk is running out of space"
          description: "Available disk space on {{ $labels.instance }} is {{ $value | humanizePercentage }}"
      - alert: HostUnreachable
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Host is unreachable"
          description: "Prometheus target {{ $labels.instance }} has been down for more than 2 minutes"
  - name: kubernetes_alerts
    interval: 30s
    rules:
      - alert: KubernetesPodCrashLooping
        # restarts_total is a cumulative counter, so alert on recent growth
        # (a 1h window is chosen here) rather than on the lifetime total.
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod is crash looping"
          description: 'Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has restarted {{ printf "%.0f" $value }} times in the last hour'
      - alert: KubernetesPodNotReady
        expr: kube_pod_status_ready{condition="true"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod is not ready"
          description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has been not ready for 5 minutes"
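Alerting rules can be exercised without a live failure by feeding synthetic samples to promtool test rules. The following is a sketch of a test file for the HostUnreachable alert. The test file name, its location next to alerting-rules.yml, and the job and instance values are assumptions.

```yaml
# alerting-rules-test.yml (hypothetical name), run with:
#   promtool test rules alerting-rules-test.yml
rule_files:
  - alerting-rules.yml          # assumed to sit in the same directory
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # A target that reports down (up == 0) for five consecutive samples.
      - series: 'up{job="node", instance="host-01"}'
        values: '0 0 0 0 0'
    alert_rule_test:
      # After 3 minutes the alert has outlasted its 2m "for" clause.
      - eval_time: 3m
        alertname: HostUnreachable
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node
              instance: host-01
            exp_annotations:
              summary: "Host is unreachable"
              description: "Prometheus target host-01 has been down for more than 2 minutes"
```

The expected labels and annotations must match the rendered alert exactly, so a test like this also catches template mistakes in annotations.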
Alertmanager Configuration
# /opt/alertmanager/alertmanager.yml
route:
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'pagerduty-critical'
  # Child routes are tried in order and the first match wins.
  routes:
    - match:
        severity: critical
      receiver: pagerduty-critical
      repeat_interval: 1h
    - match:
        severity: warning
      receiver: slack-warnings
      repeat_interval: 4h
    - match_re:
        service: ^(web|api)
      receiver: slack-web-team
receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: YOUR_PAGERDUTY_KEY
        severity: critical
        description: '{{ template "pagerduty.default.description" . }}'
  # slack.title, slack.text, and slack.web.title are custom templates; they
  # must be defined in a file loaded through a top-level templates: entry.
  - name: slack-warnings
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/T00/B00/XXXX'
        channel: '#devops-alerts'
        title: '{{ template "slack.title" . }}'
        text: '{{ template "slack.text" . }}'
        send_resolved: true
  - name: slack-web-team
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/T00/B00/YYYY'
        channel: '#web-team-alerts'
        title: '{{ template "slack.web.title" . }}'
        send_resolved: true
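Recent Alertmanager releases deprecate match and match_re in favor of a single matchers list. As a sketch, the same three child routes could be written as follows; the receivers are assumed to be defined exactly as above.

```yaml
# Fragment: the routing tree above, rewritten with the matchers syntax.
route:
  group_by: ['alertname', 'cluster']
  receiver: pagerduty-critical
  routes:
    - matchers:
        - severity = "critical"
      receiver: pagerduty-critical
      repeat_interval: 1h
    - matchers:
        - severity = "warning"
      receiver: slack-warnings
      repeat_interval: 4h
    - matchers:
        # Regex matchers are fully anchored, so this matches exactly
        # "web" or "api".
        - service =~ "web|api"
      receiver: slack-web-team
```

The older keys still load, so migrating is optional, but matchers keeps equality and regex conditions in one place.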
Common Configuration Mistakes
Setting scrape_interval too low: Scraping 1,000+ targets every 5 seconds can overwhelm the Prometheus server. Keep the 15s global value used above and shorten it only for specific jobs that need higher resolution.
Missing relabel_configs for Kubernetes SD: Kubernetes service discovery exposes its metadata as __meta_kubernetes_* labels, which are discarded once relabeling finishes, so without relabel rules targets end up with few useful labels. Map the __meta_kubernetes_* labels you need to meaningful target labels.
Not using recording rules for expensive queries: Complex PromQL queries that are re-run on every dashboard refresh or rule evaluation can time out. Pre-compute expensive aggregations with recording rules.
Ignoring retention settings: By default Prometheus keeps 15 days of data with no size limit, so disk usage on a busy server is unbounded. Set both --storage.tsdb.retention.time and --storage.tsdb.retention.size explicitly, for example 30d and 100GB.
Single Alertmanager without high availability: If the only Alertmanager goes down, Prometheus still evaluates its rules but no notifications are delivered. Run at least 2 Alertmanager instances and cluster them with --cluster.listen-address and --cluster.peer for gossip-based HA.
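For the first mistake above, a per-job override keeps the global interval moderate while still allowing one job to be scraped more often. This is a sketch; the job name and target are hypothetical.

```yaml
global:
  scrape_interval: 15s        # applies to every job without its own value
  scrape_timeout: 10s
scrape_configs:
  - job_name: latency-sensitive-api        # hypothetical job name
    scrape_interval: 5s                    # override for this job only
    scrape_timeout: 4s                     # must not exceed scrape_interval
    static_configs:
      - targets:
          - api-01.example.internal:8080   # hypothetical target
```

The job-level scrape_timeout matters here: the global 10s timeout would otherwise be inherited, and Prometheus rejects a timeout that is longer than the scrape interval.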
Practice Questions
What is the difference between pull and push monitoring? Answer: Pull-based (Prometheus) scrapes targets at defined intervals. Push-based (Graphite, InfluxDB) receives metrics from agents. Pull simplifies discovery and reduces attack surface.
How does service discovery work in Prometheus? Answer: Prometheus integrates with Kubernetes, Consul, EC2, GCE, Azure, and file-based SD to automatically discover targets and their labels without manual configuration.
What is the purpose of recording rules? Answer: Recording rules pre-compute frequently used or expensive queries and store them as new time series, reducing query load and speeding up dashboard rendering.
How does Alertmanager deduplicate alerts? Answer: Alertmanager groups similar alerts by labels (group_by), waits for group_wait to batch them, and sends consolidated notifications, reducing alert fatigue.
Challenge
Deploy a production Prometheus monitoring stack: configure Prometheus with Kubernetes and Consul service discovery, write recording rules for CPU, memory, and disk utilization by namespace, create alerting rules for host and pod health, set up Alertmanager with PagerDuty and Slack receivers, and test each alert by simulating failure scenarios.
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.