Metrics Collection: System and Application Metrics Explained
This tutorial explains how system and application metrics are collected with Prometheus. It covers the key concepts, practical examples, and best practices you need to apply them.
What You Will Learn
This tutorial teaches you how to collect system-level metrics (CPU, memory, disk, network) and application-level metrics (request rate, error rate, latency) using Prometheus exporters and client libraries.
Why It Matters
Without metrics you cannot detect anomalies, set alerts, or plan capacity. Comprehensive metrics collection is the foundation of every observability strategy.
Real-World Use
The DodaZIP compression service processes millions of files daily. Metrics collection revealed that CPU usage spikes every hour due to a scheduled antivirus scan, causing compression latency to triple. The team rescheduled the scan to off-peak hours after seeing the data.
Metrics are numeric measurements collected over time. They fall into four types: counters (always increase), gauges (go up and down), histograms (distribution of values), and summaries (quantile estimates). Prometheus provides client libraries for all major languages and a wide ecosystem of exporters.
Prerequisites
- A running Prometheus instance (see Prometheus Introduction)
- A Linux server with SSH access
- Docker installed for running exporters
- Basic knowledge of the Linux command line
Step-by-Step Tutorial
Step 1: Deploy the Node Exporter for System Metrics
docker run -d --name node_exporter \
  --net="host" \
  --pid="host" \
  -v "/:/host:ro,rslave" \
  quay.io/prometheus/node-exporter:latest \
  --path.rootfs=/host
Expected output: Node Exporter listens on port 9100. Visit http://localhost:9100/metrics to see hundreds of metrics.
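To get a feel for what that endpoint returns, the page can be summarised with a few lines of Python. This is a minimal sketch: `count_metric_families` and `fetch_metrics` are hypothetical helpers, not part of any Prometheus library, the sample text is made up, and the default URL assumes the Step 1 setup.

```python
import urllib.request


def count_metric_families(exposition_text):
    """Count metric families per type by reading '# TYPE <name> <type>' lines."""
    counts = {}
    for line in exposition_text.splitlines():
        parts = line.split()
        # A TYPE line looks like: # TYPE node_cpu_seconds_total counter
        if len(parts) == 4 and parts[0] == "#" and parts[1] == "TYPE":
            counts[parts[3]] = counts.get(parts[3], 0) + 1
    return counts


def fetch_metrics(url="http://localhost:9100/metrics"):
    """Download the raw metrics page (URL assumes the Step 1 setup)."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")


# Made-up excerpt in the Prometheus text exposition format.
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
"""

if __name__ == "__main__":
    print(count_metric_families(SAMPLE))
    # Against a live exporter: print(count_metric_families(fetch_metrics()))
```

Run against a live Node Exporter, most families should come back as counters or gauges.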
Step 2: Scrape Node Exporter in Prometheus
Add to prometheus.yml:
scrape_configs:
  - job_name: "node"
    static_configs:
      - targets: ["localhost:9100"]
Step 3: Key System Metrics to Monitor
# CPU usage percentage
100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Disk space usage
(node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_free_bytes{mountpoint="/"}) / node_filesystem_size_bytes{mountpoint="/"} * 100
# Network throughput
rate(node_network_receive_bytes_total[5m])
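The arithmetic behind the CPU and memory queries can be worked through by hand. The sketch below is a simplification under stated assumptions: `simple_rate` is a hypothetical helper that only divides the counter increase by the elapsed time, whereas the real PromQL `rate()` also extrapolates to the edges of the window, and all sample values are made up.

```python
def simple_rate(samples):
    """Per-second increase of a counter between the first and last sample.

    samples: list of (unix_time, counter_value) tuples, oldest first.
    Simplification: PromQL's rate() also extrapolates to the window edges.
    """
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the counter was reset (e.g. a reboot); count from zero.
        increase += (curr - prev) if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


def cpu_usage_percent(idle_rates):
    """100 minus the average idle fraction across cores, as in the CPU query."""
    return 100 - (sum(idle_rates) / len(idle_rates)) * 100


def memory_usage_percent(mem_available_bytes, mem_total_bytes):
    """Share of memory that is not available, as in the memory query."""
    return (1 - mem_available_bytes / mem_total_bytes) * 100


# Made-up samples of node_cpu_seconds_total{mode="idle"} for two cores, 60 s apart.
core0 = [(0, 1000.0), (60, 1045.0)]  # idle for 45 of 60 seconds
core1 = [(0, 2000.0), (60, 2030.0)]  # idle for 30 of 60 seconds

idle_rates = [simple_rate(core0), simple_rate(core1)]
print(cpu_usage_percent(idle_rates))                    # 37.5
print(memory_usage_percent(2 * 1024**3, 8 * 1024**3))  # 75.0
```

The idle rate of each core is a fraction between 0 and 1, which is why multiplying the average by 100 and subtracting from 100 yields a busy percentage.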
Step 4: Instrument a Python Application
Install the Prometheus client library:
pip install prometheus-client
Create app.py:
from prometheus_client import start_http_server, Counter, Histogram, Gauge
import time
import random

REQUEST_COUNT = Counter("app_requests_total", "Total requests", ["method", "endpoint"])
REQUEST_LATENCY = Histogram("app_request_latency_seconds", "Request latency", ["method"])
ACTIVE_USERS = Gauge("app_active_users", "Currently active users")


def handle_request(method, endpoint):
    REQUEST_COUNT.labels(method=method, endpoint=endpoint).inc()
    ACTIVE_USERS.inc()
    start = time.time()
    # Simulate work that takes between 10 ms and 500 ms
    time.sleep(random.uniform(0.01, 0.5))
    REQUEST_LATENCY.labels(method=method).observe(time.time() - start)
    ACTIVE_USERS.dec()


if __name__ == "__main__":
    # Expose metrics on port 8000 at /metrics
    start_http_server(8000)
    while True:
        handle_request("GET", "/api/users")
        time.sleep(0.1)
Expected output: Metrics available at http://localhost:8000/metrics.
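The app.py above tracks request count and latency but not failures. One possible way to add an error counter is sketched below. The metric names `app_request_errors_total` and `app_handler_latency_seconds` and the `handle` wrapper are illustrative assumptions, not part of the tutorial's application, and a separate registry is used so the sketch does not clash with app.py.

```python
from prometheus_client import CollectorRegistry, Counter, Histogram, generate_latest

# A separate registry keeps this sketch independent of app.py's default registry.
registry = CollectorRegistry()

# Illustrative metric names (assumptions, not taken from app.py).
ERROR_COUNT = Counter(
    "app_request_errors_total", "Total failed requests", ["endpoint"], registry=registry
)
LATENCY = Histogram(
    "app_handler_latency_seconds", "Handler latency", ["endpoint"], registry=registry
)


def handle(endpoint, work):
    """Run work() for an endpoint, timing it and counting any exception."""
    with LATENCY.labels(endpoint=endpoint).time():
        try:
            return work()
        except Exception:
            ERROR_COUNT.labels(endpoint=endpoint).inc()
            raise


def failing():
    raise ValueError("simulated failure")


handle("/api/users", lambda: "ok")
try:
    handle("/api/users", failing)
except ValueError:
    pass

# Print the metrics in the same text format that /metrics would serve.
print(generate_latest(registry).decode("utf-8"))
```

With a counter like this in place, an error rate could be derived in PromQL by dividing the rate of the error counter by the rate of the request counter.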
Step 5: Understand Histogram Quantiles
# p99 latency in seconds
histogram_quantile(0.99, rate(app_request_latency_seconds_bucket[5m]))
# Average latency
rate(app_request_latency_seconds_sum[5m]) / rate(app_request_latency_seconds_count[5m])
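A quantile computed from a histogram is an estimate, not an exact value, because only the bucket counts are known. The sketch below illustrates the idea with linear interpolation inside the bucket that contains the requested rank. It is a simplified model of what `histogram_quantile` does, not the Prometheus implementation; `estimate_quantile` is a hypothetical name and the bucket counts are made up.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound and
    ending with the (math.inf, total_count) bucket, like the *_bucket series.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Made-up cumulative counts for buckets with upper bounds in seconds.
buckets = [(0.1, 50), (0.25, 80), (0.5, 98), (1.0, 100), (math.inf, 100)]
print(round(estimate_quantile(0.99, buckets), 3))  # 0.75
print(round(estimate_quantile(0.5, buckets), 3))   # 0.1
```

The estimate can only be as precise as the bucket boundaries, so buckets should be chosen close to the latencies you care about.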
Step 6: Deploy Exporters for Common Services
# Note: with -p (bridge networking), localhost inside a container refers to the
# container itself. Replace localhost with an address the container can reach,
# or run the exporter with --net="host" as in Step 1.
# PostgreSQL Exporter
docker run -d --name pg_exporter \
  -e DATA_SOURCE_NAME="postgresql://user:pass@localhost:5432/db?sslmode=disable" \
  -p 9187:9187 \
  prometheuscommunity/postgres-exporter:latest
# Redis Exporter
docker run -d --name redis_exporter \
  -e REDIS_ADDR=redis://localhost:6379 \
  -p 9121:9121 \
  oliver006/redis_exporter:latest
# Nginx Exporter (requires nginx with stub_status)
docker run -d --name nginx_exporter \
  -e SCRAPE_URI=http://localhost:8080/nginx_status \
  -p 9113:9113 \
  nginx/nginx-prometheus-exporter:latest
Step 7: Create a Textfile Collector for Custom Scripts
The Node Exporter textfile collector lets you expose metrics from cron jobs:
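#!/bin/bash
# /usr/local/bin/db_backup_metrics.sh
# Write to a temp file and rename it, so Node Exporter never reads a partial file.
OUT=/var/lib/node_exporter/textfile_collector/backup.prom
{
  echo "# HELP db_backup_last_run_timestamp_seconds Unix time this script last ran"
  echo "# TYPE db_backup_last_run_timestamp_seconds gauge"
  echo "db_backup_last_run_timestamp_seconds $(date +%s)"
  echo "# HELP db_backup_size_bytes Size of /backup in bytes"
  echo "# TYPE db_backup_size_bytes gauge"
  echo "db_backup_size_bytes $(du -sb /backup | awk '{print $1}')"
} > "$OUT.$$" && mv "$OUT.$$" "$OUT"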
Add --collector.textfile.directory=/var/lib/node_exporter/textfile_collector to the Node Exporter startup flags.
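If the batch job is written in Python rather than shell, the same file can be produced with the standard library. This is a minimal sketch: `render_gauge` and `write_prom_file` are hypothetical helpers, the value 42.5 is a made-up duration, and the target directory is assumed to be the one passed to the collector flag.

```python
import os
import tempfile


def render_gauge(name, help_text, value):
    """Return one gauge in the Prometheus text exposition format."""
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name} {value}\n"
    )


def write_prom_file(path, content):
    """Write content atomically: temp file in the same directory, then rename."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    # mkstemp creates the file with mode 0600; the exporter must be able to read it.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, path)


if __name__ == "__main__":
    # Made-up value; a real job would measure its own run time.
    text = render_gauge("db_backup_duration_seconds", "Duration of last DB backup", 42.5)
    print(text, end="")
    # In a cron job, assuming the directory from this step exists:
    # write_prom_file("/var/lib/node_exporter/textfile_collector/backup.prom", text)
```

Renaming a finished temp file avoids the case where the exporter scrapes a half-written file, and the `.tmp` suffix keeps the temp file from being read as a metrics file.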
Step 8: Validate All Targets
curl http://localhost:9090/api/v1/targets
Look for all exporters in the UP state.
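The JSON response is long, so it can help to filter it for targets that are not healthy. The sketch below assumes the response shape of the targets endpoint (`data.activeTargets[]` with `labels`, `health`, and `lastError` fields); `unhealthy_targets` and `fetch_targets` are hypothetical helpers and the sample payload is made up.

```python
import json
import urllib.request


def unhealthy_targets(targets_response):
    """Return (job, instance, lastError) for every active target not reporting 'up'."""
    problems = []
    for target in targets_response["data"]["activeTargets"]:
        if target.get("health") != "up":
            labels = target.get("labels", {})
            problems.append(
                (labels.get("job"), labels.get("instance"), target.get("lastError", ""))
            )
    return problems


def fetch_targets(url="http://localhost:9090/api/v1/targets"):
    """Query Prometheus (URL assumes the default local setup from Step 8)."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response)


# Made-up response with one healthy and one failing target.
SAMPLE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "node", "instance": "localhost:9100"},
             "health": "up", "lastError": ""},
            {"labels": {"job": "postgres", "instance": "localhost:9187"},
             "health": "down", "lastError": "connection refused"},
        ]
    },
}

if __name__ == "__main__":
    for job, instance, error in unhealthy_targets(SAMPLE):
        print(f"{job} {instance}: {error}")
    # Against a live server: unhealthy_targets(fetch_targets())
```

An empty result means every active target reported `up` on its last scrape.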
Learning Path
flowchart LR
A[Metrics Collection] --> B[System Metrics]
A --> C[Application Metrics]
A --> D[Service Exporters]
B --> E[Node Exporter]
C --> F[Client Libraries]
D --> G[PostgreSQL/Redis/Nginx]
E --> H[Prometheus TSDB]
F --> H
G --> H
style A fill:#4a90d9,color:#fff
style H fill:#e67e22,color:#fff
Common Errors
Exporter shows 404 at /metrics -- The exporter URL or port is wrong. Check that the exporter process is listening on the expected port.
Node Exporter shows stale metrics -- The textfile collector file was not updated. Check cron timestamps and file permissions.
High cardinality from label values -- A label contains unique values (like user_id or email). Ensure labels are bounded by design.
histogram_quantile returns NaN -- No observations were recorded in the time window, so every bucket rate is zero. Increase the range or wait for more data.
Application metrics not showing up -- The Prometheus target configuration has the wrong port or path. Verify the /metrics endpoint returns data.
Docker exporter exits immediately -- Environment variables are missing or the target service is unreachable. Check logs with docker logs.
Prometheus rejects target with invalid labels -- Label names must match [a-zA-Z_][a-zA-Z0-9_]*. Replace hyphens with underscores.
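Two of the errors above, invalid label names and unbounded label values, can be caught in application code before the metrics are exposed. The helpers below are a sketch with hypothetical names; the `:id` placeholder and the rule of collapsing numeric path segments are illustrative choices, not a Prometheus convention.

```python
import re

# Pattern for label names quoted in the error list above.
LABEL_NAME_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def is_valid_label_name(name):
    """True if the whole name matches the allowed label-name pattern."""
    return LABEL_NAME_RE.fullmatch(name) is not None


def sanitize_label_name(name):
    """Replace disallowed characters with underscores; prefix names that start with a digit."""
    cleaned = re.sub(r"[^a-zA-Z0-9_]", "_", name)
    if not cleaned or cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned


def normalize_endpoint(path):
    """Collapse numeric path segments so the endpoint label stays bounded."""
    return re.sub(r"/\d+(?=/|$)", "/:id", path)


print(is_valid_label_name("http-status"))      # False
print(sanitize_label_name("http-status"))      # http_status
print(normalize_endpoint("/api/users/12345"))  # /api/users/:id
```

Normalizing the path before using it as a label value keeps the number of time series tied to the number of routes rather than the number of users.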
Practice Questions
What is the difference between a Counter and a Gauge? Answer: A Counter only increases (requests, errors). A Gauge can go up and down (memory, temperature).
What does the histogram_quantile function compute? Answer: It estimates a quantile (for example, 0.99 for the 99th percentile) of the observed values from the histogram's bucket counters.
Why does Prometheus prefer a pull model over a push model? Answer: The pull model simplifies service discovery, reduces coupling, and lets the monitoring system control collection frequency.
What port does the Node Exporter listen on by default? Answer: 9100.
How do you expose batch job metrics to Prometheus? Answer: Use the Textfile Collector: write metrics to a .prom file and have Node Exporter read it.
Challenge
Set up a complete metrics collection pipeline for a web application running behind Nginx with a PostgreSQL database. Deploy Node Exporter for the host, the Nginx exporter for web server metrics, and the PostgreSQL exporter for database metrics. Write a Python script that exposes custom application metrics (request rate, error rate, and latency percentiles). Configure Prometheus to scrape all four exporters. Verify that you can query a metric from each exporter using PromQL. Create a recording rule that combines CPU and disk metrics into a single "health score" metric.