Metrics Collection: System and Application Metrics Explained

DodaTech Updated 2026-06-23 6 min read

This tutorial explains how system and application metrics are collected with Prometheus. It covers the key concepts, practical examples, and best practices you need to apply them.

What You Will Learn

This tutorial teaches you how to collect system-level metrics (CPU, memory, disk, network) and application-level metrics (request rate, error rate, latency) using Prometheus exporters and client libraries.

Why It Matters

Without metrics, you cannot detect anomalies, set alerts, or plan capacity. Comprehensive metrics collection is the foundation of every observability strategy.

Real-World Use

The DodaZIP compression service processes millions of files daily. Metrics collection revealed that CPU usage spikes every hour due to a scheduled antivirus scan, causing compression latency to triple. The team rescheduled the scan to off-peak hours after seeing the data.

Metrics are numeric measurements collected over time. They fall into four types: counters (always increase), gauges (go up and down), histograms (distribution of values), and summaries (quantile estimates). Prometheus provides client libraries for all major languages and a wide ecosystem of exporters.

Prerequisites

A running Prometheus instance (see Prometheus Introduction)
A Linux server with SSH access
Docker installed for running exporters
Basic knowledge of the Linux command line

Step-by-Step Tutorial

Step 1: Deploy the Node Exporter for System Metrics

docker run -d --name node_exporter \
  --net="host" \
  --pid="host" \
  -v "/:/host:ro,rslave" \
  quay.io/prometheus/node-exporter:latest \
  --path.rootfs=/host

Expected output: Node Exporter listens on port 9100. Visit http://localhost:9100/metrics to see hundreds of metrics.

Step 2: Scrape Node Exporter in Prometheus

Add to prometheus.yml:

scrape_configs:
  - job_name: "node"
    static_configs:
      - targets: ["localhost:9100"]

Step 3: Key System Metrics to Monitor

# CPU usage percentage
100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk space usage
(node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_free_bytes{mountpoint="/"}) / node_filesystem_size_bytes{mountpoint="/"} * 100

# Network throughput
rate(node_network_receive_bytes_total[5m])
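
To make the CPU query less opaque, here is a rough sketch of the arithmetic behind it. The sample values and the function name are hypothetical, and the sketch ignores the counter-reset handling and extrapolation that the real rate() function performs.

```python
def cpu_usage_percent(idle_start, idle_end, window_seconds, num_cpus):
    """Approximate 100 - avg(rate(idle counter)) * 100 for one instance.

    idle_start / idle_end: idle CPU-seconds summed over all cores at the
    start and end of the window (hypothetical sample values).
    """
    # rate(): per-second increase of the counter over the window
    idle_rate_total = (idle_end - idle_start) / window_seconds
    # avg by(instance): average idle fraction across the cores
    idle_fraction = idle_rate_total / num_cpus
    return 100 - idle_fraction * 100


# 4 cores accumulated 900 idle CPU-seconds in a 5-minute (300 s) window
usage = cpu_usage_percent(idle_start=10_000.0, idle_end=10_900.0,
                          window_seconds=300, num_cpus=4)
print(f"{usage:.1f}%")
```

With these numbers the cores were idle 75% of the time on average, so the sketch reports 25% usage.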

Step 4: Instrument a Python Application

Install the Prometheus client library:

pip install prometheus-client

Create app.py:

from prometheus_client import start_http_server, Counter, Histogram, Gauge
import time
import random

# Counter: only ever increases; one time series per (method, endpoint) pair
REQUEST_COUNT = Counter("app_requests_total", "Total requests", ["method", "endpoint"])
# Histogram: records each latency observation into cumulative buckets
REQUEST_LATENCY = Histogram("app_request_latency_seconds", "Request latency", ["method"])
# Gauge: can go up and down; here it counts requests currently being handled
ACTIVE_USERS = Gauge("app_active_users", "Currently active users")

def handle_request(method, endpoint):
    REQUEST_COUNT.labels(method=method, endpoint=endpoint).inc()
    ACTIVE_USERS.inc()
    start = time.time()
    try:
        time.sleep(random.uniform(0.01, 0.5))  # simulated work
    finally:
        # Record latency and release the gauge even if the handler raises
        REQUEST_LATENCY.labels(method=method).observe(time.time() - start)
        ACTIVE_USERS.dec()

if __name__ == "__main__":
    start_http_server(8000)  # serves /metrics on port 8000
    while True:
        handle_request("GET", "/api/users")
        time.sleep(0.1)

Expected output: Metrics available at http://localhost:8000/metrics.

Step 5: Understand Histogram Quantiles

# p99 latency in seconds
histogram_quantile(0.99, rate(app_request_latency_seconds_bucket[5m]))

# Average latency
rate(app_request_latency_seconds_sum[5m]) / rate(app_request_latency_seconds_count[5m])
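
The following is a simplified sketch of how a quantile can be estimated from cumulative buckets by linear interpolation, which is the idea behind histogram_quantile. It is not Prometheus's actual implementation, and the bucket values are made up for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound,
    with math.inf as the last bound (the "+Inf" bucket).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1] if buckets else 0
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical buckets: 10 requests <= 0.1 s, 60 <= 0.5 s, 100 <= 1.0 s
sample_buckets = [(0.1, 10), (0.5, 60), (1.0, 100), (math.inf, 100)]
p50 = histogram_quantile(0.5, sample_buckets)
p99 = histogram_quantile(0.99, sample_buckets)
print(p50, p99)
```

Because the estimate is interpolated within a bucket, its accuracy depends on how closely the bucket boundaries match your real latency range. An empty window yields NaN, which is the situation described under Common Errors.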

Step 6: Deploy Exporters for Common Services

# The exporters use host networking so that "localhost" reaches services on the host.

# PostgreSQL Exporter (listens on port 9187)
docker run -d --name pg_exporter \
  --net="host" \
  -e DATA_SOURCE_NAME="postgresql://user:pass@localhost:5432/db?sslmode=disable" \
  prometheuscommunity/postgres-exporter:latest

# Redis Exporter (listens on port 9121)
docker run -d --name redis_exporter \
  --net="host" \
  -e REDIS_ADDR=redis://localhost:6379 \
  oliver006/redis_exporter:latest

# Nginx Exporter (listens on port 9113; requires nginx with stub_status)
docker run -d --name nginx_exporter \
  --net="host" \
  nginx/nginx-prometheus-exporter:latest \
  --nginx.scrape-uri=http://localhost:8080/nginx_status

Step 7: Create a Textfile Collector for Custom Scripts

The Node Exporter textfile collector lets you expose metrics from cron jobs:

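#!/bin/bash
# /usr/local/bin/db_backup_metrics.sh
# Writes backup metrics for the Node Exporter textfile collector.
OUT=/var/lib/node_exporter/textfile_collector/backup.prom
{
  echo "# HELP db_backup_last_run_timestamp_seconds Unix time this script last ran"
  echo "# TYPE db_backup_last_run_timestamp_seconds gauge"
  echo "db_backup_last_run_timestamp_seconds $(date +%s)"
  echo "# HELP db_backup_size_bytes Size of /backup in bytes"
  echo "# TYPE db_backup_size_bytes gauge"
  echo "db_backup_size_bytes $(du -sb /backup | awk '{print $1}')"
} > "$OUT.tmp" && mv "$OUT.tmp" "$OUT"  # rename so the collector never reads a partial file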

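Add --collector.textfile.directory=/var/lib/node_exporter/textfile_collector to the Node Exporter startup flags. If Node Exporter runs in the container from Step 1, the host filesystem is mounted at /host, so point the flag at /host/var/lib/node_exporter/textfile_collector instead. Run the script from cron so the file stays fresh.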
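
If your batch job is written in Python, the same idea can be sketched without a shell script. The helper below is a minimal illustration, not an official API: the function name and the metric name are hypothetical. It writes to a temporary file first and then renames it, so the collector never reads a half-written file.

```python
import os
import tempfile
import time


def write_prom_file(directory, filename, metrics):
    """Write gauges in the text exposition format, then rename into place.

    metrics: dict mapping metric name -> (help text, numeric value).
    """
    lines = []
    for name, (help_text, value) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    # Temp file lives in the same directory so os.replace is an atomic rename
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(lines) + "\n")
    os.chmod(tmp_path, 0o644)  # mkstemp creates mode 0600; the exporter must be able to read it
    final_path = os.path.join(directory, filename)
    os.replace(tmp_path, final_path)
    return final_path


# Demo against a throwaway directory; in production this would be the
# textfile collector directory passed to Node Exporter.
demo_dir = tempfile.mkdtemp()
path = write_prom_file(
    demo_dir,
    "backup.prom",
    {"db_backup_last_success_timestamp_seconds":  # hypothetical metric name
     ("Unix time of the last successful backup", int(time.time()))},
)
with open(path) as f:
    content = f.read()
print(content)
```

Only files ending in .prom are read by the collector, so the temporary file is ignored until the rename completes.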

Step 8: Validate All Targets

curl http://localhost:9090/api/v1/targets

Look for all exporters in the UP state.
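
Reading the raw JSON by eye gets tedious with many targets. As a minimal sketch, assuming Prometheus is reachable at http://localhost:9090 as in the curl command, the script below filters the response down to targets that are not up. The helper names and the sample payload are hypothetical, and the payload is heavily abbreviated compared with a real response.

```python
import json
import urllib.request


def fetch_targets(base_url="http://localhost:9090"):
    """Fetch the targets API response (requires a running Prometheus)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


def unhealthy_targets(payload):
    """Return (job, scrape URL, last error) for each active target that is not up."""
    active = payload.get("data", {}).get("activeTargets", [])
    return [
        (t.get("labels", {}).get("job", "?"), t.get("scrapeUrl", "?"), t.get("lastError", ""))
        for t in active
        if t.get("health") != "up"
    ]


# Abbreviated, made-up payload in the shape returned by /api/v1/targets
sample_payload = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "node", "instance": "localhost:9100"},
             "scrapeUrl": "http://localhost:9100/metrics",
             "health": "up", "lastError": ""},
            {"labels": {"job": "app", "instance": "localhost:8000"},
             "scrapeUrl": "http://localhost:8000/metrics",
             "health": "down", "lastError": "connection refused"},
        ]
    },
}
down = unhealthy_targets(sample_payload)
print(down)
```

Against a live server, call unhealthy_targets(fetch_targets()) and treat an empty list as the success condition.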

Learning Path

flowchart LR
    A[Metrics Collection] --> B[System Metrics]
    A --> C[Application Metrics]
    A --> D[Service Exporters]
    B --> E[Node Exporter]
    C --> F[Client Libraries]
    D --> G[PostgreSQL/Redis/Nginx]
    E --> H[Prometheus TSDB]
    F --> H
    G --> H
    style A fill:#4a90d9,color:#fff
    style H fill:#e67e22,color:#fff

Common Errors

Exporter shows 404 at /metrics -- The exporter URL or port is wrong. Check that the exporter process is listening on the expected port.
Node Exporter shows stale metrics -- The textfile collector file was not updated. Check cron timestamps and file permissions.
High cardinality from label values -- A label contains unique values (like user_id or email). Ensure labels are bounded by design.
histogram_quantile returns NaN -- Not enough samples in the time window. Increase the range or wait for more data.
Application metrics not showing up -- The Prometheus target configuration has the wrong port or path. Verify the /metrics endpoint returns data.
Docker exporter exits immediately -- Environment variables are missing or the target service is unreachable. Check logs with docker logs.
Prometheus rejects target with invalid labels -- Label names must match [a-zA-Z_][a-zA-Z0-9_]*. Replace hyphens with underscores.
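
For the invalid-label error above, one way to catch bad names before they reach Prometheus is to normalize them in your own code. This is a small sketch using the pattern quoted in the error description; the function name is hypothetical.

```python
import re

# Pattern for valid label names, as quoted in the error description above
LABEL_NAME_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def sanitize_label_name(name):
    """Replace invalid characters with underscores and fix a leading digit."""
    cleaned = re.sub(r"[^a-zA-Z0-9_]", "_", name)
    if not cleaned or cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned


for raw in ["http-status", "9lives", "region.name"]:
    fixed = sanitize_label_name(raw)
    print(raw, "->", fixed, bool(LABEL_NAME_RE.fullmatch(fixed)))
```

Note that label names beginning with two underscores are reserved for internal use, so avoid producing those for your own labels.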

Practice Questions

What is the difference between a Counter and a Gauge? Answer: A Counter only increases (requests, errors). A Gauge can go up and down (memory, temperature).
What does the histogram_quantile function compute? Answer: It estimates a quantile (for example, 0.99 for the 99th percentile) from a histogram's cumulative bucket counters by interpolating within the bucket that contains it.
Why does Prometheus prefer a pull model over a push model? Answer: The pull model simplifies service discovery, reduces coupling, and lets the monitoring system control collection frequency.
What port does the Node Exporter listen on by default? Answer: 9100.
How do you expose batch job metrics to Prometheus? Answer: Use the Textfile Collector: write metrics to a .prom file and have Node Exporter read it.

Challenge

Set up a complete metrics collection pipeline for a web application running behind Nginx with a PostgreSQL database. Deploy Node Exporter for the host, the Nginx exporter for web server metrics, and the PostgreSQL exporter for database metrics. Write a Python script that exposes custom application metrics (request rate, error rate, and latency percentiles). Configure Prometheus to scrape all four exporters. Verify that you can query a metric from each exporter using PromQL. Create a recording rule that combines CPU and disk metrics into a single "health score" metric.

FAQ

What is the difference between a histogram and a summary?

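A histogram counts observations into configurable buckets, and quantiles are estimated at query time on the Prometheus server; a summary computes streaming quantiles on the client side. Histogram buckets can be aggregated across instances and services, while precomputed summary quantiles cannot, so histograms are usually the more flexible choice.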

How often should I scrape metrics?

Every 10-60 seconds depending on the volatility of the metric. CPU and request rates benefit from 15s intervals; disk space can be scraped every 60s.

Can I collect metrics without Prometheus?

Yes. collectd, Telegraf, and StatsD are alternatives. Prometheus is widely adopted because of its rich query language and exporter ecosystem.

What is the cardinality problem?

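Cardinality is the number of unique label combinations, and each combination is a separate time series. If a label has 1 million values, Prometheus tracks at least 1 million time series for that metric. Memory use grows with the number of active series, so this can exhaust memory and crash Prometheus.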
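
The arithmetic is easy to check for your own metrics. The sketch below computes the worst-case series count for a single metric as the product of distinct values per label; the label sets are invented for illustration, and real counts are often lower because not every combination occurs.

```python
from math import prod


def max_series(label_values):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(len(values) for values in label_values.values())


# Bounded labels: a small, fixed set of values each
bounded = {
    "method": ["GET", "POST"],
    "endpoint": ["/api/users", "/api/orders", "/health"],
    "status": ["2xx", "4xx", "5xx"],
}
bounded_count = max_series(bounded)

# Adding an unbounded label such as user_id multiplies everything
unbounded = dict(bounded, user_id=range(1_000_000))
unbounded_count = max_series(unbounded)

print(bounded_count, unbounded_count)
```

Here three bounded labels give at most 18 series, while adding a user_id label with a million values raises the ceiling to 18 million.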

How do I test metric exposure locally?

Run your exporter and visit http://localhost:PORT/metrics in a browser or use curl to verify the output format.
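
To go one step beyond eyeballing the output, you can parse it and assert on specific series. This is a deliberately simplified sketch: it assumes sample lines carry no trailing timestamp, and the helper names and sample text are hypothetical. For real validation, a full parser from a Prometheus client library is a safer choice.

```python
import urllib.request


def fetch_metrics(url="http://localhost:8000/metrics"):
    """Download the raw exposition text (requires the exporter to be running)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def parse_metrics(text):
    """Parse sample lines into {series: value}, skipping comments and blanks.

    Simplification: assumes each sample line is "<series> <value>" with no
    trailing timestamp.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


# Made-up sample in the shape produced by the app from Step 4
sample_text = """\
# HELP app_requests_total Total requests
# TYPE app_requests_total counter
app_requests_total{method="GET",endpoint="/api/users"} 42.0
# HELP app_active_users Currently active users
# TYPE app_active_users gauge
app_active_users 0.0
"""
parsed = parse_metrics(sample_text)
print(parsed)
```

With the exporter running, parse_metrics(fetch_metrics()) returns the live values, which makes it straightforward to check that a counter moves after you send a test request.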
