Logs & Monitoring -- ELK Stack, Loki, Prometheus & Grafana
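Centralized logging and monitoring with the ELK Stack, Grafana Loki, Prometheus, and Grafana provide Observability across infrastructure, applications, and user-facing services through logs and metrics. Traces, the third signal, require a separate tracing backend (such as Grafana Tempo or Jaeger), which this tutorial does not cover.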
What You'll Learn
In this tutorial, you will learn how to deploy and configure a full Observability stack combining Elasticsearch-Logstash-Kibana for logs, Prometheus for metrics, Grafana Loki for log aggregation, and Grafana for unified dashboards and alerting.
Why It Matters
When a production system goes down, you need answers fast. Logs tell you what happened, metrics tell you when it started, and traces tell you where the failure originated. Without a centralized monitoring stack, debugging is like finding a needle in a haystack blindfolded.
Real-World Use
Durga Antivirus Pro uses a four-layer Observability stack: Filebeat ships scan logs to Loki, Prometheus collects scan throughput metrics, Grafana visualizes both, and Alertmanager pages the on-call engineer if scan failure rates exceed 1% in any 5-minute window.
Observability Stack Architecture
```mermaid
flowchart TD
    A[Application Logs] --> B["Log shipper (Filebeat / Promtail)"]
    B --> C{Log Storage}
    C -->|Full-text search| D[Elasticsearch]
    C -->|Cost-effective| E[Grafana Loki]
    D --> F[Kibana]
    E --> G[Grafana]
    H[System Metrics] --> I[Prometheus Node Exporter]
    I --> J[Prometheus Server]
    J --> G
    J --> K[Alertmanager]
    K --> L["PagerDuty / Slack / Email"]
```
Deploying the ELK Stack with Docker Compose
Run Elasticsearch, Logstash, and Kibana in Docker:
```yaml
# docker-compose.yml: single-node ELK for local development
version: "3.8"
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.12.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false   # dev only: disables TLS and authentication
      - ES_JAVA_OPTS=-Xms1g -Xmx1g     # pin the heap; raise it for production
    ports:
      - "9200:9200"
    volumes:
      - es_data:/usr/share/elasticsearch/data

  logstash:
    image: docker.elastic.co/logstash/logstash:8.12.0
    ports:
      - "5000:5000"                    # Beats input defined in logstash.conf
    volumes:
      - ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf:ro
    depends_on:
      - elasticsearch

  kibana:
    image: docker.elastic.co/kibana/kibana:8.12.0
    ports:
      - "5601:5601"
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    depends_on:
      - elasticsearch

volumes:
  es_data:
```
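Expected behavior: Elasticsearch serves its REST API on port 9200 and Kibana its UI on port 5601. Logstash accepts Beats connections (for example, from Filebeat) on port 5000 and forwards parsed events to Elasticsearch. Because the Beats default port is 5044, point Filebeat's output.logstash.hosts setting at port 5000.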
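The depends_on entries only control start order, so Elasticsearch may still be initializing when Logstash and Kibana come up. A minimal Python sketch of a readiness check that polls the _cluster/health endpoint, assuming the Compose stack runs locally with port 9200 published and security disabled as in the file above; wait_for_elasticsearch and is_ready are hypothetical helper names.

```python
import json
import time
import urllib.error
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: port 9200 published as in the Compose file


def is_ready(health_json: str) -> bool:
    """Return True if a _cluster/health response reports green or yellow."""
    # A single-node cluster reports "yellow" when indices have replica shards
    # that cannot be assigned; it is still usable for development.
    status = json.loads(health_json).get("status")
    return status in ("green", "yellow")


def wait_for_elasticsearch(url: str = ES_URL, timeout_s: float = 120.0) -> bool:
    """Poll GET /_cluster/health until the cluster is ready or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{url}/_cluster/health", timeout=5) as resp:
                if is_ready(resp.read().decode("utf-8")):
                    return True
        except (urllib.error.URLError, ConnectionError, TimeoutError):
            pass  # container still starting; try again
        time.sleep(2)
    return False
```

Call wait_for_elasticsearch() before starting Filebeat or loading sample data; it returns False if the cluster does not report green or yellow within two minutes.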
Logstash Configuration
Parse incoming logs with Grok patterns:
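```
input {
  beats {
    port => 5000
  }
}

filter {
  grok {
    match => {
      "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:message_text}"
    }
    # Applied only when the pattern matches, so unparsed lines keep their raw text
    # (tagged _grokparsefailure) instead of being emptied.
    remove_field => ["message"]
  }
  date {
    match => ["timestamp", "ISO8601"]
    remove_field => ["timestamp"]   # @timestamp now holds the parsed value
  }
}

output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]
    index => "app-logs-%{+YYYY.MM.dd}"
  }
}
```
Expected behavior: Each matching line is split into level and message_text, its timestamp becomes the event's @timestamp, and the event is written to a daily index such as app-logs-2024.01.15. The raw message field is removed only when parsing succeeds; lines that do not match keep message and are tagged _grokparsefailure.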
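To make the filter section concrete, the sketch below models it in Python: it matches a line, converts the timestamp to UTC, and computes the daily index name that %{+YYYY.MM.dd} produces. It is an approximation under stated assumptions, not Logstash itself: the regexes are simplified stand-ins for TIMESTAMP_ISO8601 and LOGLEVEL, timestamps without an offset are assumed to be UTC, and the raw message is dropped only when the match succeeds (in Logstash, placing remove_field inside the grok block gives that behavior, because grok applies it only on a successful match). The process function is hypothetical.

```python
import re
from datetime import datetime, timezone

# Simplified stand-ins for the grok patterns TIMESTAMP_ISO8601, LOGLEVEL and GREEDYDATA.
# The real grok definitions accept more variants (for example, lowercase levels).
LINE_RE = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})?)"
    r" (?P<level>TRACE|DEBUG|INFO|NOTICE|WARN(?:ING)?|ERROR|CRIT(?:ICAL)?|FATAL)"
    r" (?P<message_text>.*)"
)


def process(message: str) -> dict:
    """Mimic the grok -> date -> index-name steps of the Logstash pipeline."""
    m = LINE_RE.search(message)  # grok is unanchored by default, like re.search
    if m is None:
        # On failure grok adds a tag and leaves the raw line untouched.
        return {"message": message, "tags": ["_grokparsefailure"]}
    ts = datetime.fromisoformat(m["timestamp"].replace("Z", "+00:00"))
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)  # assumption: naive timestamps are UTC
    ts = ts.astimezone(timezone.utc)
    return {
        "@timestamp": ts.isoformat(),
        "level": m["level"],
        "message_text": m["message_text"],
        "_index": f"app-logs-{ts:%Y.%m.%d}",  # %{+YYYY.MM.dd} is rendered in UTC
    }
```

Note how an offset timestamp late in the evening can land in the next day's index, because the index date comes from the UTC value.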
Querying Metrics with PromQL and Logs with LogQL
Compute the HTTP 5xx error rate with Prometheus and search error logs with Loki:
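```
# PromQL: percentage of requests answered with a 5xx status over the last 5 minutes
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

# LogQL: error lines from the web-server app. LogQL has no now() function; set the
# time range to "Last 1 hour" in Grafana, or pass --since=1h to logcli.
{app="web-server"} |= "ERROR"

# LogQL: number of error lines in the last hour
sum(count_over_time({app="web-server"} |= "ERROR" [1h]))
```
Expected output: The PromQL query returns one percentage value (for example, 0.5 means 0.5% of requests in the last 5 minutes returned a 5xx status). The first LogQL query returns the log lines from the web-server app that contain "ERROR" (a case-sensitive match) within the selected time range, and the second returns a single count of those lines over the past hour.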
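Outside Grafana, the "last hour" window is passed as start and end parameters of Loki's query_range HTTP endpoint rather than written into the LogQL expression. A minimal sketch that only builds the URL, assuming Loki listens on its default port 3100 on localhost; loki_range_url is a hypothetical helper.

```python
import time
from typing import Optional
from urllib.parse import urlencode

LOKI_URL = "http://localhost:3100"  # assumption: Loki's default HTTP port on localhost


def loki_range_url(query: str, since_s: int = 3600, limit: int = 1000,
                   now_ns: Optional[int] = None) -> str:
    """Build a /loki/api/v1/query_range URL covering the last `since_s` seconds."""
    end_ns = now_ns if now_ns is not None else time.time_ns()
    start_ns = end_ns - since_s * 1_000_000_000  # Loki accepts nanosecond Unix epochs
    params = {
        "query": query,
        "start": start_ns,
        "end": end_ns,
        "limit": limit,            # maximum number of log lines returned
        "direction": "backward",   # newest lines first
    }
    return f"{LOKI_URL}/loki/api/v1/query_range?{urlencode(params)}"
```

Fetching that URL with any HTTP client returns JSON whose data.result field holds the matching streams.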
Unified Grafana Dashboard
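Create a dashboard that pairs a Prometheus metrics panel with a Loki logs panel. The JSON model is simplified, and the datasource uid values are placeholders for the UIDs of your own Prometheus and Loki data sources:
```json
{
  "title": "Error Rate & Logs",
  "panels": [
    {
      "type": "timeseries",
      "title": "5xx error rate",
      "gridPos": { "x": 0, "y": 0, "w": 24, "h": 8 },
      "datasource": { "type": "prometheus", "uid": "PROMETHEUS_UID" },
      "targets": [
        { "refId": "A", "expr": "rate(http_requests_total{status=~\"5..\"}[5m])" }
      ]
    },
    {
      "type": "logs",
      "title": "Error logs",
      "gridPos": { "x": 0, "y": 8, "w": 24, "h": 10 },
      "datasource": { "type": "loki", "uid": "LOKI_UID" },
      "targets": [
        { "refId": "A", "expr": "{app=\"web-server\"} |= \"ERROR\"" }
      ]
    }
  ]
}
```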
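Expected behavior: The dashboard shows the error-rate graph above a stream of matching ERROR log lines (enable dashboard auto-refresh for live updates). Both panels share the dashboard time range, so dragging across a spike in the graph zooms the dashboard to that window and the logs panel re-queries for the same period.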
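To load the dashboard JSON model above without clicking through the UI, you can post it to Grafana's dashboard HTTP API (POST /api/dashboards/db). A minimal sketch that only builds the request, assuming Grafana on localhost:3000 and a service account token allowed to write dashboards; build_dashboard_request is a hypothetical helper.

```python
import json
import urllib.request

GRAFANA_URL = "http://localhost:3000"  # assumption: Grafana's default port on localhost


def build_dashboard_request(dashboard: dict, token: str) -> urllib.request.Request:
    """Prepare a POST /api/dashboards/db request that creates or updates a dashboard."""
    body = {
        "dashboard": {**dashboard, "id": None},  # id None: let Grafana assign one
        "overwrite": True,  # replace an existing dashboard with the same title or uid
    }
    return urllib.request.Request(
        f"{GRAFANA_URL}/api/dashboards/db",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # service account token (assumption)
        },
        method="POST",
    )
```

Send it with urllib.request.urlopen(build_dashboard_request(model, token)), where model is the parsed JSON model. With overwrite set to true, re-running the script updates the existing dashboard instead of returning a conflict error.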
Tool Comparison
| Feature | ELK Stack | Grafana Loki | Prometheus | Datadog |
|---|---|---|---|---|
| Primary data | Logs | Logs | Metrics | All three |
| Storage engine | Elasticsearch | Object store | TSDB | Proprietary |
| Query language | ES DSL / KQL | LogQL | PromQL | Proprietary |
| Retention cost | High | Low (object store) | Medium | High |
| Alerting | Elastic Alerting | Ruler | Alertmanager | Built-in |
| Self-hostable | Yes | Yes | Yes | No |
| Learning curve | Steep | Medium | Medium | Low |
Common Errors
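1. Elasticsearch Heap Exhaustion
Under heavy log volume, an undersized JVM heap causes long garbage-collection pauses, circuit-breaker errors, and eventually an unresponsive cluster. Older Elasticsearch releases defaulted to a 1 GB heap, and recent ones size it automatically from the memory they detect, so set it explicitly in production, for example ES_JAVA_OPTS="-Xms4g -Xmx4g". Keep the minimum and maximum equal and give the heap no more than half of the container's memory; the rest serves the filesystem cache.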
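The sizing rule reduces to a few lines of arithmetic. A rough sketch, assuming the common guidance of half the container's memory for the heap, capped at 26 GB to stay under the JVM's compressed-pointer threshold (Elastic's documentation describes 26 GB as safe on most systems; the exact cutoff depends on the JVM); es_java_opts is a hypothetical helper.

```python
def es_java_opts(container_mem_gb: int, cap_gb: int = 26) -> str:
    """Rule-of-thumb ES_JAVA_OPTS: half the container memory, capped, with -Xms == -Xmx."""
    if container_mem_gb < 2:
        # Half of less than 2 GB leaves under 1 GB of heap, too small for log ingestion.
        raise ValueError("give the container at least 2 GB of memory")
    heap_gb = min(container_mem_gb // 2, cap_gb)
    return f"-Xms{heap_gb}g -Xmx{heap_gb}g"


print(es_java_opts(8))    # -Xms4g -Xmx4g
print(es_java_opts(128))  # -Xms26g -Xmx26g
```

Matching -Xms and -Xmx avoids heap resizing at runtime, and capping the heap leaves the remaining memory to the operating system's page cache, which Elasticsearch relies on for fast segment reads.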
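2. Logstash Grok Pattern Not Matching
Grok patterns are case-sensitive regular expressions, so a single unexpected space, bracket, or timestamp format breaks the match. Events that fail are tagged _grokparsefailure; search for that tag in Kibana to find them. Test patterns against sample log lines in the Grok Debugger (Kibana Dev Tools) before deploying.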
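3. Prometheus Scrape Target Not Reachable
If Prometheus cannot reach a target, the Status > Targets page in the Prometheus UI marks it as down and shows the last scrape error. Check network policies, firewall rules, and whether the target actually serves /metrics, for example with curl http://target:9100/metrics (9100 is Node Exporter's default port).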
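A target can also answer on the right port but return something other than Prometheus text format. A minimal sketch that fetches an endpoint with a plain HTTP GET, like the curl check above, and lists the metric names it exposes, assuming Node Exporter on localhost:9100; fetch_metrics and metric_names are hypothetical helpers.

```python
import urllib.request


def fetch_metrics(url: str = "http://localhost:9100/metrics") -> str:
    """Fetch a scrape target's /metrics page (assumed Node Exporter address)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def metric_names(exposition: str) -> set:
    """Extract metric names from Prometheus text exposition format."""
    names = set()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and # HELP / # TYPE comments
        # The name ends at the first "{" (start of labels) or space (before the value).
        end = min((i for i in (line.find("{"), line.find(" ")) if i != -1),
                  default=len(line))
        names.add(line[:end])
    return names
```

An empty result from metric_names(fetch_metrics()) usually means the URL points at the wrong service, such as an HTML status page.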
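4. Too Many Labels on Loki Log Streams
Loki indexes only labels, and every unique combination of label values creates a separate stream. High-cardinality labels such as user IDs or client IPs therefore produce huge numbers of small streams, which bloats the index and slows queries. Use a few labels with bounded values (app, environment, level) and keep per-request values in the log line, where line filters or a json parser can match them at query time. Loki also enforces a limit on label names per stream (15 by default).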
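The effect of a high-cardinality label shows up clearly when you count streams, since Loki creates one stream per unique combination of label values. A small sketch with hypothetical sample entries: two apps and three levels yield 6 streams, while adding user_id as a label turns 1,000 entries into 1,000 streams. stream_count is an illustrative helper, not a Loki API.

```python
def stream_count(entries: list, label_names: list) -> int:
    """Count distinct streams: one per unique combination of the given label values."""
    return len({tuple(entry.get(name) for name in label_names) for entry in entries})


# Hypothetical sample: 1,000 log entries from 2 apps at 3 levels, each from a different user.
entries = [
    {"app": ("web", "api")[i % 2],
     "level": ("info", "warn", "error")[i % 3],
     "user_id": f"u{i}"}
    for i in range(1000)
]

low = stream_count(entries, ["app", "level"])              # bounded labels only
high = stream_count(entries, ["app", "level", "user_id"])  # user_id promoted to a label
print(low, high)  # 6 1000
```

Keeping user_id inside the log line instead leaves the stream count at 6, and a query such as {app="web"} |= "u42" still finds the user's entries.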
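5. Grafana Dashboard Permission Issues
In multi-team setups, configure dashboard permissions per folder. By default, a new folder grants View to the Viewer role and Edit to the Editor role, so without explicit folder permissions every team sees every dashboard, while users who hold only the Viewer role cannot edit shared ones.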
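Folder permissions can also be scripted. A minimal sketch that builds a request for Grafana's legacy folder-permissions endpoint (POST /api/folders/<uid>/permissions), where permission 1 means View and 2 means Edit. The folder UID, team IDs, localhost:3000 address, and token are placeholders, and folder_permissions_request is a hypothetical helper. This call replaces the folder's whole permission list, and newer Grafana releases also manage permissions through an access-control API, so check the API documentation for your version.

```python
import json
import urllib.request

GRAFANA_URL = "http://localhost:3000"  # assumption: Grafana's default port on localhost


def folder_permissions_request(folder_uid: str, editor_team_id: int,
                               viewer_team_ids: list, token: str) -> urllib.request.Request:
    """Grant Edit to the owning team and View to other listed teams, nobody else."""
    items = [{"teamId": editor_team_id, "permission": 2}]              # 2 = Edit
    items += [{"teamId": t, "permission": 1} for t in viewer_team_ids]  # 1 = View
    return urllib.request.Request(
        f"{GRAFANA_URL}/api/folders/{folder_uid}/permissions",
        data=json.dumps({"items": items}).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {token}"},
        method="POST",
    )
```

Because the request omits the default Viewer and Editor role entries, members of other teams lose access to the folder, while Grafana admins keep it.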
Practice Questions
1. What is the difference between Prometheus and Loki? Prometheus stores and queries metrics (numeric time-series data). Loki stores and queries logs (text-based event data). They are complementary, not competing tools.
2. How does Logstash parse incoming log lines? Logstash uses Grok patterns in filter plugins to match log formats and extract structured fields like timestamp, log level, and message content.
3. Why is Loki more cost-effective than Elasticsearch for logs? Loki stores logs compressed in object storage and only indexes labels (metadata), not the full log content. Elasticsearch indexes full text, requiring more storage and memory.
4. What happens when Prometheus fires an alert to Alertmanager? Alertmanager deduplicates and groups incoming alerts, then routes each group to its configured receivers (Slack, PagerDuty, email). Silences and inhibition rules suppress notifications that are already known or caused by a related alert, which reduces alert fatigue.
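The deduplicate-then-group step can be illustrated with a toy model. This is not Alertmanager's implementation: it treats alerts with identical label sets as duplicates and batches the rest by the labels listed in group_by (the same idea as the group_by setting of an Alertmanager route), ignoring timing settings such as group_wait and repeat_interval. group_alerts and the sample alerts are hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts: list, group_by: list) -> dict:
    """Toy model: drop duplicate alerts, then batch the rest by the group_by labels."""
    seen = set()
    groups = defaultdict(list)
    for labels in alerts:
        fingerprint = tuple(sorted(labels.items()))  # identical label sets = same alert
        if fingerprint in seen:
            continue  # duplicate, e.g. re-sent on the next evaluation
        seen.add(fingerprint)
        key = tuple((name, labels.get(name, "")) for name in group_by)
        groups[key].append(labels)
    return dict(groups)


alerts = [
    {"alertname": "HighErrorRate", "service": "web", "instance": "a"},
    {"alertname": "HighErrorRate", "service": "web", "instance": "b"},
    {"alertname": "HighErrorRate", "service": "web", "instance": "a"},  # duplicate
    {"alertname": "HighLatency", "service": "api", "instance": "c"},
]
groups = group_alerts(alerts, ["alertname", "service"])
```

Four incoming alerts become two notifications: one for HighErrorRate on web covering instances a and b, and one for HighLatency on api.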
5. Challenge: Deploy a complete Observability stack with ELK for log search, Loki for cost-effective log retention, Prometheus for metrics, and Grafana for unified dashboards. Feed sample application logs through Filebeat and verify that a Grafana dashboard shows both metrics and log panels with cross-panel filtering.
Mini Project
Build a production monitoring dashboard for a web application that combines Prometheus metrics (request rate, error rate, latency percentiles), Loki logs (ERROR-level entries with context), and Elasticsearch full-text search for historical log analysis. Configure Alertmanager to notify via Slack when error rate exceeds 5% for 5 consecutive minutes or when p99 latency exceeds 2 seconds.
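The "for 5 consecutive minutes" condition corresponds to the for field of a Prometheus alerting rule: the alert stays pending while the expression is true and fires only once it has been true for the whole duration, while a rule without for fires on the first breaching evaluation. A small simulation of that behavior, assuming a 60-second evaluation interval (Prometheus's default evaluation_interval is 1m) and hypothetical sample values; first_firing_time is an illustrative helper, not part of any Prometheus library.

```python
from typing import Optional


def first_firing_time(values: list, threshold: float, interval_s: int = 60,
                      for_s: int = 300) -> Optional[int]:
    """Simulate an alerting rule `expr > threshold` with `for: for_s` seconds.

    values[i] is the expression's value at evaluation time i * interval_s.
    Returns the first evaluation time (seconds) at which the alert fires, or None.
    """
    active_since = None
    for i, value in enumerate(values):
        t = i * interval_s
        if value > threshold:
            if active_since is None:
                active_since = t  # alert enters the "pending" state
            if t - active_since >= for_s:
                return t  # true for the whole `for` duration: alert is "firing"
        else:
            active_since = None  # condition cleared: the pending alert is reset
    return None


# Error-rate rule (> 5% for 5 minutes): first breach at t=60, fires at t=360.
print(first_firing_time([1, 6, 7, 8, 9, 10, 11], 5.0))          # 360
# p99 latency rule (> 2 s, no for clause): fires on the first breach.
print(first_firing_time([1.2, 2.5], 2.0, for_s=0))               # 60
```

In a real rule file, the error-rate alert would use the PromQL ratio shown earlier with > 5 appended and for: 5m, and a dip below the threshold at any evaluation restarts the five-minute clock.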
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.