Distributed Tracing with Jaeger: Monitor Microservices
This tutorial covers distributed tracing with Jaeger for monitoring microservices: the key concepts, practical examples, and best practices you need to apply it effectively.
What You Will Learn
This tutorial teaches you how to deploy Jaeger for distributed tracing, instrument a microservice application, analyze trace waterfalls, and identify latency bottlenecks using Jaeger UI and its deep-dive features.
Why It Matters
In a monolith, a single profiler shows you where time is spent. In a microservices architecture, a single request can span ten or more services, each adding latency. Jaeger shows you the entire request path in a single view so you can pinpoint exactly which service is slow.
Real-World Use
The DodaTech file sync service was experiencing intermittent slowdowns. Jaeger traces revealed that one specific service was making 15 separate Redis calls per request instead of batching them. The fix reduced request latency from 1.2s to 180ms.
Jaeger is an open-source distributed tracing system originally built by Uber. It is a Cloud Native Computing Foundation graduated project. Jaeger supports the OpenTelemetry protocol (OTLP) for trace ingestion and provides a rich UI for trace search, comparison, and analysis. It can run as a single all-in-one binary or as a scalable production deployment with separate collector, query, and storage components.
Prerequisites
- Docker and Docker Compose installed
- Python 3.8+ or Node.js 18+
- Basic understanding of OpenTelemetry tracing
- A Kubernetes cluster (optional, for production deployment)
Step-by-Step Tutorial
Step 1: Deploy Jaeger All-in-One
docker run -d --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  -p 5778:5778 \
  jaegertracing/all-in-one:1.57
Expected output: Jaeger UI at http://localhost:16686. OTLP gRPC on port 4317, OTLP HTTP on port 4318.
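Before instrumenting anything, it can help to confirm that the published ports actually accept connections. The following is a minimal sketch using only the Python standard library; the helper names `port_open` and `check_jaeger` are illustrative and not part of Jaeger, and the port list simply mirrors the docker run command above.

```python
import socket

# Ports published by the docker run command in Step 1.
JAEGER_PORTS = {"Jaeger UI": 16686, "OTLP gRPC": 4317, "OTLP HTTP": 4318}


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_jaeger(host: str = "localhost") -> dict:
    """Map each label in JAEGER_PORTS to whether its port accepts connections."""
    return {label: port_open(host, port) for label, port in JAEGER_PORTS.items()}
```

Calling `check_jaeger()` returns a dictionary such as `{"Jaeger UI": True, ...}`. A `False` entry only means the TCP port is closed; it does not distinguish a stopped container from a missing `-p` mapping.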
Step 2: Create a Multi-Service Application
Create three files. First, gateway.py:
from flask import Flask
import requests
import time
import random

app = Flask(__name__)

@app.route("/process")
def process():
    # Simulate 10-50 ms of local work before calling downstream services
    time.sleep(random.uniform(0.01, 0.05))
    # Call the validator and enricher one after the other
    r1 = requests.get("http://localhost:5001/validate")
    r2 = requests.get("http://localhost:5002/enrich")
    return {"gateway": "ok", "validate": r1.json(), "enrich": r2.json()}

if __name__ == "__main__":
    app.run(port=5000)
Second, validator.py:
from flask import Flask
import time
import random

app = Flask(__name__)

@app.route("/validate")
def validate():
    # Simulate 50-300 ms of validation work
    time.sleep(random.uniform(0.05, 0.3))
    return {"valid": True, "score": random.randint(0, 100)}

if __name__ == "__main__":
    app.run(port=5001)
Third, enricher.py:
from flask import Flask
import time
import random

app = Flask(__name__)

@app.route("/enrich")
def enrich():
    # Simulate 100-800 ms of enrichment work (the slowest service)
    time.sleep(random.uniform(0.1, 0.8))
    return {"enriched": True, "tags": ["user", "premium", "beta"]}

if __name__ == "__main__":
    app.run(port=5002)
Step 3: Instrument All Services with OpenTelemetry
Install dependencies:
pip install opentelemetry-api opentelemetry-sdk \
  opentelemetry-instrumentation-flask \
  opentelemetry-instrumentation-requests \
  opentelemetry-exporter-otlp-proto-grpc
Create tracing.py (shared by all services):
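from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
import os

# Name shown in the Jaeger UI; set per process via SERVICE_NAME (see Step 4)
service_name = os.getenv("SERVICE_NAME", "unknown")

# Attach the name as the service.name resource attribute. Without it,
# every service is reported under the SDK's default "unknown_service" name.
provider = TracerProvider(
    resource=Resource.create({"service.name": service_name})
)
# Batch spans and export them to Jaeger over OTLP gRPC
processor = BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://localhost:4317")
)
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

def instrument_app(app):
    # Create server spans for incoming Flask requests
    FlaskInstrumentor().instrument_app(app)
    # Create client spans and inject trace context into outgoing requests
    RequestsInstrumentor().instrument()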
Update each service to import and call instrument_app(app).
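The two instrumentors are what link spans across services: the requests instrumentation injects the current trace context into outgoing HTTP headers, and the Flask instrumentation extracts it on the receiving side. Assuming the OpenTelemetry default propagator (W3C Trace Context), that context travels in a `traceparent` header. The sketch below parses the version 00 layout of that header with the standard library only; `TraceParent` and `parse_traceparent` are illustrative names, not OpenTelemetry APIs.

```python
import re
from typing import NamedTuple, Optional

# W3C Trace Context, version 00 layout: version-traceid-parentid-flags,
# all lowercase hex with fixed field widths.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Parse a traceparent header value; return None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the specification.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))
```

If two services report the same `trace_id`, Jaeger can stitch their spans into one trace. When debugging disconnected spans, logging the incoming `traceparent` value in a downstream service is a quick way to check whether the header arrived at all.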
Step 4: Set Service Names via Environment Variables
SERVICE_NAME=gateway python gateway.py &
SERVICE_NAME=validator python validator.py &
SERVICE_NAME=enricher python enricher.py &
Step 5: Generate Load
for i in $(seq 1 50); do
  curl http://localhost:5000/process
  sleep 0.5
done
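If you would rather drive the load from Python and see client-side latencies alongside the traces, the shell loop can be approximated as follows. This is a sketch using only the standard library; the function names are illustrative, and the URL is the gateway endpoint from Step 2.

```python
import statistics
import time
import urllib.request
from typing import List


def timed_get(url: str, timeout: float = 5.0) -> float:
    """Issue one GET request and return its wall-clock latency in seconds."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()
    return time.perf_counter() - start


def generate_load(url: str, count: int = 50, pause: float = 0.5) -> List[float]:
    """Send `count` sequential requests, pausing between them like the shell loop."""
    latencies = []
    for _ in range(count):
        latencies.append(timed_get(url))
        time.sleep(pause)
    return latencies


def summarize(latencies: List[float]) -> dict:
    """Return min, median, and max latency in milliseconds, rounded to 1 decimal."""
    ms = [value * 1000 for value in latencies]
    return {
        "min_ms": round(min(ms), 1),
        "median_ms": round(statistics.median(ms), 1),
        "max_ms": round(max(ms), 1),
    }
```

With the three services running, `summarize(generate_load("http://localhost:5000/process"))` gives a rough client-side view of the spread. Because the enricher sleeps for up to 0.8 seconds, expect a wide gap between the minimum and maximum values.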
Step 6: Search Traces in Jaeger UI
- Open http://localhost:16686
- In the Search panel, select gateway as the Service
- Click Find Traces
- Click on any trace to see the waterfall
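The same search can be scripted against the HTTP JSON endpoint that the Jaeger UI itself calls. Treat this as an assumption-laden sketch: the `/api/traces` path and its `service`, `limit`, `start`, and `end` parameters (timestamps in microseconds) are the UI's internal API rather than a stable public contract, and the function names are illustrative.

```python
import json
import time
import urllib.parse
import urllib.request
from typing import Optional


def build_search_url(
    base_url: str,
    service: str,
    limit: int = 20,
    lookback_s: int = 3600,
    now_s: Optional[float] = None,
) -> str:
    """Build a trace-search URL covering the last `lookback_s` seconds."""
    end_us = int((now_s if now_s is not None else time.time()) * 1_000_000)
    start_us = end_us - lookback_s * 1_000_000
    query = urllib.parse.urlencode(
        {"service": service, "limit": limit, "start": start_us, "end": end_us}
    )
    return f"{base_url.rstrip('/')}/api/traces?{query}"


def fetch_traces(base_url: str, service: str, limit: int = 20) -> list:
    """Fetch recent traces for a service; returns the list under the 'data' key."""
    url = build_search_url(base_url, service, limit=limit)
    with urllib.request.urlopen(url, timeout=10) as response:
        payload = json.load(response)
    return payload.get("data", [])
```

For example, `fetch_traces("http://localhost:16686", "gateway")` should return one entry per trace found in the last hour, assuming the all-in-one container from Step 1 is running.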
Step 7: Analyze the Trace Waterfall
In the trace detail view, look at:
- Service graph: A topology view showing service dependencies
- Span timeline: Each span's duration shown as a horizontal bar
- Tags and logs: Click on any span to see its attributes and events
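Reading a waterfall mostly comes down to asking where each span spends time that its children do not account for. The sketch below computes that "self time" from a trace in the JSON shape Jaeger exports (`spans`, `processes`, durations in microseconds). The sample trace is hand-written for illustration, including its span IDs and operation names, and the calculation assumes child spans run sequentially, as they do in gateway.py.

```python
from collections import defaultdict

# Hand-written sample in the shape of a Jaeger trace export; not real data.
SAMPLE_TRACE = {
    "spans": [
        {"spanID": "s1", "operationName": "/process", "duration": 620000,
         "processID": "p1", "references": []},
        {"spanID": "s2", "operationName": "/validate", "duration": 150000,
         "processID": "p2",
         "references": [{"refType": "CHILD_OF", "spanID": "s1"}]},
        {"spanID": "s3", "operationName": "/enrich", "duration": 420000,
         "processID": "p3",
         "references": [{"refType": "CHILD_OF", "spanID": "s1"}]},
    ],
    "processes": {
        "p1": {"serviceName": "gateway"},
        "p2": {"serviceName": "validator"},
        "p3": {"serviceName": "enricher"},
    },
}


def self_times(trace: dict) -> list:
    """Return (service, operation, self_time_ms) rows, slowest first."""
    # Sum the durations of direct children for every parent span.
    child_total = defaultdict(int)
    for span in trace["spans"]:
        for ref in span.get("references", []):
            if ref.get("refType") == "CHILD_OF":
                child_total[ref["spanID"]] += span["duration"]
    rows = []
    for span in trace["spans"]:
        service = trace["processes"][span["processID"]]["serviceName"]
        # Clamp at zero in case children overlap or exceed the parent.
        self_us = max(span["duration"] - child_total[span["spanID"]], 0)
        rows.append((service, span["operationName"], self_us / 1000))
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

For the sample above, the enricher tops the list with 420 ms of self time, while the gateway accounts for only 50 ms once its two downstream calls are subtracted. That matches what the span timeline shows visually: the longest bar that is not explained by child bars is the bottleneck.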
Step 8: Compare Traces
Jaeger can compare two traces side-by-side:
- Select two traces from the search results
- Click the Compare button
- Review how the slow trace differs from the normal one
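A rough programmatic analogue of the comparison is to diff total span duration per service and operation between two traces. This is a sketch of the idea, not how the Jaeger UI computes its comparison view; the helper names and the two sample traces are hypothetical, and the trace shape follows Jaeger's JSON export (durations in microseconds).

```python
def make_trace(durations_us: dict) -> dict:
    """Build a tiny trace from {service: (operation, duration_us)} for the demo."""
    spans, processes = [], {}
    for index, (service, (operation, duration)) in enumerate(durations_us.items()):
        process_id = f"p{index}"
        processes[process_id] = {"serviceName": service}
        spans.append({"operationName": operation, "duration": duration,
                      "processID": process_id})
    return {"spans": spans, "processes": processes}


def durations_by_operation(trace: dict) -> dict:
    """Sum span durations (microseconds) per (service, operation) pair."""
    totals = {}
    for span in trace["spans"]:
        service = trace["processes"][span["processID"]]["serviceName"]
        key = (service, span["operationName"])
        totals[key] = totals.get(key, 0) + span["duration"]
    return totals


def compare_traces(baseline: dict, candidate: dict) -> list:
    """Return ((service, operation), delta_ms) rows, largest change first."""
    base = durations_by_operation(baseline)
    cand = durations_by_operation(candidate)
    rows = [
        (key, (cand.get(key, 0) - base.get(key, 0)) / 1000)
        for key in sorted(set(base) | set(cand))
    ]
    return sorted(rows, key=lambda row: abs(row[1]), reverse=True)


# Hypothetical normal and slow requests through the three services.
NORMAL = make_trace({"gateway": ("/process", 300000),
                     "validator": ("/validate", 100000),
                     "enricher": ("/enrich", 150000)})
SLOW = make_trace({"gateway": ("/process", 900000),
                   "validator": ("/validate", 110000),
                   "enricher": ("/enrich", 740000)})
```

Here `compare_traces(NORMAL, SLOW)` ranks the gateway first (+600 ms) and the enricher second (+590 ms). Parent spans include their children's time, so the gateway's increase is almost entirely the enricher's; the validator barely changes.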
Step 9: Deploy Jaeger Production Architecture
services:
  jaeger-collector:
    image: jaegertracing/jaeger-collector:1.57
    environment:
      - SPAN_STORAGE_TYPE=elasticsearch
      # Points at the elasticsearch service defined below
      - ES_SERVER_URLS=http://elasticsearch:9200
    ports:
      - "4317:4317"
      - "14250:14250"
  jaeger-query:
    image: jaegertracing/jaeger-query:1.57
    environment:
      - SPAN_STORAGE_TYPE=elasticsearch
      - ES_SERVER_URLS=http://elasticsearch:9200
    ports:
      - "16686:16686"
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.13.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
Learning Path
flowchart LR
A[Deploy Jaeger] --> B[Instrument Services]
B --> C[Generate Traffic]
C --> D[Search Traces]
D --> E[Analyze Waterfall]
E --> F[Find Bottlenecks]
E -.-> G[Compare Traces]
E -.-> H[Service Dependencies]
style A fill:#4a90d9,color:#fff
style F fill:#e67e22,color:#fff
Common Errors
- No traces appear in Jaeger -- The OTLP endpoint is unreachable. Verify that the Jaeger collector is listening on port 4317 with docker logs jaeger.
- Traces appear but spans are disconnected -- Context propagation is not working. Ensure opentelemetry-instrumentation-requests is installed on the calling service.
- Jaeger UI shows "No data" for service dropdown -- The index is empty. It takes a few seconds for traces to be indexed after ingestion.
- Elasticsearch storage fails to connect -- The environment variable SPAN_STORAGE_TYPE is not set to elasticsearch on both collector and query services.
- Sampling rate is too high, causing performance issues -- Set OTEL_TRACES_SAMPLER=parentbased_traceidratio and OTEL_TRACES_SAMPLER_ARG=0.1 to sample only 10% of requests.
- Trace shows only one service -- Downstream services are not instrumented or the instrumentation is not applied. Check each service's startup logs.
- gRPC endpoint refuses connection -- The Jaeger all-in-one image must have COLLECTOR_OTLP_ENABLED=true to accept OTLP gRPC traffic.
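To build intuition for what parentbased_traceidratio with an argument of 0.1 does, here is a simplified illustration of the decision logic. It is a sketch of the general technique (compare the low 64 bits of the trace ID against a threshold, and defer to the parent's decision when there is one), not the OpenTelemetry SDK's actual code; both function names are illustrative.

```python
from typing import Optional


def ratio_sample(trace_id: int, ratio: float) -> bool:
    """Decide deterministically from the trace ID whether to sample."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    # Threshold over the 64-bit space: ratio 0.1 keeps roughly the lowest 10%.
    bound = round(ratio * (1 << 64))
    return (trace_id & ((1 << 64) - 1)) < bound


def parent_based_sample(
    parent_sampled: Optional[bool], trace_id: int, ratio: float
) -> bool:
    """Follow the parent's decision if there is one; otherwise use the ratio."""
    if parent_sampled is not None:
        return parent_sampled
    return ratio_sample(trace_id, ratio)
```

Because the decision depends only on the trace ID, every service that starts from the same ID reaches the same verdict, and the parent-based wrapper ensures that downstream services keep or drop a trace as a whole instead of recording partial traces.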
Practice Questions
What is a trace in Jaeger? Answer: A trace is the complete path of a single request through a distributed system, composed of multiple spans.
How does Jaeger store trace data? Answer: Jaeger supports in-memory storage (the all-in-one default), Elasticsearch, and Cassandra as storage backends. Kafka can sit between the collectors and storage as an intermediate buffer.
What is the purpose of the Jaeger Query service? Answer: The Query service provides an API and web UI for searching, retrieving, and visualizing traces from the storage backend.
How does sampling affect trace collection? Answer: Sampling reduces the number of collected traces to control storage costs and performance overhead while maintaining statistical significance.
What is the difference between Jaeger all-in-one and production deployment? Answer: All-in-one combines agent, collector, query, and UI in a single process for development. Production deployment separates them for scalability and reliability.
Challenge
Deploy Jaeger with Elasticsearch storage backend. Build a four-service application (API gateway, user service, order service, payment service) where each service makes calls to the next. Instrument all services with OpenTelemetry and ensure context propagation works end-to-end. Generate 200 requests and verify all traces appear in Jaeger with the correct span hierarchy. Add a simulated 2-second delay to the payment service and verify that Jaeger's trace waterfall clearly identifies it as the bottleneck. Use Jaeger's compare feature to show the difference between a normal request and a slow request. Export the service dependency graph.