Jaeger: Distributed Tracing Guide
In this tutorial, you'll learn about Jaeger. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.
Jaeger is an open-source distributed tracing platform, originally built by Uber, for monitoring and troubleshooting microservice-based distributed systems. It tracks requests as they propagate across service boundaries.
What You'll Learn
Why It Matters
In a monolithic application, a slow request is easy to trace: you check one code path. In a microservice architecture, a single user request may traverse 20 or more services, each adding latency. Distributed tracing follows that request across all services, showing exactly where time is spent. DodaTech reduced mean time to resolution for performance issues by 70% after adopting Jaeger across all microservices.
Real-World Use
When DodaZIP users reported slow file uploads, the SRE team opened Jaeger and found a trace showing the upload service spending 4.5 seconds waiting on a virus scan service. The scan service was waiting on a database query with a missing index. The trace showed the exact database call, down to the SQL statement.
flowchart LR
A[User Request] --> B[API Gateway]
B --> C[Auth Service]
B --> D[Upload Service]
D --> E[Virus Scan]
D --> F[Storage Service]
E --> G[DB Query]
F --> H[S3 Upload]
G --> I[Trace: 4.5s total]
G --> J[Span: DB query - 3.8s]
style I fill:#60C0E0,color:#fff
style J fill:#FF6B6B,color:#fff
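To make the scenario above concrete, here is a minimal Python sketch of a trace modeled as a tree of spans, scanned for the span with the most self time (its own duration minus the time spent in its children). The 4.5 s and 3.8 s figures come from the scenario; the span names, the `Span` class, and the remaining durations are hypothetical and chosen only for illustration. This is not how Jaeger stores or analyzes traces internally.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One unit of work in a trace (names and timings here are illustrative)."""
    name: str
    duration_s: float
    children: list = field(default_factory=list)

    def self_time(self) -> float:
        # Time spent in this span itself, excluding time spent in child spans.
        # Assumes children run one after another, which is a simplification.
        return self.duration_s - sum(c.duration_s for c in self.children)


def walk(span):
    """Yield the span and all of its descendants, depth first."""
    yield span
    for child in span.children:
        yield from walk(child)


def slowest_by_self_time(root):
    """Return the span that contributes the most time of its own."""
    return max(walk(root), key=lambda s: s.self_time())


# Hypothetical trace shaped like the upload scenario.
db_query = Span("db-query", 3.8)
virus_scan = Span("virus-scan", 4.5, [db_query])
storage = Span("storage-upload", 0.3)
upload = Span("upload-service", 4.9, [virus_scan, storage])

bottleneck = slowest_by_self_time(upload)
print(bottleneck.name, bottleneck.self_time())  # db-query 3.8
```

Ranking by self time rather than total duration is the design choice that matters here: the virus scan span is the longest child, but almost all of its time is spent waiting on the database query beneath it.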
Prerequisites: Basic understanding of microservices and distributed systems. Familiarity with Docker for deployment.
Installation
# Deploy Jaeger all-in-one (for evaluation only; traces are kept in memory)
docker run -d \
  --name jaeger \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  -p 5778:5778 \
  -p 14269:14269 \
  jaegertracing/all-in-one:1.60
# Expected output:
# Container ID returned
# Verify using the health check on the admin port (14269)
curl http://localhost:14269/
# Expected output:
# A JSON document containing "status":"Server available"
# Access UI at http://localhost:16686
Production Deployment
# docker-compose.yml
version: '3.8'
services:
  collector:
    image: jaegertracing/jaeger-collector:1.60
    environment:
      - SPAN_STORAGE_TYPE=elasticsearch
      - ES_SERVER_URLS=http://elasticsearch:9200
      - COLLECTOR_OTLP_ENABLED=true
    ports:
      - "4317:4317"
      - "4318:4318"
  query:
    image: jaegertracing/jaeger-query:1.60
    environment:
      - SPAN_STORAGE_TYPE=elasticsearch
      - ES_SERVER_URLS=http://elasticsearch:9200
    ports:
      - "16686:16686"
  # Optional: only needed for legacy Jaeger clients. OpenTelemetry SDKs
  # send OTLP directly to the collector on 4317/4318.
  agent:
    image: jaegertracing/jaeger-agent:1.60
    environment:
      - REPORTER_GRPC_HOST_PORT=collector:14250
    ports:
      - "5778:5778"
      - "6831:6831/udp"
      - "6832:6832/udp"
    depends_on:
      - collector
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.14.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - ES_JAVA_OPTS=-Xms1g -Xmx1g
    volumes:
      - es_data:/usr/share/elasticsearch/data
# Named volumes must be declared at the top level
volumes:
  es_data:
OpenTelemetry Instrumentation
// Node.js application with OpenTelemetry
// Written against the 1.x OpenTelemetry JS SDK. In 2.x, new Resource() is
// replaced by resourceFromAttributes() and span processors are passed to the
// provider constructor instead of addSpanProcessor().
const { SpanStatusCode } = require('@opentelemetry/api');
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { Resource } = require('@opentelemetry/resources');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');

const provider = new NodeTracerProvider({
  resource: new Resource({
    'service.name': 'user-service',
    'service.version': '2.5.0',
    'deployment.environment': 'production',
  }),
});

// Export spans over OTLP/gRPC; adjust the hostname to your collector
const exporter = new OTLPTraceExporter({
  url: 'http://jaeger-collector:4317',
});
provider.addSpanProcessor(new BatchSpanProcessor(exporter));
provider.register();

// Create a custom span
const tracer = provider.getTracer('user-service-handler');

async function handleRequest(req, res) {
  const span = tracer.startSpan('POST /users', {
    attributes: {
      'http.method': 'POST',
      'http.route': '/users',
    },
  });
  try {
    const result = await createUser(req.body);
    span.setAttribute('user.id', result.id);
    span.setStatus({ code: SpanStatusCode.OK });
  } catch (error) {
    // Mark the span as failed so the trace shows the error in Jaeger
    span.setStatus({
      code: SpanStatusCode.ERROR,
      message: error.message,
    });
    span.recordException(error);
    throw error;
  } finally {
    span.end();
  }
}

// Automatic instrumentation for Express
// Register these before the application loads http, express, or mongodb,
// otherwise the modules are not patched.
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { MongoDBInstrumentation } = require('@opentelemetry/instrumentation-mongodb');

registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
    new MongoDBInstrumentation(),
  ],
});
# Python application with OpenTelemetry
from flask import Flask
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# service.name determines how the service is listed in Jaeger
provider = TracerProvider(resource=Resource.create({"service.name": "user-service"}))
processor = BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://jaeger-collector:4317")
)
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

# Auto-instrument Flask and outgoing requests calls
app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()
Trace Analysis
# Jaeger API queries
# Search traces by service
curl "http://localhost:16686/api/traces?service=user-service&limit=10"
# Search by tags
curl "http://localhost:16686/api/traces?service=user-service&tags=%7B%22http.status_code%22%3A%22500%22%7D"
# Get trace by ID
curl "http://localhost:16686/api/traces/abc123def456"
# Get service list
curl "http://localhost:16686/api/services"
# Expected output:
# {"data":["api-gateway","user-service","auth-service","postgres"],"total":4}
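The tag search above passes a JSON object as a percent-encoded query parameter, which is easy to get wrong by hand. The following Python sketch builds the same URL programmatically. The helper name `build_trace_search_url` and the `JAEGER_QUERY_URL` constant are hypothetical. The `/api/traces` endpoint is the one the Jaeger UI itself calls, so treat it as subject to change between versions.

```python
import json
from urllib.parse import urlencode

JAEGER_QUERY_URL = "http://localhost:16686"  # assumed address of the query service


def build_trace_search_url(service, tags=None, limit=None):
    """Build a trace search URL like the curl examples above."""
    params = {"service": service}
    if tags:
        # The tags filter is a compact JSON object; urlencode percent-encodes it.
        params["tags"] = json.dumps(tags, separators=(",", ":"))
    if limit is not None:
        params["limit"] = str(limit)
    return f"{JAEGER_QUERY_URL}/api/traces?{urlencode(params)}"


url = build_trace_search_url("user-service", tags={"http.status_code": "500"})
print(url)
# http://localhost:16686/api/traces?service=user-service&tags=%7B%22http.status_code%22%3A%22500%22%7D
```

The resulting string can be passed to curl or any HTTP client.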
Service Performance Monitoring
# Jaeger UI > Service Performance Monitoring
# View metrics per service:
# - Red items: error rate per operation
# - Yellow: high latency operations
# - Green: healthy operations
# Key metrics in SPM view:
# - Rate: requests per second per operation
# - Errors: error count and percentage per operation
# - Duration: P50, P95, P99 latency per operation
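To show what the Rate, Errors, and Duration figures mean, here is a small Python sketch that derives them from raw span records. It illustrates the arithmetic only and is not how Jaeger computes its metrics. The record layout (operation, duration in milliseconds, error flag), the nearest-rank percentile method, and the sample data are assumptions made for this example.

```python
import math
from collections import defaultdict


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def red_metrics(spans, window_s):
    """Aggregate (operation, duration_ms, is_error) records per operation."""
    by_op = defaultdict(list)
    for operation, duration_ms, is_error in spans:
        by_op[operation].append((duration_ms, is_error))

    report = {}
    for operation, records in by_op.items():
        durations = sorted(d for d, _ in records)
        errors = sum(1 for _, failed in records if failed)
        report[operation] = {
            "rate_per_s": len(records) / window_s,
            "error_pct": 100 * errors / len(records),
            "p50_ms": percentile(durations, 50),
            "p95_ms": percentile(durations, 95),
            "p99_ms": percentile(durations, 99),
        }
    return report


# Hypothetical spans observed over a 10-second window; the two slowest failed.
durations_ms = (20, 30, 40, 50, 60, 70, 80, 90, 900, 1000)
spans = [("POST /users", ms, ms >= 900) for ms in durations_ms]
metrics = red_metrics(spans, window_s=10)["POST /users"]
print(metrics)
```

In this sample the median is 60 ms while P95 and P99 are both 1000 ms, which is why the tail percentiles are usually the more useful ones to watch.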
Sampling Strategies
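Sampling strategies file (JSON), loaded by the collector via --sampling.strategies-file:
{
  "default_strategy": {
    "type": "probabilistic",
    "param": 0.001
  },
  "service_strategies": [
    {
      "service": "user-service",
      "type": "probabilistic",
      "param": 0.1,
      "operation_strategies": [
        { "operation": "POST /users", "type": "probabilistic", "param": 1.0 },
        { "operation": "GET /health", "type": "probabilistic", "param": 0.001 }
      ]
    },
    {
      "service": "payment-service",
      "type": "probabilistic",
      "param": 1.0
    }
  ]
}
The default strategy samples 0.1% of all traces. user-service samples 10%, with user creation (POST /users) always sampled and health checks (GET /health) sampled at 0.1%. payment-service is always sampled. Operation-level strategies support only the probabilistic type; the ratelimiting type, where param is the maximum number of traces per second, can be set at the service level.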
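The two sampler types used in this section can be sketched in a few lines of Python. This shows the general technique only and is not Jaeger's implementation. The function and class names, the assumption of 64-bit trace IDs, and the token-bucket design for rate limiting are all choices made for this example.

```python
import random
import time

MAX_TRACE_ID = 2**64  # this sketch assumes 64-bit trace IDs


def probabilistic_sample(trace_id, param):
    """Sample a trace when its ID falls below param * 2^64.

    Deriving the decision from the trace ID means every service that sees
    the same ID reaches the same answer.
    """
    return trace_id < param * MAX_TRACE_ID


class RateLimiter:
    """Token bucket allowing roughly `per_second` sampled traces per second."""

    def __init__(self, per_second, clock=time.monotonic):
        self.per_second = per_second
        self.clock = clock
        self.tokens = float(per_second)
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket size.
        refill = (now - self.last) * self.per_second
        self.tokens = min(self.per_second, self.tokens + refill)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Usage: with param=0.1, roughly 10% of random trace IDs are kept.
rng = random.Random(42)
trace_ids = [rng.getrandbits(64) for _ in range(100_000)]
kept = sum(probabilistic_sample(t, 0.1) for t in trace_ids)
print(f"kept {kept / len(trace_ids):.1%} of traces")
```

Both samplers decide at the start of a trace, before the outcome of the request is known, so they cannot single out errors or slow requests on their own.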
Common Configuration Mistakes
Sampling everything in production: 100% sampling creates excessive storage and network costs at high traffic volumes. Use a low probabilistic rate (0.1-1%) for head-based sampling, or use tail-based sampling to retain errors and slow requests.
Not instrumenting database calls: The longest spans are often database queries. Missing DB instrumentation hides the most common performance bottleneck. Always instrument your data layer.
Missing error attributes on spans: Span status defaults to Unset, not Error. Set SpanStatusCode.ERROR with the error message on exception paths, or traces show no errors even when requests fail.
Using inconsistent span names: Span names should follow a consistent pattern (HTTP_METHOD /route or RPC_SERVICE/METHOD). Inconsistent names make analysis and grouping unreliable.
Forgetting to propagate context: Context must be passed via HTTP headers (traceparent, tracestate) or gRPC metadata. Without propagation, each service creates a separate trace rather than contributing to the same trace.
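The propagation step can be illustrated with the W3C traceparent header, which has the form version-traceid-parentid-flags. The Python sketch below parses an incoming header and builds the header for an outgoing call. The helper names are hypothetical, and starting a new sampled trace when no valid header arrives is an assumption of this example. In practice the OpenTelemetry SDK propagators do this for you.

```python
import re
import secrets

# Version "00", 32 hex chars of trace ID, 16 hex chars of parent span ID,
# and 2 hex chars of flags, separated by dashes.
TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(header):
    """Return (trace_id, parent_span_id, sampled), or None if invalid."""
    match = TRACEPARENT_RE.fullmatch(header.strip())
    if not match:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid in the W3C Trace Context format.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return trace_id, span_id, bool(int(flags, 16) & 0x01)


def child_traceparent(incoming):
    """Build the header for an outgoing call: same trace ID, new span ID."""
    parsed = parse_traceparent(incoming) if incoming else None
    if parsed:
        trace_id, _, sampled = parsed
    else:
        # No valid context arrived, so this service starts a new trace
        # (marking it as sampled is an assumption of this sketch).
        trace_id, sampled = secrets.token_hex(16), True
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
print(outgoing)  # same trace ID as the incoming header, fresh span ID
```

Because the trace ID is carried over unchanged, the downstream service's spans join the caller's trace instead of starting a new one.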
Practice Questions
What is a span in distributed tracing? Answer: A span represents a single unit of work in a distributed system, with a name, start time, duration, optional parent span, and attributes. Spans form a tree structure called a trace.
What is the purpose of context propagation? Answer: Context propagation passes trace and span IDs across service boundaries via HTTP headers or gRPC metadata, ensuring that spans from different services belong to the same trace.
How does sampling reduce storage costs in Jaeger? Answer: Sampling captures only a fraction of traces (e.g., 1% of requests) while still providing statistically significant performance data. High-value traces (errors, slow requests) can be force-sampled.
What is the difference between head-based and tail-based sampling? Answer: Head-based sampling decides at the start of a trace, before it is known whether the request will be slow or fail. Tail-based sampling decides after the trace completes, allowing selective retention of important traces.
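The tail-based decision described in the last answer can be sketched as a function that runs once a trace is complete. This Python example is a simplified illustration under stated assumptions: the span dictionaries with "duration_ms" and "error" keys, the 1000 ms threshold, and the 1% baseline rate are all hypothetical. Real deployments typically make this decision in a collector component that buffers whole traces.

```python
import random


def tail_sample(spans, latency_threshold_ms=1000, baseline_rate=0.01,
                rng=random.random):
    """Decide whether to keep a completed trace.

    `spans` is a non-empty list of dicts with the hypothetical keys
    "duration_ms" and "error".
    """
    if any(span["error"] for span in spans):
        return True  # always keep traces that contain a failure
    if max(span["duration_ms"] for span in spans) >= latency_threshold_ms:
        return True  # always keep slow traces
    # Otherwise keep a small random share as a baseline of healthy traffic.
    return rng() < baseline_rate


failed = [{"duration_ms": 40, "error": False}, {"duration_ms": 12, "error": True}]
slow = [{"duration_ms": 4500, "error": False}]
healthy = [{"duration_ms": 35, "error": False}]

print(tail_sample(failed), tail_sample(slow))  # True True
```

The cost of this approach is that every span must be held until its trace finishes, which is why head-based sampling remains the cheaper default.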
Challenge
Deploy a complete Jaeger tracing pipeline: deploy Jaeger Collector and Query services with Elasticsearch storage, instrument a Node.js or Python microservice with OpenTelemetry (HTTP, database, and external API calls), configure probabilistic sampling (1% default, 100% for errors), create a custom span for a critical business operation, propagate context across HTTP calls, use the Jaeger UI to find the slowest operation in a trace, and set up service performance monitoring.
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.