Jaeger — Distributed Tracing Guide

DodaTech Updated 2026-06-24 5 min read

This tutorial covers deploying Jaeger, instrumenting Node.js and Python services with OpenTelemetry, querying traces, and configuring sampling, along with common mistakes to avoid.

Jaeger is an open-source distributed tracing platform, originally built by Uber, for monitoring and troubleshooting microservice-based distributed systems. It tracks requests as they propagate across service boundaries.

Why It Matters

In a monolithic application, a slow request is easy to trace: you check one code path. In a microservice architecture, a single user request may traverse 20+ services, each adding latency. Distributed tracing follows that request across all services, showing exactly where time is spent. DodaTech reduced mean time to resolution for performance issues by 70% after adopting Jaeger across all of its microservices.

Real-World Use

When DodaZIP users reported slow file uploads, the SRE team opened Jaeger and found a trace showing the upload service spending 4.5 seconds waiting on a virus scan service. The scan service was waiting on a database query with a missing index. The trace showed the exact database call, down to the SQL statement.

flowchart LR
    A[User Request] --> B[API Gateway]
    B --> C[Auth Service]
    B --> D[Upload Service]
    D --> E[Virus Scan]
    D --> F[Storage Service]
    E --> G[DB Query]
    F --> H[S3 Upload]
    G --> I[Trace: 4.5s total]
    G --> J[Span: DB query - 3.8s]
    style I fill:#60C0E0,color:#fff
    style J fill:#FF6B6B,color:#fff
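
The way a trace points at a bottleneck can be sketched with a "self time" calculation: each span's duration minus the time spent in its direct children. This is a minimal illustration, not how the Jaeger UI computes anything. The span dictionaries and field names are hypothetical, the 4500 ms and 3800 ms figures come from the incident above, the 4700 ms upload duration is an assumed value, and the sketch assumes child spans run one after another rather than concurrently.

```python
from collections import defaultdict

# Illustrative spans, durations in milliseconds. Only the 4500 ms and 3800 ms
# values come from the incident described above; 4700 ms is an assumption.
SPANS = [
    {"id": "upload", "parent": None, "name": "upload-service", "duration_ms": 4700},
    {"id": "scan", "parent": "upload", "name": "virus-scan", "duration_ms": 4500},
    {"id": "db", "parent": "scan", "name": "db-query", "duration_ms": 3800},
]


def self_times(spans):
    """Return {span name: duration minus time spent in direct children}."""
    child_total = defaultdict(int)
    for span in spans:
        if span["parent"] is not None:
            child_total[span["parent"]] += span["duration_ms"]
    return {s["name"]: s["duration_ms"] - child_total[s["id"]] for s in spans}


def bottleneck(spans):
    """Return the name of the span with the largest self time."""
    times = self_times(spans)
    return max(times, key=times.get)


print(self_times(SPANS))  # {'upload-service': 200, 'virus-scan': 700, 'db-query': 3800}
print(bottleneck(SPANS))  # db-query
```

Most of the upload span's 4.7 seconds is spent waiting, so ranking by raw duration would blame the wrong service. Ranking by self time surfaces the database query instead.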

ℹ️ Info

Prerequisites: Basic understanding of microservices and distributed systems. Familiarity with Docker for deployment.

Installation

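# Deploy Jaeger all-in-one (for evaluation)
# Ports: 16686 = UI and query API, 4317 = OTLP gRPC, 4318 = OTLP HTTP,
#        5778 = sampling configuration, 14269 = admin (health check)
docker run -d \
  --name jaeger \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  -p 5778:5778 \
  -p 14269:14269 \
  jaegertracing/all-in-one:1.60

# Expected output:
# Container ID returned

# Verify using the health check on the admin port
curl http://localhost:14269/

# Expected output (timestamps vary):
# {"status":"Server available","upSince":"...","uptime":"..."}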

# Access UI at http://localhost:16686

Production Deployment

# docker-compose.yml
version: '3.8'

services:
  collector:
    image: jaegertracing/jaeger-collector:1.60
    environment:
      - SPAN_STORAGE_TYPE=elasticsearch
      - ES_SERVER_URLS=http://elasticsearch:9200
      - COLLECTOR_OTLP_ENABLED=true
    ports:
      - "4317:4317"
      - "4318:4318"

  query:
    image: jaegertracing/jaeger-query:1.60
    environment:
      - SPAN_STORAGE_TYPE=elasticsearch
      - ES_SERVER_URLS=http://elasticsearch:9200
    ports:
      - "16686:16686"

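  # jaeger-agent is deprecated. It is only needed for legacy Jaeger clients that
  # send spans over UDP; OpenTelemetry SDKs export OTLP directly to the collector.
  agent: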
    image: jaegertracing/jaeger-agent:1.60
    environment:
      - REPORTER_GRPC_HOST_PORT=collector:14250
    ports:
      - "5778:5778"
      - "6831:6831/udp"
      - "6832:6832/udp"
    depends_on:
      - collector

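  # Single node with security disabled: fine for testing, harden before real use
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.14.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - ES_JAVA_OPTS=-Xms1g -Xmx1g
    volumes:
      - es_data:/usr/share/elasticsearch/data

# Named volumes must be declared at the top level
volumes:
  es_data: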

OpenTelemetry Instrumentation

// Node.js application with OpenTelemetry
// Assumes the 1.x OpenTelemetry JS SDK (Resource class, provider.addSpanProcessor)
const { SpanStatusCode } = require('@opentelemetry/api');
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { Resource } = require('@opentelemetry/resources');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');

const provider = new NodeTracerProvider({
  resource: new Resource({
    'service.name': 'user-service',
    'service.version': '2.5.0',
    'deployment.environment': 'production',
  }),
});

const exporter = new OTLPTraceExporter({
  url: 'http://jaeger-collector:4317',
});

provider.addSpanProcessor(new BatchSpanProcessor(exporter));
provider.register();

// Create a custom span
const tracer = provider.getTracer('user-service-handler');

async function handleRequest(req, res) {
  const span = tracer.startSpan('POST /users', {
    attributes: {
      'http.method': 'POST',
      'http.route': '/users',
    },
  });

  try {
    const result = await createUser(req.body);
    span.setAttribute('user.id', result.id);
    span.setStatus({ code: SpanStatusCode.OK });
  } catch (error) {
    span.setStatus({
      code: SpanStatusCode.ERROR,
      message: error.message,
    });
    span.recordException(error);
    throw error;
  } finally {
    span.end();
  }
}

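// Automatic instrumentation for HTTP, Express, and MongoDB.
// Register these before the application requires express, http, or mongodb,
// otherwise the modules are loaded unpatched and produce no spans.
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { MongoDBInstrumentation } = require('@opentelemetry/instrumentation-mongodb');

registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
    new MongoDBInstrumentation(),
  ],
});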

# Python application with OpenTelemetry
from flask import Flask
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Set service.name so traces appear under a named service in Jaeger
provider = TracerProvider(
    resource=Resource.create({"service.name": "user-service"})
)
processor = BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://jaeger-collector:4317")
)
provider.add_span_processor(processor)  # snake_case in the Python SDK
trace.set_tracer_provider(provider)

# Auto-instrument Flask and outgoing calls made with the requests library
app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()

Trace Analysis

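# Jaeger API queries
# Note: this is the HTTP JSON API that the UI uses. It is internal and
# may change between Jaeger versions.
# Search traces by service
curl "http://localhost:16686/api/traces?service=user-service&limit=10"

# Search by tags (the tags value is URL-encoded JSON: {"http.status_code":"500"})
curl "http://localhost:16686/api/traces?service=user-service&tags=%7B%22http.status_code%22%3A%22500%22%7D"

# Get trace by ID (replace abc123def456 with a real trace ID)
curl "http://localhost:16686/api/traces/abc123def456"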

# Get service list
curl "http://localhost:16686/api/services"

# Expected output:
# {"data":["api-gateway","user-service","auth-service","postgres"],"total":4}
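
The URL-encoded tags parameter is easy to get wrong by hand. The sketch below builds the same query URLs programmatically; `build_trace_query_url` is a hypothetical helper name, and it only constructs the URL (it does not contact Jaeger).

```python
import json
from urllib.parse import quote, urlencode


def build_trace_query_url(base_url, service, tags=None, limit=None):
    """Build a /api/traces search URL; tags are sent as compact, URL-encoded JSON."""
    params = {"service": service}
    if tags:
        # Compact separators keep the encoded value free of spaces
        params["tags"] = json.dumps(tags, separators=(",", ":"))
    if limit is not None:
        params["limit"] = str(limit)
    # quote_via=quote percent-encodes every reserved character
    return f"{base_url}/api/traces?{urlencode(params, quote_via=quote)}"


print(build_trace_query_url(
    "http://localhost:16686", "user-service", tags={"http.status_code": "500"}
))
```

The printed URL should match the "Search by tags" curl command above character for character.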

Service Performance Monitoring

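# Jaeger UI > Monitor tab (Service Performance Monitoring)
# SPM reports RED metrics (Rate, Errors, Duration) per service and operation.
# It relies on span metrics, typically generated by the OpenTelemetry Collector's
# spanmetrics connector and stored in a Prometheus-compatible backend.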

# Key metrics in SPM view:
# - Rate: requests per second per operation
# - Errors: error count and percentage per operation
# - Duration: P50, P95, P99 latency per operation
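
To make the three metrics concrete, here is a minimal sketch that computes them from a list of finished spans. It is not how Jaeger derives SPM data (that comes from span metrics, not from scanning traces), and the `duration_ms` and `error` fields are hypothetical names chosen for the example.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list (pct as an integer 1-100)."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def red_metrics(spans, window_seconds):
    """Rate, error percentage, and latency percentiles for one operation."""
    if not spans:
        raise ValueError("no spans in window")
    durations = sorted(s["duration_ms"] for s in spans)
    errors = sum(1 for s in spans if s["error"])
    return {
        "rate_per_s": len(spans) / window_seconds,
        "error_pct": 100 * errors / len(spans),
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
    }


# Synthetic data: 100 spans over 10 seconds, the 5 slowest marked as errors
spans = [{"duration_ms": d, "error": d > 95} for d in range(1, 101)]
print(red_metrics(spans, window_seconds=10))
```

Percentiles matter more than averages here: a P99 far above the P50 signals a slow tail that a mean would hide.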

Sampling Strategies

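Sampling strategies file (JSON), loaded by the collector via --sampling.strategies-file:

{
  "default_strategy": {
    "type": "probabilistic",
    "param": 0.001
  },
  "service_strategies": [
    {
      "service": "user-service",
      "type": "probabilistic",
      "param": 0.1,
      "operation_strategies": [
        { "operation": "POST /users", "type": "probabilistic", "param": 1.0 },
        { "operation": "GET /health", "type": "probabilistic", "param": 0.001 }
      ]
    },
    {
      "service": "payment-service",
      "type": "probabilistic",
      "param": 1.0
    }
  ]
}

The default strategy samples 0.1% of all traces. user-service samples 10%, with user creation always sampled and health checks sampled at 0.1%. payment-service is always sampled. Operation-level strategies support only the probabilistic type; rate limiting (type ratelimiting, with param as the maximum traces per second) is available at the service and default levels.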
Common Configuration Mistakes

Sampling everything in production: 100% sampling creates excessive storage and network costs. In production, use a low head-based probabilistic rate (0.1-1%), or tail-based sampling if you need to keep every error or slow trace.
Not instrumenting database calls: The longest spans are often database queries. Missing DB instrumentation hides the most common performance bottleneck. Always instrument your data layer.
Missing error status on spans: Span status defaults to Unset, not Error. Set SpanStatusCode.ERROR with the error message on exception paths, or traces show no errors even when requests fail.
Using inconsistent span names: Span names should follow a consistent, low-cardinality pattern (HTTP_METHOD /route or RPC_SERVICE/METHOD). Inconsistent names make analysis and grouping unreliable.
Forgetting to propagate context: Context must be passed via HTTP headers (traceparent, tracestate) or gRPC metadata. Without propagation, each service creates a separate trace rather than contributing to the same trace.
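
The traceparent header mentioned in the last item has a fixed format: version, 32-hex trace ID, 16-hex parent span ID, and 2-hex flags, joined by hyphens. The sketch below parses an incoming header and builds the one for an outgoing call. OpenTelemetry's propagators do this automatically; the function names here are hypothetical and the code only illustrates the wire format for version 00.

```python
import re
import secrets

_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def format_traceparent(trace_id, span_id, sampled=True):
    """Render a version-00 traceparent header value."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header):
    """Return the trace context as a dict, or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are invalid
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 1),
    }


def child_headers(incoming_headers):
    """Headers for an outgoing call: same trace ID, fresh span ID."""
    ctx = parse_traceparent(incoming_headers.get("traceparent", ""))
    if ctx is None:
        # No valid context: start a new trace (sampled=True is an assumption)
        ctx = {"trace_id": secrets.token_hex(16), "sampled": True}
    span_id = secrets.token_hex(8)
    return {"traceparent": format_traceparent(ctx["trace_id"], span_id, ctx["sampled"])}


incoming = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
print(child_headers(incoming))
```

The outgoing header keeps the incoming trace ID, which is what lets Jaeger stitch spans from both services into a single trace.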

Practice Questions

What is a span in distributed tracing? Answer: A span represents a single unit of work in a distributed system, with a name, start time, duration, optional parent span, and attributes. Spans form a tree structure called a trace.
What is the purpose of context propagation? Answer: Context propagation passes trace and span IDs across service boundaries via HTTP headers or gRPC metadata, ensuring that spans from different services belong to the same trace.
How does sampling reduce storage costs in Jaeger? Answer: Sampling captures only a fraction of traces (e.g., 1% of requests) while still providing statistically significant performance data. High-value traces (errors, slow requests) can be retained selectively with tail-based sampling.
What is the difference between head-based and tail-based sampling? Answer: Head-based sampling decides at the start of a trace, before it is known whether the request will be slow or fail. Tail-based sampling decides after the trace completes, allowing selective retention of important traces.
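
The tail-based side of that last answer can be sketched as a decision function that runs once all spans of a trace have arrived. This is a minimal illustration, not a Jaeger or OpenTelemetry Collector API: `keep_trace`, its thresholds, and the span fields are hypothetical, and the longest span is used as a stand-in for total trace duration.

```python
import random


def keep_trace(spans, latency_threshold_ms=1000, baseline_rate=0.01, rng=random.random):
    """Tail-based decision: always keep errors and slow traces, sample the rest."""
    if not spans:
        return False
    if any(s["error"] for s in spans):
        return True  # keep every failed request
    if max(s["duration_ms"] for s in spans) >= latency_threshold_ms:
        return True  # keep every slow request
    return rng() < baseline_rate  # keep a small share of healthy traffic


slow_trace = [
    {"name": "upload-service", "duration_ms": 4700, "error": False},
    {"name": "db-query", "duration_ms": 3800, "error": False},
]
print(keep_trace(slow_trace))  # True
```

A head-based sampler could not make this choice, because the duration and error status are unknown when the first span starts. The trade-off is that tail-based sampling must buffer complete traces before deciding.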

Challenge

Deploy a complete Jaeger tracing pipeline: deploy Jaeger Collector and Query services with Elasticsearch storage, instrument a Node.js or Python microservice with OpenTelemetry (HTTP, database, and external API calls), configure sampling (1% probabilistic by default, 100% for errors, which requires a tail-based decision), create a custom span for a critical business operation, propagate context across HTTP calls, use the Jaeger UI to find the slowest operation in a trace, and set up service performance monitoring.

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
