Distributed Tracing with Jaeger: Monitor Microservices

DodaTech Updated 2026-06-23 6 min read

This tutorial covers distributed tracing with Jaeger for a small microservice application: the key concepts, a hands-on example, and best practices for moving from a local setup to a production deployment.

What You Will Learn

This tutorial teaches you how to deploy Jaeger for distributed tracing, instrument a microservice application, analyze trace waterfalls, and identify latency bottlenecks using Jaeger UI and its deep-dive features.

Why It Matters

In a monolith, a single profiler shows you where time is spent. In a microservice architecture, a single request can span ten or more services, each adding latency. Jaeger shows the entire request path in a single view so you can pinpoint which service is slow.

Real-World Use

The DodaTech file sync service was experiencing intermittent slowdowns. Jaeger traces revealed that one specific service was making 15 separate Redis calls per request instead of batching them. The fix reduced request latency from 1.2s to 180ms.

Jaeger is an open-source distributed tracing system originally built by Uber. It is a Cloud Native Computing Foundation graduated project. Jaeger supports the OpenTelemetry protocol (OTLP) for trace ingestion and provides a rich UI for trace search, comparison, and analysis. It can run as a single all-in-one binary or as a scalable production deployment with separate collector, query, and storage components.

Prerequisites

Docker and Docker Compose installed
Python 3.8+ or Node.js 18+
Understanding of OpenTelemetry tracing basics (traces, spans, context propagation)
A Kubernetes cluster (optional, for production deployment)

Step-by-Step Tutorial

Step 1: Deploy Jaeger All-in-One

docker run -d --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  -p 5778:5778 \
  jaegertracing/all-in-one:1.57

Expected output: docker prints the ID of the new container. Once the container is running, the Jaeger UI is available at http://localhost:16686, with OTLP gRPC on port 4317 and OTLP HTTP on port 4318.
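To confirm that the published ports actually accept connections before instrumenting anything, a small standard-library check can help. This is a minimal sketch, not part of Jaeger: the names JAEGER_PORTS, port_open, and check_jaeger are hypothetical, and it assumes Jaeger runs on localhost with the port mappings from the docker run command above.

```python
import socket

# Ports published by the docker run command in Step 1 (assumed unchanged).
JAEGER_PORTS = {"ui": 16686, "otlp_grpc": 4317, "otlp_http": 4318}


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def check_jaeger(host: str = "localhost") -> dict:
    """Map each Jaeger port name to whether it accepts connections."""
    return {name: port_open(host, port) for name, port in JAEGER_PORTS.items()}


if __name__ == "__main__":
    for name, ok in check_jaeger().items():
        print(f"{name}: {'reachable' if ok else 'NOT reachable'}")
```

A successful TCP connection only shows that something is listening; it does not prove the listener is Jaeger or that OTLP ingestion is enabled.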

Step 2: Create a Multi-Service Application

Create three files. First, gateway.py:

from flask import Flask
import requests
import time
import random

app = Flask(__name__)

@app.route("/process")
def process():
    time.sleep(random.uniform(0.01, 0.05))
    r1 = requests.get("http://localhost:5001/validate")
    r2 = requests.get("http://localhost:5002/enrich")
    return {"gateway": "ok", "validate": r1.json(), "enrich": r2.json()}

if __name__ == "__main__":
    app.run(port=5000)

Second, validator.py:

from flask import Flask
import time
import random

app = Flask(__name__)

@app.route("/validate")
def validate():
    time.sleep(random.uniform(0.05, 0.3))
    return {"valid": True, "score": random.randint(0, 100)}

if __name__ == "__main__":
    app.run(port=5001)

Third, enricher.py:

from flask import Flask
import time
import random

app = Flask(__name__)

@app.route("/enrich")
def enrich():
    time.sleep(random.uniform(0.1, 0.8))
    return {"enriched": True, "tags": ["user", "premium", "beta"]}

if __name__ == "__main__":
    app.run(port=5002)

Step 3: Instrument All Services with OpenTelemetry

Install dependencies:

pip install opentelemetry-api opentelemetry-sdk \
  opentelemetry-instrumentation-flask \
  opentelemetry-instrumentation-requests \
  opentelemetry-exporter-otlp-proto-grpc

Create tracing.py (shared by all services):

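import os

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Service name shown in the Jaeger UI; set per process via SERVICE_NAME.
service_name = os.getenv("SERVICE_NAME", "unknown")

# Attach the service name to every span through the provider's resource.
# Without it, all three services report under the same default name.
provider = TracerProvider(
    resource=Resource.create({"service.name": service_name})
)
# Batch spans and export them to Jaeger's OTLP gRPC port over plaintext.
processor = BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
)
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

def instrument_app(app):
    # Trace incoming Flask requests and outgoing calls made with requests.
    FlaskInstrumentor().instrument_app(app)
    RequestsInstrumentor().instrument()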

Update each service: add from tracing import instrument_app to the imports and call instrument_app(app) immediately after app = Flask(__name__).
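The requests instrumentation links the three services by injecting trace context into outgoing HTTP headers. By default OpenTelemetry uses the W3C Trace Context format, whose traceparent header carries a version, a trace ID, the caller's span ID, and flags. The sketch below illustrates only that header layout using the standard library; it is not the OpenTelemetry implementation, and the function names are hypothetical.

```python
import re
import secrets

# traceparent layout: version-traceid-parentid-flags, all lowercase hex.
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace: random 16-byte trace ID and 8-byte span ID."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its four fields."""
    match = TRACEPARENT_RE.match(header.strip().lower())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, parent_id, flags = match.groups()
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "flags": flags,
        # The lowest flag bit marks the trace as sampled.
        "sampled": bool(int(flags, 16) & 1),
    }


def child_traceparent(incoming: str) -> str:
    """Header a service would send downstream: same trace ID, new span ID."""
    ctx = parse_traceparent(incoming)
    return f"00-{ctx['trace_id']}-{secrets.token_hex(8)}-{ctx['flags']}"
```

Because every hop keeps the trace ID and replaces only the span ID, Jaeger can stitch spans from gateway, validator, and enricher into one trace. If a service drops the header, its spans show up as a separate, disconnected trace.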

Step 4: Set Service Names via Environment Variables

SERVICE_NAME=gateway python gateway.py &
SERVICE_NAME=validator python validator.py &
SERVICE_NAME=enricher python enricher.py &

Step 5: Generate Load

for i in $(seq 1 50); do
  curl http://localhost:5000/process
  sleep 0.5
done
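The same load can be generated from Python, which also makes it easy to record client-side latency for later comparison with the durations Jaeger reports. This is a rough sketch under the assumption that the gateway from Step 2 listens on http://localhost:5000; timed_get, run_load, and summarize are hypothetical helper names.

```python
import math
import statistics
import time
import urllib.request


def timed_get(url: str, timeout: float = 5.0) -> float:
    """Fetch url once and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()
    return time.perf_counter() - start


def run_load(url: str, count: int = 50, pause: float = 0.5) -> list:
    """Send count sequential requests, pausing between them."""
    latencies = []
    for _ in range(count):
        latencies.append(timed_get(url))
        time.sleep(pause)
    return latencies


def summarize(latencies: list) -> dict:
    """Return count, median, nearest-rank 95th percentile, and maximum."""
    if not latencies:
        raise ValueError("no latencies recorded")
    ordered = sorted(latencies)
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "count": len(ordered),
        "p50": statistics.median(ordered),
        "p95": ordered[p95_index],
        "max": ordered[-1],
    }


# Example (requires the three services to be running):
# print(summarize(run_load("http://localhost:5000/process")))
```

The p95 and max values are the interesting ones here: the slow requests they point at are the traces worth opening in the Jaeger UI.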

Step 6: Search Traces in Jaeger UI

Open http://localhost:16686
In the Search panel, select gateway as the Service
Click Find Traces
Click on any trace to see the waterfall

Step 7: Analyze the Trace Waterfall

In the trace detail view, look at:

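Trace graph: An alternative view of the same trace, selectable from the view menu, showing how its spans call each other (the cross-trace service dependency graph is on the separate System Architecture page)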
Span timeline: Each span's duration shown as a horizontal bar
Tags and logs: Click on any span to see its attributes and events
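Reading the waterfall amounts to asking which span spends the most time on its own work rather than waiting on children. The sketch below does that arithmetic on a hand-written trace in the JSON shape Jaeger exports (spans with spanID, processID, duration in microseconds, and CHILD_OF references). The sample values are invented for illustration, and subtracting child durations is a simplification that assumes children do not overlap in time.

```python
from collections import defaultdict

# Invented sample in the shape of Jaeger's trace JSON (durations in microseconds).
SAMPLE_TRACE = {
    "traceID": "abc123",
    "processes": {
        "p1": {"serviceName": "gateway"},
        "p2": {"serviceName": "validator"},
        "p3": {"serviceName": "enricher"},
    },
    "spans": [
        {"spanID": "s1", "processID": "p1", "operationName": "/process",
         "duration": 700000, "references": []},
        {"spanID": "s2", "processID": "p2", "operationName": "/validate",
         "duration": 200000,
         "references": [{"refType": "CHILD_OF", "spanID": "s1"}]},
        {"spanID": "s3", "processID": "p3", "operationName": "/enrich",
         "duration": 450000,
         "references": [{"refType": "CHILD_OF", "spanID": "s1"}]},
    ],
}


def self_times(trace: dict) -> list:
    """Return (service, operation, self_time_ms) rows, slowest first."""
    # Sum the durations of each span's direct children.
    child_total = defaultdict(int)
    for span in trace["spans"]:
        for ref in span.get("references", []):
            if ref.get("refType") == "CHILD_OF":
                child_total[ref["spanID"]] += span["duration"]

    rows = []
    for span in trace["spans"]:
        service = trace["processes"][span["processID"]]["serviceName"]
        # Self time: the span's duration minus time spent in its children.
        self_us = max(0, span["duration"] - child_total[span["spanID"]])
        rows.append((service, span["operationName"], self_us / 1000.0))
    return sorted(rows, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    for service, operation, ms in self_times(SAMPLE_TRACE):
        print(f"{service:10s} {operation:12s} {ms:8.1f} ms")
```

In this sample the gateway span is the longest overall, but almost all of its time is spent waiting on the enricher, which is the pattern to look for in the timeline.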

Step 8: Compare Traces

Jaeger can compare two traces side-by-side:

Select two traces from the search results, ideally one slow and one normal
Click the "Compare" button
Review the diff view to see where the slow trace's spans differ from those of the normal trace
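A comparison can also be done numerically on two exported traces. The sketch below is not how the Jaeger UI computes its diff; it sums span durations per service and operation in each trace and reports the difference, assuming the exported JSON shape with spans, processes, and durations in microseconds. operation_totals and diff_traces are hypothetical names.

```python
from collections import Counter


def operation_totals(trace: dict) -> Counter:
    """Total span duration in microseconds per 'service:operation' key."""
    totals = Counter()
    for span in trace["spans"]:
        service = trace["processes"][span["processID"]]["serviceName"]
        totals[f"{service}:{span['operationName']}"] += span["duration"]
    return totals


def diff_traces(normal: dict, slow: dict) -> list:
    """Return (key, delta_ms) pairs, largest slowdown first.

    A positive delta means the operation took longer in the slow trace.
    Operations missing from one trace count as zero there.
    """
    before = operation_totals(normal)
    after = operation_totals(slow)
    keys = sorted(set(before) | set(after))
    deltas = [(key, (after[key] - before[key]) / 1000.0) for key in keys]
    return sorted(deltas, key=lambda item: item[1], reverse=True)
```

Parent spans include the time of their children, so the gateway entry usually shows a slowdown as large as the downstream service that caused it. The downstream entry with a similar delta is the one to investigate.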

Step 9: Deploy Jaeger Production Architecture

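# docker-compose.yml: collector and query share one Elasticsearch backend.
services:
  jaeger-collector:
    image: jaegertracing/jaeger-collector:1.57
    environment:
      - SPAN_STORAGE_TYPE=elasticsearch
      # Point Jaeger at the elasticsearch service defined below.
      - ES_SERVER_URLS=http://elasticsearch:9200
    ports:
      - "4317:4317"
      - "14250:14250"
    depends_on:
      - elasticsearch
    # Retry until Elasticsearch is ready to accept connections.
    restart: on-failure

  jaeger-query:
    image: jaegertracing/jaeger-query:1.57
    environment:
      - SPAN_STORAGE_TYPE=elasticsearch
      - ES_SERVER_URLS=http://elasticsearch:9200
    ports:
      - "16686:16686"
    depends_on:
      - elasticsearch
    restart: on-failure

  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.13.0
    environment:
      - discovery.type=single-node
      # Security is disabled for local testing only.
      - xpack.security.enabled=false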

Learning Path

flowchart LR
    A[Deploy Jaeger] --> B[Instrument Services]
    B --> C[Generate Traffic]
    C --> D[Search Traces]
    D --> E[Analyze Waterfall]
    E --> F[Find Bottlenecks]
    E -.-> G[Compare Traces]
    E -.-> H[Service Dependencies]
    style A fill:#4a90d9,color:#fff
    style F fill:#e67e22,color:#fff

Common Errors

No traces appear in Jaeger -- The OTLP endpoint is unreachable. Verify Jaeger collector is listening on port 4317 with docker logs jaeger.
Traces appear but spans are disconnected -- Context propagation is not working. Ensure opentelemetry-instrumentation-requests is installed on the calling service and that RequestsInstrumentor().instrument() runs before the first outgoing request.
Jaeger UI shows "No data" for service dropdown -- The index is empty. It takes a few seconds for traces to be indexed after ingestion.
Elasticsearch storage fails to connect -- Check that SPAN_STORAGE_TYPE=elasticsearch is set on both the collector and query services and that ES_SERVER_URLS points at a reachable Elasticsearch node.
Sampling rate is too high causing performance issues -- Set OTEL_TRACES_SAMPLER=parentbased_traceidratio and OTEL_TRACES_SAMPLER_ARG=0.1 to sample only 10% of requests.
Trace shows only one service -- Downstream services are not instrumented or the instrumentation is not applied. Check each service's startup logs.
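gRPC endpoint refuses connection -- Port 4317 is not published or OTLP ingestion is disabled. Run the all-in-one container with -p 4317:4317, and set COLLECTOR_OTLP_ENABLED=true on older Jaeger versions where OTLP is not enabled by default.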
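The sampling settings mentioned above decide per trace, not per span, whether data is recorded. The sketch below shows the general idea behind a trace-ID ratio sampler: the decision is a deterministic function of the trace ID, so every service that sees the same trace reaches the same verdict. It is a simplified illustration, not the OpenTelemetry SDK's code, and should_sample is a hypothetical name.

```python
import random

# Only the low 64 bits of the 128-bit trace ID are used for the decision here.
TRACE_ID_MASK = (1 << 64) - 1


def should_sample(trace_id: int, ratio: float) -> bool:
    """Return True if the trace falls inside the sampled fraction."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Trace IDs are effectively random, so comparing against a bound
    # proportional to the ratio keeps roughly that fraction of traces.
    bound = round(ratio * (1 << 64))
    return (trace_id & TRACE_ID_MASK) < bound


if __name__ == "__main__":
    rng = random.Random(0)
    ids = [rng.getrandbits(128) for _ in range(10_000)]
    kept = sum(should_sample(trace_id, 0.1) for trace_id in ids)
    print(f"kept {kept} of {len(ids)} traces")
```

With a parent-based sampler, only the service that starts the trace makes this decision; downstream services follow the sampled flag they receive from the caller, which keeps sampled traces complete.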

Practice Questions

What is a trace in Jaeger? Answer: A trace is the complete path of a single request through a distributed system, composed of multiple spans.
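How does Jaeger store trace data? Answer: Jaeger supports in-memory storage (the all-in-one default), the embedded Badger store, Elasticsearch or OpenSearch, and Cassandra as storage backends. Kafka is not a storage backend; it can sit between the collector and storage as a buffer.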
What is the purpose of the Jaeger Query service? Answer: The Query service provides an API and web UI for searching, retrieving, and visualizing traces from the storage backend.
How does sampling affect trace collection? Answer: Sampling reduces the number of collected traces to control storage costs and performance overhead while maintaining statistical significance.
What is the difference between Jaeger all-in-one and production deployment? Answer: All-in-one combines agent, collector, query, and UI in a single process for development. Production deployment separates them for scalability and reliability.

Challenge

Deploy Jaeger with Elasticsearch storage backend. Build a four-service application (API gateway, user service, order service, payment service) where each service makes calls to the next. Instrument all services with OpenTelemetry and ensure context propagation works end-to-end. Generate 200 requests and verify all traces appear in Jaeger with the correct span hierarchy. Add a simulated 2-second delay to the payment service and verify that Jaeger's trace waterfall clearly identifies it as the bottleneck. Use Jaeger's compare feature to show the difference between a normal request and a slow request. Export the service dependency graph.

FAQ

What is the difference between Jaeger and Zipkin?

Both are distributed tracing systems. Jaeger offers richer UI features (dependency graph, trace comparison) and broader storage backend support. Zipkin is simpler and lighter.

Can I use Jaeger with OpenTelemetry?

Yes, Jaeger natively supports the OpenTelemetry Protocol (OTLP). You can send traces directly to Jaeger using OTLP exporters.

Does Jaeger support metrics or logs?

Jaeger is focused on traces. For a complete observability stack, combine Jaeger with Prometheus (metrics) and Loki (logs).

Is Jaeger suitable for production use?

Yes, Jaeger is a CNCF graduated project used in production at Uber, Red Hat, and many other organizations.

How long does Jaeger retain traces?

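Retention depends on the storage backend. In-memory: traces are lost when the process restarts, and the oldest are evicted once the configured maximum number of traces is reached. Elasticsearch: configurable through index retention policy (typically 7-30 days).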
