
Tracing (OpenTelemetry)

Distributed Tracing, Spans, and Context Propagation.

Distributed Tracing

In a monolith, a stack trace tells you exactly what happened. In microservices, a request hits a Load Balancer, calls Auth, calls Billing, calls the Database, and then fails.

A stack trace only shows you the error inside the last service. Distributed Tracing reconstructs the entire journey.

Concepts

1. The Trace

The end-to-end record of a single request, tied together by a shared trace ID. It is what produces the "Waterfall" view you see in tools like Jaeger.

2. The Span

A single timed unit of work within a trace: a name, start and end timestamps, attributes, and a link to its parent span (sketched in code after the examples below).

  • Root Span: "GET /api/checkout" (Duration: 500ms).
  • Child Span: "SQL Query: INSERT orders" (Duration: 50ms).
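
A minimal sketch of how those two spans could be created with the OTel API for Node.js; the tracer name and the commented-out database call are placeholders, and a configured SDK is assumed (see the OpenTelemetry section below):

    import { trace, SpanStatusCode } from "@opentelemetry/api";

    const tracer = trace.getTracer("checkout-service");

    async function checkout(): Promise<void> {
      // Root span: wraps the whole request handler.
      await tracer.startActiveSpan("GET /api/checkout", async (rootSpan) => {
        try {
          // Child span: automatically parented to the currently active root span.
          await tracer.startActiveSpan("SQL Query: INSERT orders", async (dbSpan) => {
            // await db.insert(order);  // hypothetical database call
            dbSpan.end();
          });
        } catch (err) {
          rootSpan.setStatus({ code: SpanStatusCode.ERROR });
          throw err;
        } finally {
          rootSpan.end();
        }
      });
    }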

3. Context Propagation

How does Service B know that it belongs to Service A's trace? Headers. Service A injects trace-context headers into the HTTP request it sends to B, and B extracts them to continue the same trace (see the sketch after this list).

  • W3C Trace Context: The modern standard (traceparent header).
  • B3: The older Zipkin standard (X-B3-TraceId).
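
As a sketch of what that looks like with the OTel API (a registered SDK with the default W3C propagator is assumed; the billing URL and handler are hypothetical): Service A writes a traceparent header of the form 00-<trace-id>-<parent-span-id>-<flags> into the outgoing request, and Service B reads it back to parent its own spans.

    import { context, propagation, trace } from "@opentelemetry/api";

    // Service A: inject the active trace context into outgoing headers.
    async function callBilling(): Promise<void> {
      const headers: Record<string, string> = {};
      propagation.inject(context.active(), headers);  // writes `traceparent` (and `tracestate`)
      await fetch("http://billing/api/charge", { method: "POST", headers });
    }

    // Service B: extract the incoming context and continue the same trace.
    function handleCharge(incomingHeaders: Record<string, string>): void {
      const parentCtx = propagation.extract(context.active(), incomingHeaders);
      const span = trace.getTracer("billing").startSpan("charge", undefined, parentCtx);
      // ... business logic ...
      span.end();
    }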

OpenTelemetry (OTel)

Historically, each vendor shipped its own proprietary agent (Datadog Agent, New Relic Agent). Now the industry has standardized on OpenTelemetry, a vendor-neutral API, SDK, and wire protocol (OTLP).

  • OTel SDK: A library you import into your code (the @opentelemetry/api package plus a language SDK such as @opentelemetry/sdk-node; see the sketch after this list).
  • OTel Collector: A binary that sits next to your app, receives traces, batches them, and sends them to your backend (Tempo, Honeycomb, Datadog).
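
A minimal sketch of wiring the two together in Node.js; the service name and the Collector endpoint (the default OTLP/HTTP port of a local sidecar) are assumptions:

    import { NodeSDK } from "@opentelemetry/sdk-node";
    import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

    const sdk = new NodeSDK({
      serviceName: "checkout-service",
      // Export to the Collector sitting next to the app; the Collector then
      // batches and forwards to the backend (Tempo, Honeycomb, Datadog, ...).
      traceExporter: new OTLPTraceExporter({ url: "http://localhost:4318/v1/traces" }),
    });

    sdk.start();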

Sampling Strategies

Tracing every single request is expensive (CPU + Storage).

  1. Head-Based Sampling: Decide at the start of the request. "Keep 10% of traces." (See the SDK sketch after this list.)
    • Pro: Simple.
    • Con: You might miss the error trace (the one-in-a-million bug).
  2. Tail-Based Sampling: Buffer every trace in memory at the Collector. If the trace has an error, keep it. If it was successful/fast, discard it.
    • Pro: You keep the interesting data.
    • Con: High memory usage on the Collector, and all spans of a trace must reach the same Collector instance before the decision can be made.
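
A sketch of head-based sampling configured in the Node SDK (the 10% ratio mirrors the example above; ParentBased makes the service honor a sampling decision already carried in traceparent):

    import { NodeSDK } from "@opentelemetry/sdk-node";
    import { ParentBasedSampler, TraceIdRatioBasedSampler } from "@opentelemetry/sdk-trace-base";

    const sdk = new NodeSDK({
      sampler: new ParentBasedSampler({
        // Sample ~10% of new traces; child services follow their parent's decision.
        root: new TraceIdRatioBasedSampler(0.1),
      }),
    });

    sdk.start();

Tail-based sampling, by contrast, lives in the Collector rather than the SDK (for example via the tail_sampling processor in the Collector "contrib" distribution), because the decision needs all of a trace's spans in one place.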

Auto-Instrumentation

You often don't need to change application code. In Node.js, Java, and Python, OTel ships auto-instrumentation packages and agents that hook popular HTTP and database libraries (monkey-patching in Node.js and Python, a javaagent in Java) to generate spans automatically, as in the sketch below.
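
A sketch of turning this on programmatically in Node.js; the package names are real OTel instrumentation packages, but which ones you enable is your choice, and there is also a zero-code path via the register entry point of @opentelemetry/auto-instrumentations-node loaded with node --require:

    import { registerInstrumentations } from "@opentelemetry/instrumentation";
    import { HttpInstrumentation } from "@opentelemetry/instrumentation-http";
    import { ExpressInstrumentation } from "@opentelemetry/instrumentation-express";

    // Patches the http/https modules and Express so every inbound and outbound
    // request produces a span, with no changes to application code.
    registerInstrumentations({
      instrumentations: [new HttpInstrumentation(), new ExpressInstrumentation()],
    });

Note that this registration has to run before the application loads the libraries it patches, which is why it typically lives in a small bootstrap file loaded first.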