Tracing (OpenTelemetry)
Distributed Tracing, Spans, and Context Propagation.
Distributed Tracing
In a monolith, a stack trace usually tells you what happened. In microservices, a single request might pass through a load balancer, Auth, Billing, and the database before it fails.
A stack trace shows only what happened inside the service that raised the error. Distributed tracing reconstructs the entire journey across every service the request touched.
Concepts
1. The Trace
The record of a whole request chain: every span produced while handling one request, all sharing the same trace ID. It is what tools like Jaeger render as the "Waterfall" view.
2. The Span
A single unit of work.
- Root Span: "GET /api/checkout" (Duration: 500ms).
- Child Span: "SQL Query: INSERT orders" (Duration: 50ms).
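To make the trace/span relationship concrete, here is a minimal sketch of the data model. The `Span` class, its field names, the `start_ms` offset, and the `waterfall` helper are assumptions made for illustration; they are not the OpenTelemetry SDK's types.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """One unit of work. Field names are illustrative, not the OTel SDK's."""
    trace_id: str             # shared by every span in the same trace
    span_id: str              # unique to this span
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: int             # offset from the start of the trace
    duration_ms: int


def waterfall(spans: list[Span]) -> list[str]:
    """Render spans as indented lines, parents before children."""
    children: dict[Optional[str], list[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines: list[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for span in sorted(children.get(parent_id, []), key=lambda s: s.start_ms):
            lines.append(f"{'  ' * depth}{span.name} ({span.duration_ms}ms)")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


trace = [
    Span("t1", "a", None, "GET /api/checkout", 0, 500),
    Span("t1", "b", "a", "SQL Query: INSERT orders", 120, 50),
]
print("\n".join(waterfall(trace)))
```

The parent/child links are what let a tracing UI nest the 50ms query inside the 500ms request.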
3. Context Propagation
How does Service B know it belongs to Service A's trace? Headers. Service A injects its trace ID and current span ID into the HTTP request it sends to B, and B reads them back out and attaches its own spans to the same trace.
- W3C Trace Context: The modern standard (traceparent header).
- B3: The older Zipkin standard (X-B3-TraceId).
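As a rough illustration of the inject/extract handshake, the sketch below builds and parses a W3C `traceparent` value, which has the shape `version-traceid-parentid-flags` (for example `00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01`). The `inject` and `extract` helpers are hypothetical names written for this example and only handle version `00`; in practice the OTel SDK's propagators do this for you.

```python
import re
import secrets

# version "00", 32-hex trace-id, 16-hex parent-id, 2-hex trace-flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(headers: dict, trace_id: str, span_id: str, sampled: bool = True) -> None:
    """Service A: write the current trace context into outgoing headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers: dict):
    """Service B: read the caller's context, or None if absent or malformed."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_span_id, flags = match.groups()
    # All-zero IDs are invalid per the spec.
    if set(trace_id) == {"0"} or set(parent_span_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


# Service A starts a trace and calls Service B.
outgoing = {}
inject(outgoing, trace_id=secrets.token_hex(16), span_id=secrets.token_hex(8))

# Service B continues the same trace with a new span of its own.
ctx = extract(outgoing)
child_span_id = secrets.token_hex(8)
```

Service B keeps the trace ID it received, records the caller's span ID as its parent, and generates a fresh span ID for its own work. The sketch assumes lowercase header keys; real HTTP header names are case-insensitive.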
OpenTelemetry (OTel)
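Historically, each vendor shipped its own proprietary agent (Datadog Agent, New Relic Agent), so instrumentation was tied to one backend. Now, the industry has largely standardized on OpenTelemetry, which is vendor-neutral.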
- OTel SDK: A library you import into your code (for example, the @opentelemetry/api package in Node.js).
- OTel Collector: A standalone binary that runs next to your app, receives traces, batches them, and sends them to your backend (Tempo, Honeycomb, Datadog).
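The "receive, batch, send" role of the Collector can be sketched as a toy pipeline. `BatchingPipeline`, `max_batch_size`, and the `export` callback are hypothetical names invented for this example; the real Collector is configured rather than coded, and it also flushes on a timer, which this sketch leaves out.

```python
from typing import Callable


class BatchingPipeline:
    """Toy stand-in for a Collector: receive spans, batch them, export them."""

    def __init__(self, export: Callable[[list], None], max_batch_size: int = 3):
        self._export = export              # e.g. a function that POSTs to a backend
        self._max_batch_size = max_batch_size
        self._buffer: list = []

    def receive(self, span: dict) -> None:
        self._buffer.append(span)
        if len(self._buffer) >= self._max_batch_size:
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._export(batch)


sent_batches = []                          # stands in for the tracing backend
pipeline = BatchingPipeline(export=sent_batches.append, max_batch_size=3)
for i in range(7):
    pipeline.receive({"name": f"span-{i}"})
pipeline.flush()                           # drain the remainder on shutdown
```

Batching trades a little latency for far fewer network calls: seven spans leave here as three requests instead of seven.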
Sampling Strategies
Tracing every single request is expensive (CPU + Storage).
- Head-Based Sampling: Decide at the start of the request, before you know how it will turn out. "Keep 10% of traces."
  - Pro: Simple.
  - Con: You might miss the error trace (the 1 in a million bug).
- Tail-Based Sampling: Buffer every span in memory at the Collector until the trace completes, then decide. If the trace has an error, keep it. If it was successful and fast, discard it.
  - Pro: You keep the interesting data.
  - Con: High memory usage on the Collector.
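The two strategies can be contrasted with a small sketch. Everything here is an assumption made for illustration: `head_sample`, `TailSampler`, the `slow_ms` threshold, and the dictionary span shape are hypothetical, and the ratio calculation is a simplification rather than the algorithm the OTel SDK uses.

```python
from collections import defaultdict


def head_sample(trace_id: str, ratio: float) -> bool:
    """Decide at the start of the request, using only the trace ID.

    Deriving the decision from the ID (rather than rolling a die) lets
    every service reach the same answer for the same trace.
    """
    bucket = int(trace_id[:16], 16) / 2**64   # roughly uniform for random IDs
    return bucket < ratio


class TailSampler:
    """Buffer whole traces in memory, then keep only the interesting ones."""

    def __init__(self, slow_ms: int = 1000):
        self._slow_ms = slow_ms
        self._pending = defaultdict(list)     # trace_id -> spans seen so far

    def add_span(self, span: dict) -> None:
        self._pending[span["trace_id"]].append(span)

    def finish_trace(self, trace_id: str):
        """Return the spans to keep, or None to discard the trace."""
        spans = self._pending.pop(trace_id, [])
        has_error = any(s.get("error") for s in spans)
        is_slow = any(s["duration_ms"] >= self._slow_ms for s in spans)
        return spans if (has_error or is_slow) else None


sampler = TailSampler(slow_ms=1000)
sampler.add_span({"trace_id": "ok", "duration_ms": 40})
sampler.add_span({"trace_id": "bad", "duration_ms": 35, "error": True})
kept = [t for t in ("ok", "bad") if sampler.finish_trace(t) is not None]
```

The memory cost of tail-based sampling is visible in `_pending`: every span of every in-flight trace sits there until the decision is made, whereas `head_sample` needs no state at all.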
Auto-Instrumentation
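You often don't need to change code. For Node.js, Java, and Python, OTel provides "Auto-Instrumentation" agents that patch common libraries such as HTTP clients at load time (monkey-patching in Node.js and Python, bytecode instrumentation in Java) to generate spans automatically.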
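The monkey-patching idea can be sketched in a few lines. `HttpClient`, `instrument`, and `recorded_spans` are hypothetical stand-ins written for this example; a real agent patches actual libraries and exports spans through the SDK instead of appending to a list.

```python
import time

recorded_spans = []   # where this sketch "exports" spans


class HttpClient:
    """Hypothetical HTTP library that knows nothing about tracing."""

    def get(self, url: str) -> str:
        return f"200 OK from {url}"


def instrument(cls, method_name: str) -> None:
    """Replace cls.method_name with a wrapper that records a span per call."""
    original = getattr(cls, method_name)

    def wrapper(self, url, *args, **kwargs):
        start = time.perf_counter()
        try:
            return original(self, url, *args, **kwargs)
        finally:
            # Runs whether the call returned or raised.
            recorded_spans.append({
                "name": f"HTTP {method_name.upper()} {url}",
                "duration_ms": (time.perf_counter() - start) * 1000,
            })

    setattr(cls, method_name, wrapper)


instrument(HttpClient, "get")   # done once at startup by the agent

# Application code is unchanged, yet a span is produced.
response = HttpClient().get("https://example.com/orders")
```

Because the patch is applied to the class at startup, every existing call site gains a span without being edited.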