Observability
Metrics (Prometheus)
Counters, Gauges, Histograms, and the Four Golden Signals.
Metrics
Logs tell you why it broke. Traces tell you where it broke. Metrics tell you that it broke.
Metrics are cheap. They are just numbers aggregated over time. You can store millions of data points for very little cost compared to logs.
The Data Model (Prometheus)
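Prometheus pulls (scrapes) data from your app: on a fixed interval it sends an HTTP request to an endpoint your app exposes (by convention /metrics) and stores the current value of each metric.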
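To make the pull model concrete, here is a minimal sketch of an app exposing its metrics for scraping, using only the Python standard library. The names COUNTERS, GAUGES, render_metrics, MetricsHandler, and serve are hypothetical, and the output is a simplified subset of the Prometheus text format. A real service would normally use an official client library instead.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process metric state, for illustration only.
COUNTERS = {"http_requests_total": 0}
GAUGES = {"memory_usage_bytes": 0}


def render_metrics(counters, gauges):
    """Render metrics as a simplified Prometheus text exposition."""
    lines = []
    for name, value in sorted(counters.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    for name, value in sorted(gauges.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Answers scrapes on /metrics with the current metric values."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(COUNTERS, GAUGES).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, serving scrapes on the given port (assumed value)."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() and pointing a scrape target at port 8000 would let Prometheus read these values. The app never pushes anything; it only answers when asked.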
Metric Types
- Counter: Only goes up, resetting to zero when the process restarts (e.g., http_requests_total). Useful for rates (rate(http_requests_total[5m])).
- Gauge: Goes up and down (e.g., memory_usage_bytes, active_goroutines).
- Histogram: Buckets data (e.g., http_request_duration_seconds). Critical for calculating percentiles (P95, P99).
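The sketch below shows roughly what happens when a counter is turned into a rate and when a percentile is estimated from histogram buckets. It is a simplified illustration, not Prometheus's actual implementation: the function names and sample data are hypothetical, and the real rate() also extrapolates to the edges of the time window.

```python
def per_second_rate(samples):
    """Average per-second increase of a counter.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter reset to zero (process restart),
        # so the whole current value counts as new increase.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper bound,
    ending with an open-ended (float("inf"), total_count) bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            in_bucket = count - lower_count
            if upper_bound == float("inf") or in_bucket == 0:
                # Nothing to interpolate within; report the lower edge.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical data: a counter that restarts once, and a latency histogram.
requests_total = [(0, 100), (60, 160), (120, 10), (180, 70)]
duration_buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (float("inf"), 100)]
rate = per_second_rate(requests_total)
p95 = estimate_quantile(0.95, duration_buckets)
```

Because the percentile is interpolated inside a bucket, its accuracy depends on how well the bucket boundaries match the latencies you actually see.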
The Four Golden Signals
The four golden signals come from Google's SRE book. If you measure nothing else, measure these.
- Latency: Time it takes to serve a request. Track the latency of successful and failed requests separately, since a fast error and a slow success mean different things.
- Traffic: Demand on the system (Req/sec).
- Errors: Rate of requests that fail (e.g., HTTP 5xx responses).
- Saturation: How "full" is the service? (CPU usage, Thread pool capacity).
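As a rough illustration, the four signals can be computed from a window of request records plus one saturation reading. Everything here is an assumption made for the example: the Request type, the golden_signals function, the use of worker occupancy as the saturation measure, and the use of means rather than percentiles for latency.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_seconds: float
    status: int  # HTTP status code


def _mean(values):
    return sum(values) / len(values) if values else 0.0


def golden_signals(requests, window_seconds, busy_workers, total_workers):
    """Summarize one time window into the four golden signals."""
    # Failures are taken to be HTTP 5xx responses, kept apart from successes
    # so fast errors cannot hide slow successful requests.
    ok = [r.duration_seconds for r in requests if r.status < 500]
    failed = [r.duration_seconds for r in requests if r.status >= 500]
    return {
        "traffic_rps": len(requests) / window_seconds,
        "error_ratio": len(failed) / len(requests) if requests else 0.0,
        "latency_ok_mean_s": _mean(ok),
        "latency_failed_mean_s": _mean(failed),
        "saturation": busy_workers / total_workers,
    }


# Hypothetical two-second window of traffic.
window = [
    Request(0.1, 200),
    Request(0.3, 200),
    Request(2.0, 500),
    Request(0.2, 404),
]
signals = golden_signals(window, window_seconds=2, busy_workers=3, total_workers=4)
```

In practice these values would come from counters, histograms, and gauges rather than raw request lists, but the definitions stay the same.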
Alerting Philosophy
"Page on Symptoms, not Causes."
- Bad Alert: "CPU is at 90%." (Maybe the machine is just working hard? Who cares if users are happy?)
- Good Alert: "High Error Rate" or "High Latency." (Users are suffering).
When the High Latency alert wakes you up, that is when you open the dashboard and see, "Oh, CPU is at 100%, that's the cause."
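A paging rule that follows this philosophy takes only user-facing symptoms as input. The sketch below is one possible shape for such a check. The function name and the thresholds (1% errors, 1 second P99) are hypothetical and would need tuning per service.

```python
def should_page(error_ratio, p99_latency_seconds,
                max_error_ratio=0.01, max_p99_seconds=1.0):
    """Return the list of symptom alerts that should page someone.

    CPU usage is deliberately not an input: it is a cause to investigate
    on the dashboard after a symptom alert fires, not a reason to page.
    """
    reasons = []
    if error_ratio > max_error_ratio:
        reasons.append("High Error Rate")
    if p99_latency_seconds > max_p99_seconds:
        reasons.append("High Latency")
    return reasons
```

With these defaults, should_page(0.05, 0.2) reports a high error rate, while a busy but healthy service with low errors and fast responses pages nobody.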