Metrics (Prometheus)

Counters, Gauges, Histograms, and the Four Golden Signals.

Metrics

Logs tell you why it broke. Traces tell you where it broke. Metrics tell you that it broke.

Metrics are cheap because they are just numbers aggregated over time. Storage cost grows with the number of distinct time series, not with request volume, so you can keep millions of data points for very little cost compared to logs.

The Data Model (Prometheus)

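Prometheus pulls (scrapes) data from your app. At a fixed interval it sends an HTTP GET to an endpoint your app exposes (conventionally /metrics) and stores each returned sample as a time series identified by a metric name and a set of labels.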
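To make the pull model concrete, here is a minimal sketch of the text a scrape might return, written in the Prometheus text exposition format. The helper render_metric and the sample values are hypothetical; a real service would normally use an official client library rather than build this string by hand.

```python
from typing import Dict, List, Tuple

# One sample is (labels, value). All names and numbers here are illustrative.
Sample = Tuple[Dict[str, str], float]


def render_metric(name: str, mtype: str, help_text: str, samples: List[Sample]) -> str:
    """Render one metric family in the Prometheus text exposition format.

    Simplification: label values are not escaped, so they must not contain
    quotes, backslashes, or newlines.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {mtype}"]
    for labels, value in samples:
        if labels:
            pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# What a GET on the metrics endpoint could return for one counter.
body = render_metric(
    "http_requests_total",
    "counter",
    "Total HTTP requests served.",
    [
        ({"method": "GET", "status": "200"}, 1027),
        ({"method": "GET", "status": "500"}, 3),
    ],
)
print(body)
```

Each distinct label combination becomes its own time series, which is why label cardinality drives storage cost.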

Metric Types

Counter: Only goes up, and resets to zero when the process restarts (e.g., http_requests_total). Query it as a rate rather than as a raw value (rate(http_requests_total[5m])).
Gauge: Goes up and down (e.g., memory_usage_bytes, active_goroutines).
Histogram: Counts observations into buckets (e.g., http_request_duration_seconds). Critical for estimating percentiles (P95, P99) with histogram_quantile().
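The arithmetic behind counters and histograms can be sketched in a few lines. The functions below are simplified illustrations, not the Prometheus implementation: simple_rate ignores the window-edge extrapolation that the real rate() performs, and histogram_quantile assumes cumulative buckets that start at a lower bound of 0 and end with a +Inf bucket. The sample data is made up.

```python
import math
from typing import List, Tuple


def simple_rate(samples: List[Tuple[float, float]]) -> float:
    """Per-second increase of a counter given (timestamp_s, value) samples.

    A drop in value is treated as a counter reset (process restart), so the
    new value is counted as the increase since the reset.
    """
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Observations are assumed to be spread evenly inside each bucket, so the
    result is a linear interpolation, not an exact percentile.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank and count > prev_count:
            if math.isinf(bound):
                # The quantile falls in the overflow bucket: the best we can
                # report is the highest finite bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Counter scraped every 60 s; the process restarted before the third scrape.
counter_samples = [(0, 100), (60, 160), (120, 10), (180, 70)]
print(simple_rate(counter_samples))  # 130 requests over 180 s

# Cumulative latency buckets in seconds: 50 requests took <= 0.1 s, and so on.
latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(histogram_quantile(0.95, latency_buckets))  # about 0.8125
print(histogram_quantile(0.99, latency_buckets))  # 1.0, capped at the last finite bound
```

The P99 result shows why bucket boundaries matter: once a quantile lands in the +Inf bucket, the estimate can only say "at least 1.0 s".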

The Four Golden Signals

These come from Google's Site Reliability Engineering book. If you measure nothing else, measure these.

Latency: Time it takes to serve a request. Track successful and failed requests separately, because a fast error can make the overall numbers look healthy.
Traffic: Demand on the system (requests per second).
Errors: Rate of requests that fail (e.g., HTTP 5xx responses).
Saturation: How "full" the service is (CPU usage, thread pool capacity).
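As a rough illustration, the four signals can be computed from a window of request records plus one capacity reading. Everything here is an assumption made for the sketch: the Request record, the golden_signals function, the nearest-rank percentile, and the use of a worker pool as the saturation measure. In practice these values would come from Prometheus queries, not from in-process lists.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Request:
    duration_s: float
    status: int  # HTTP status code


def percentile(values: List[float], q: float) -> Optional[float]:
    """Nearest-rank percentile; returns None when there is no data."""
    if not values:
        return None
    ordered = sorted(values)
    index = max(math.ceil(q * len(ordered)) - 1, 0)
    return ordered[index]


def golden_signals(
    requests: List[Request], window_s: float, busy_workers: int, total_workers: int
) -> Dict[str, Optional[float]]:
    ok = [r.duration_s for r in requests if r.status < 500]
    failed = [r.duration_s for r in requests if r.status >= 500]
    return {
        # Latency, split so that fast failures cannot hide slow successes.
        "latency_p95_ok_s": percentile(ok, 0.95),
        "latency_p95_failed_s": percentile(failed, 0.95),
        # Traffic: requests per second over the window.
        "traffic_rps": len(requests) / window_s,
        # Errors: fraction of requests that returned a 5xx status.
        "error_ratio": len(failed) / len(requests) if requests else 0.0,
        # Saturation: how much of the worker pool is in use.
        "saturation": busy_workers / total_workers,
    }


# Ten requests in a 10-second window: eight fast, one slow, one fast failure.
window = [Request(0.1, 200)] * 8 + [Request(0.4, 200), Request(0.02, 500)]
signals = golden_signals(window, window_s=10, busy_workers=6, total_workers=8)
for name, value in signals.items():
    print(name, value)
```

Note how the failed request's 0.02 s latency would pull the overall figure down if it were mixed in with the successes.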

Alerting Philosophy

"Page on Symptoms, not Causes."

Bad Alert: "CPU is at 90%." (Maybe the machine is just working hard? If users are happy, who cares?)
Good Alert: "High Error Rate" or "High Latency." (Users are suffering).

When the High Latency alert wakes you up, you then open the dashboard and see: "Oh, CPU is at 100%, that's the cause."
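The same rule can be sketched as a paging decision. The thresholds, the Snapshot record, and the alert names below are hypothetical; in a real setup this logic would live in Prometheus alerting rules. The point is only that CPU is recorded for diagnosis and never triggers a page on its own.

```python
from typing import List, NamedTuple


class Snapshot(NamedTuple):
    error_ratio: float      # fraction of requests failing
    p99_latency_s: float    # 99th percentile latency in seconds
    cpu_utilization: float  # 0.0 to 1.0, kept for dashboards only


# Illustrative thresholds; pick real ones from your service level objectives.
MAX_ERROR_RATIO = 0.01
MAX_P99_LATENCY_S = 0.5


def paging_alerts(s: Snapshot) -> List[str]:
    """Return the alerts that should page a human. Only symptoms qualify."""
    alerts = []
    if s.error_ratio > MAX_ERROR_RATIO:
        alerts.append("HighErrorRate")
    if s.p99_latency_s > MAX_P99_LATENCY_S:
        alerts.append("HighLatency")
    # cpu_utilization is deliberately not checked: it is a cause, not a symptom.
    return alerts


busy_but_healthy = Snapshot(error_ratio=0.001, p99_latency_s=0.2, cpu_utilization=0.9)
users_suffering = Snapshot(error_ratio=0.002, p99_latency_s=1.8, cpu_utilization=1.0)

print(paging_alerts(busy_but_healthy))  # [] : nobody gets woken up
print(paging_alerts(users_suffering))   # ['HighLatency']
```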
