Engineering Playbook
Observability

Logs (ELK / Loki)

Structured Logging, Correlation IDs, and Cost Management.

Logging

Logs are the highest-fidelity telemetry you have, but also the most expensive to ship, index, and store.

Structured Logging (JSON)

Stop writing console.log("User failed login " + user.id). Free-text logs can only be searched by substring or regex, and those queries break as soon as someone rewords the message.

Do this:

{
  "level": "error",
  "msg": "User failed login",
  "userId": "123",
  "reason": "bad_password",
  "traceId": "abc-999"
}

Now you can query: count(*) where reason="bad_password".
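A minimal sketch of a logger that produces lines like the one above, one JSON object per line. The function name logEvent and the extra time field are illustrative assumptions, not part of any library; in practice a structured logging library would normally do this job.

```javascript
// Minimal structured logger sketch: one JSON object per line.
// `logEvent` is a hypothetical helper, not a library API.
function logEvent(level, msg, fields = {}, write = (line) => process.stdout.write(line + "\n")) {
  const entry = {
    time: new Date().toISOString(), // assumption: timestamp added for querying by time
    level,
    msg,
    ...fields, // caller-supplied fields (userId, reason, traceId, ...) are merged in last
  };
  const line = JSON.stringify(entry);
  write(line);
  return line;
}

// Usage: the same event as the JSON example above.
logEvent("error", "User failed login", {
  userId: "123",
  reason: "bad_password",
  traceId: "abc-999",
});
```

Because every field is a key, the log backend can filter and aggregate on reason or userId without parsing the message text.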


Correlation IDs

In microservices, a request hits Nginx -> Auth Service -> Backend -> Database. If the Database fails, how do you find the Nginx log that started it?

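The Trace ID: Generate a UUID at the Load Balancer and send it in a request header (X-Request-ID is the common convention). Every downstream service forwards the header unchanged on its own outbound calls and includes the ID in every log line, like the traceId field in the example above. Searching for that one ID then returns the full path of the request across all services.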
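Inside a Node.js service, one way to carry the ID without threading it through every function call is AsyncLocalStorage. The sketch below assumes the header arrives as x-request-id (Node lowercases header names) and that the service mints an ID only when the upstream hop did not set one. All function names are hypothetical.

```javascript
const { randomUUID } = require("node:crypto");
const { AsyncLocalStorage } = require("node:async_hooks");

// Holds the current request's ID for the duration of its async call chain.
const requestContext = new AsyncLocalStorage();

// Reuse the incoming ID if an upstream hop already set one; otherwise mint one.
function ensureRequestId(headers) {
  return headers["x-request-id"] || randomUUID();
}

// Run `handler` with the request ID visible to everything it calls.
function withRequestId(headers, handler) {
  return requestContext.run(ensureRequestId(headers), handler);
}

// Current ID, or undefined when called outside a request.
function currentRequestId() {
  return requestContext.getStore();
}

// Headers to attach to every outbound call so the next hop logs the same ID.
function outboundHeaders(extra = {}) {
  return { ...extra, "x-request-id": currentRequestId() };
}

// Log line that always carries the ID.
function logWithTrace(level, msg, fields = {}) {
  return JSON.stringify({ level, msg, traceId: currentRequestId(), ...fields });
}

// Usage: simulate a request that arrived with an ID from the load balancer.
withRequestId({ "x-request-id": "abc-999" }, () => {
  console.log(logWithTrace("error", "DB query failed"));
  // prints {"level":"error","msg":"DB query failed","traceId":"abc-999"}
});
```

In a real server, withRequestId would wrap the request handler (for example in middleware), and outboundHeaders would be merged into every HTTP client call.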


Tools

  • ELK (Elasticsearch, Logstash, Kibana): The heavyweight champion. Powerful full-text search because every field is indexed. Expensive to run and hard to manage (JVM heap tuning, cluster sizing).
  • Loki (Grafana): The modern contender. It doesn't index the text of the log, only the labels (app=frontend), and scans the matching streams at query time. Much cheaper to store, acts like grep for the cloud.

Log Costs

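Logging is one of the most common causes of surprise cloud bills, because volume grows with traffic and every line is paid for at ingest, in the index, and in storage.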

Strategy:

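  1. Sample: When "Info" logs are enabled in Prod, keep only a fraction of them (for example 10%).
  2. Levels: Only log "Error" and "Warn" by default. Flip "Info" or "Debug" on dynamically, without a redeploy, only when investigating.
  3. Retention: Keep logs hot and searchable for 7 days, then archive them to cheap object storage (for example an S3 Glacier storage class) for as long as compliance requires.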
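Items 1 and 2 can be sketched as a single gate that every log call passes through. This is one possible reading of the strategy, not a definitive implementation: the default level is warn, the level can be changed at runtime, and info lines are sampled at 10% once they are enabled. The names LEVELS, logConfig, setMinLevel, and shouldLog are hypothetical.

```javascript
// Numeric ranking so levels can be compared.
const LEVELS = { debug: 10, info: 20, warn: 30, error: 40 };

// Mutable at runtime. In a real service this would be updated from a config
// store or an admin endpoint; that wiring is out of scope for this sketch.
const logConfig = {
  minLevel: "warn",    // default: only warn and error are emitted
  infoSampleRate: 0.1, // keep roughly 10% of info lines once info is enabled
};

// Change the threshold without restarting the process.
function setMinLevel(level) {
  if (!(level in LEVELS)) throw new Error(`unknown level: ${level}`);
  logConfig.minLevel = level;
}

// Decide whether a line should be emitted. `random` is injectable for tests.
function shouldLog(level, random = Math.random) {
  if (LEVELS[level] < LEVELS[logConfig.minLevel]) return false;
  if (level === "info") return random() < logConfig.infoSampleRate;
  return true; // warn, error, and enabled debug lines are never sampled
}

// Usage: gate a log call.
if (shouldLog("error")) {
  console.log(JSON.stringify({ level: "error", msg: "payment failed" }));
}
```

Sampling is applied only to info here so that errors and warnings are never dropped. If all lines of a sampled request need to stay together, the decision could be made once per trace ID instead of once per line.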