Observability
Logs (ELK / Loki)
Structured Logging, Correlation IDs, and Cost Management.
Logging
Logs are the highest-fidelity telemetry you have, but also the most expensive to ship, index, and store.
Structured Logging (JSON)
Stop writing console.log("User failed login " + user.id).
Free-text logs are hard to query: every question turns into a fragile regex over message strings.
Emit one JSON object per event instead:
{
  "level": "error",
  "msg": "User failed login",
  "userId": "123",
  "reason": "bad_password",
  "traceId": "abc-999"
}
Now you can query: count(*) where reason="bad_password".
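As a rough illustration, a logger that produces the JSON shape above can be very small. This is a minimal sketch, not a recommendation of a specific library: `formatLog` and `logError` are hypothetical helper names, and writing one JSON object per line to stderr is an assumption about how your log shipper collects output.

```javascript
// Build one log entry as a single-line JSON string.
// level and msg are reserved: same-named keys in `fields` are dropped.
function formatLog(level, msg, fields = {}) {
  const { level: _level, msg: _msg, ...rest } = fields;
  return JSON.stringify({ level, msg, ...rest });
}

// Convenience wrapper for error-level events (hypothetical helper).
function logError(msg, fields) {
  console.error(formatLog("error", msg, fields));
}

// Same event as the string-concatenation example, but queryable by field.
const user = { id: "123" };
logError("User failed login", {
  userId: user.id,
  reason: "bad_password",
  traceId: "abc-999",
});
```

Keeping the message text constant ("User failed login") and moving the variable parts into fields is what makes a count by `reason` possible.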
Correlation IDs
In microservices, a request hits Nginx -> Auth Service -> Backend -> Database. If the Database fails, how do you find the Nginx log that started it?
The Trace ID:
Generate a UUID at the Load Balancer (X-Request-ID). Pass this header to every downstream service. Include it in every log line.
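The propagation rule can be sketched in a few functions. This is an illustrative sketch under assumptions: the helper names (`getRequestId`, `downstreamHeaders`, `requestLogger`) are hypothetical, and the fallback that mints a UUID inside the service when the header is missing is an addition for requests that bypass the load balancer.

```javascript
const { randomUUID } = require("node:crypto");

// Node's HTTP server lowercases incoming header names.
const HEADER = "x-request-id";

// Reuse the ID set upstream; mint one only if this hop is the first to see the request.
function getRequestId(incomingHeaders) {
  return incomingHeaders[HEADER] || randomUUID();
}

// Headers to attach to every downstream call made while handling this request.
function downstreamHeaders(requestId, extra = {}) {
  return { ...extra, [HEADER]: requestId };
}

// Logger bound to one request, so every line it produces carries the same traceId.
function requestLogger(requestId) {
  return (level, msg, fields = {}) =>
    JSON.stringify({ level, msg, ...fields, traceId: requestId });
}

// Example: the Backend handling a request that arrived from the Auth Service.
const requestId = getRequestId({ "x-request-id": "abc-999" });
const log = requestLogger(requestId);
console.log(log("error", "Database query failed", { table: "users" }));
console.log(downstreamHeaders(requestId, { accept: "application/json" }));
```

With this in place, searching for traceId "abc-999" returns the lines from every service that touched the request, including the first hop.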
Tools
- ELK (Elasticsearch, Logstash, Kibana): The heavyweight champion. Powerful full-text search, but expensive to run and hard to operate (Elasticsearch runs on the JVM, so heap sizing and tuning become your problem).
- Loki (Grafana): The modern contender. It doesn't index the text of the log, only the labels (app=frontend). Much cheaper; acts like grep for the cloud.
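To make the indexing difference concrete, here is a toy model of a label-indexed store. It is not Loki's implementation or API, only a sketch of the idea: `LabelIndexedStore` is a hypothetical class, and requiring an exact match on the full label set is a simplification.

```javascript
// Toy model: labels are the only index; log text is stored unindexed.
class LabelIndexedStore {
  constructor() {
    this.streams = new Map(); // canonical label key -> array of raw log lines
  }

  // Canonical key so { app, env } and { env, app } map to the same stream.
  static keyFor(labels) {
    return JSON.stringify(Object.entries(labels).sort());
  }

  // Writing is cheap: no tokenizing or indexing of the line itself.
  push(labels, line) {
    const key = LabelIndexedStore.keyFor(labels);
    if (!this.streams.has(key)) this.streams.set(key, []);
    this.streams.get(key).push(line);
  }

  // Pick the stream by labels, then scan its lines for a substring (the grep step).
  query(labels, needle) {
    const lines = this.streams.get(LabelIndexedStore.keyFor(labels)) || [];
    return lines.filter((line) => line.includes(needle));
  }
}

const store = new LabelIndexedStore();
store.push({ app: "frontend" }, '{"level":"error","reason":"bad_password"}');
store.push({ app: "frontend" }, '{"level":"info","msg":"page rendered"}');
store.push({ app: "backend" }, '{"level":"error","reason":"db_timeout"}');
console.log(store.query({ app: "frontend" }, "bad_password"));
```

The trade-off shows up in `query`: narrow label selectors keep the scan small, while a broad selector means grepping through a lot of text.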
Log Costs
Logging is one of the most common causes of surprise cloud bills.
Strategy:
- Sample: Only log 10% of "Info" logs in Prod.
- Levels: Only log "Error" and "Warn" by default. Flip "Debug" on dynamically only when investigating.
- Retention: Keep logs hot (searchable) for 7 days, then archive them to S3 (Glacier) for compliance.
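The sampling and level rules above can be combined into a single gate that runs before a log line is written. This is a sketch under assumptions: `config`, `setMinLevel`, and `shouldLog` are hypothetical names, the numeric level values are arbitrary, and how `setMinLevel` gets triggered in production (admin endpoint, config reload, feature flag) is left open.

```javascript
// Severity order; the numbers only need to be increasing.
const LEVELS = Object.freeze({ debug: 10, info: 20, warn: 30, error: 40 });

// Mutable at runtime so the threshold can be changed without a redeploy.
const config = { minLevel: "warn", infoSampleRate: 0.1 };

function setMinLevel(level) {
  if (typeof LEVELS[level] !== "number") {
    throw new Error(`unknown level: ${level}`);
  }
  config.minLevel = level;
}

// `random` is injectable so the sampling decision can be tested deterministically.
function shouldLog(level, random = Math.random) {
  // Drop anything below the current threshold.
  if (LEVELS[level] < LEVELS[config.minLevel]) return false;
  // Info lines that pass the threshold are sampled at infoSampleRate.
  if (level === "info") return random() < config.infoSampleRate;
  return true;
}

console.log(shouldLog("error")); // true: errors are always kept
console.log(shouldLog("debug")); // false: below the default "warn" threshold
setMinLevel("debug");            // turn on while investigating an incident
console.log(shouldLog("debug")); // true
setMinLevel("warn");             // restore the default afterwards
```

Errors and warnings are never sampled here, since those are the lines you need complete when something breaks.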