Observability and Incident Response
If a service fails and nobody can detect or diagnose the failure quickly, outages last longer and reliability suffers.
Three pillars
- Logs: explain what happened
- Metrics: show trends and whether thresholds are crossed
- Traces: show cross-service request paths
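A minimal sketch of how one request might emit all three signals, using only the Python standard library. The names `handle_request`, `request_counts`, and `latencies_ms` are hypothetical, and the in-memory stores stand in for whatever logging, metrics, or tracing backend a real service would use.

```python
import json
import time
import uuid
from collections import Counter

# Hypothetical in-memory metric stores (assumption: a real service
# would export these to a metrics backend instead).
request_counts = Counter()
latencies_ms = []


def handle_request(route, trace_id=None):
    """Handle one request and emit a trace ID, metrics, and a log record."""
    # Trace: reuse the caller's trace ID if one was passed in, so the
    # request path can be joined across services; otherwise start a new one.
    trace_id = trace_id or uuid.uuid4().hex
    span_id = uuid.uuid4().hex[:16]

    start = time.perf_counter()
    status = 200  # placeholder for the real work of the handler
    duration_ms = (time.perf_counter() - start) * 1000.0

    # Metrics: aggregate numbers that show trends over many requests.
    request_counts[(route, status)] += 1
    latencies_ms.append(duration_ms)

    # Log: one structured record explaining what happened to this request.
    record = {
        "event": "request_handled",
        "route": route,
        "status": status,
        "duration_ms": duration_ms,
        "trace_id": trace_id,
        "span_id": span_id,
    }
    print(json.dumps(record))
    return record
```

Carrying the same `trace_id` in the log record is one way to link a single log line back to the cross-service path it belongs to.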
Incident basics
- Define service level objectives (SLOs).
- Alert on user-impacting symptoms.
- Keep runbooks for common failures.
- Write postmortems with clear action items.
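The first two items above can be sketched as a small error-budget calculation. This is an illustration under stated assumptions, not a prescribed alerting policy: the function names are hypothetical, the SLO is assumed to be a request success ratio, and the default paging threshold of 10.0 is an arbitrary placeholder to tune per service.

```python
def error_budget(slo_target, total_requests):
    """Number of failed requests the SLO tolerates for this request volume."""
    return (1.0 - slo_target) * total_requests


def burn_rate(slo_target, failed, total):
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the budget is being spent exactly as fast as
    the SLO permits; higher values mean it will run out early.
    """
    if total == 0:
        return 0.0  # no traffic, so nothing is being burned
    return (failed / total) / (1.0 - slo_target)


def should_page(slo_target, failed, total, threshold=10.0):
    """Page only when users are affected far faster than the SLO allows.

    threshold is a hypothetical default, not a recommended value.
    """
    return burn_rate(slo_target, failed, total) >= threshold
```

With a 99.9% target, 200 failures out of 10,000 requests is roughly 20 times the allowed error rate and would page under the placeholder threshold, while 5 failures out of 10,000 would not. Alerting on the failure ratio keeps the page tied to a user-impacting symptom and not to an internal cause such as CPU load.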