# Incident Response
How to respond to production incidents.
## Incident Severity Levels
| Level | Description | Response Time | Examples |
|---|---|---|---|
| P1 - Critical | Service down, data loss | < 15 min | Complete outage, security breach |
| P2 - High | Major feature broken | < 1 hour | Payment processing failing |
| P3 - Medium | Degraded performance | < 4 hours | Slow response times |
| P4 - Low | Minor issue | < 24 hours | UI glitch, typo |
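This table maps naturally onto tooling for alerting and ticketing. A minimal sketch using only the Python standard library; the `Severity` and `RESPONSE_SLA` names are illustrative, not part of any existing codebase:

```python
from datetime import timedelta
from enum import Enum


class Severity(Enum):
    """Incident severity levels, mirroring the table above."""
    P1 = "Critical"  # Service down, data loss
    P2 = "High"      # Major feature broken
    P3 = "Medium"    # Degraded performance
    P4 = "Low"       # Minor issue


# Maximum time allowed before the first response, per severity.
RESPONSE_SLA = {
    Severity.P1: timedelta(minutes=15),
    Severity.P2: timedelta(hours=1),
    Severity.P3: timedelta(hours=4),
    Severity.P4: timedelta(hours=24),
}
```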
## Incident Response Steps

### Step 1: Detect

**Automated Detection:**

- Monitoring alerts fire
- Health checks fail
- Error-rate thresholds are exceeded
**Manual Detection:**
- User reports
- Support tickets
- Internal observation
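To make the automated path concrete, here is a hedged sketch of an error-rate watchdog. `fetch_error_rate()` and `page_on_call()` are hypothetical hooks standing in for your metrics backend and paging system, and the 5% threshold is only an example value:

```python
import time

ERROR_RATE_THRESHOLD = 0.05   # Example value: alert when >5% of requests fail.
CHECK_INTERVAL_SECONDS = 60


def fetch_error_rate() -> float:
    """Hypothetical hook: return the fraction of failed requests
    over the last interval, e.g. from your metrics backend."""
    raise NotImplementedError


def page_on_call(message: str) -> None:
    """Hypothetical hook: forward an alert to your paging system."""
    raise NotImplementedError


def watch_error_rate() -> None:
    """Poll the error rate and page on-call when the threshold is exceeded."""
    while True:
        rate = fetch_error_rate()
        if rate > ERROR_RATE_THRESHOLD:
            page_on_call(f"Error rate {rate:.1%} exceeds {ERROR_RATE_THRESHOLD:.1%}")
        time.sleep(CHECK_INTERVAL_SECONDS)
```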
### Step 2: Triage
Assess severity and impact:
| Question | Answer |
|---|---|
| How many users affected? | All / Many / Few |
| Is data at risk? | Yes / No |
| Are transactions failing? | Yes / No |
| Is there a workaround? | Yes / No |
Assign a severity level based on the answers.
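One plausible way to turn those answers into a level, sketched in Python; the mapping below is a policy choice for your team, not a standard, and the function name is illustrative:

```python
def assign_severity(users_affected: str, data_at_risk: bool,
                    transactions_failing: bool, workaround_exists: bool) -> str:
    """Map triage answers to a severity level ("P1".."P4").

    `users_affected` is "all", "many", or "few". This is one plausible
    policy; tune the rules to your own risk tolerance.
    """
    if data_at_risk or users_affected == "all":
        return "P1"
    if transactions_failing or users_affected == "many":
        return "P2"
    if not workaround_exists:
        return "P3"
    return "P4"
```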
### Step 3: Communicate

**For P1/P2:**
## Incident Update
**Status:** Investigating
**Impact:** [Description of user impact]
**Start Time:** [Time] UTC
**Next Update:** [Time] UTC
We are aware of issues with [service] and are actively investigating.
Update the status page and notify stakeholders.
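As a sketch of how the update above might be rendered and pushed automatically: the endpoint URL, payload shape, and `post_incident_update` helper are placeholders to adapt to your status-page provider's API.

```python
from datetime import datetime, timezone

import requests  # third-party HTTP client: pip install requests

STATUS_PAGE_URL = "https://status.example.com/api/incidents"  # placeholder endpoint


def post_incident_update(service: str, impact: str, next_update_minutes: int = 30) -> None:
    """Render the incident-update template and push it to a status page."""
    now = datetime.now(timezone.utc)
    body = (
        "## Incident Update\n"
        "**Status:** Investigating\n"
        f"**Impact:** {impact}\n"
        f"**Start Time:** {now:%H:%M} UTC\n"
        f"**Next Update:** in {next_update_minutes} minutes\n\n"
        f"We are aware of issues with {service} and are actively investigating."
    )
    requests.post(STATUS_PAGE_URL, json={"status": "investigating", "body": body}, timeout=10)
```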
### Step 4: Respond

**Immediate Actions:**
| Action | When to Use |
|---|---|
| Rollback | Deployment caused issue |
| Disable feature flag | Feature causing issue |
| Scale resources | Capacity issue |
| Restart services | Transient failure |
| Failover | Region/zone failure |
**Do NOT:**

- Make changes without tracking them (see the logging sketch after this list)
- Skip communication updates
- Work in isolation
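A minimal sketch of the kind of tracking the first rule asks for: an append-only action log every responder writes to. The file path and `log_action` helper are illustrative; in practice the log might live in ticket comments or a chat bot.

```python
import json
from datetime import datetime, timezone

ACTION_LOG = "incident_actions.jsonl"  # illustrative path for an append-only log


def log_action(actor: str, action: str, details: str = "") -> None:
    """Append a timestamped record of a change made during the incident.

    These records feed the post-mortem timeline and keep responders from
    working in isolation.
    """
    record = {
        "time": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        "action": action,
        "details": details,
    }
    with open(ACTION_LOG, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


# Example: log_action("alice", "rollback", "Reverted deploy 42 via CI job")
```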
### Step 5: Resolve

When the issue is fixed:

- **Verify** - Confirm metrics have returned to normal
- **Monitor** - Watch for recurrence (30+ minutes)
- **Update status** - Mark the incident as resolved
- **Notify** - Inform all stakeholders
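A sketch of the verify-and-monitor step, reusing the hypothetical `fetch_error_rate()` metrics hook from the detection sketch (passed in as a parameter here so the snippet stands alone):

```python
import time
from typing import Callable


def verify_recovery(fetch_error_rate: Callable[[], float],
                    baseline_error_rate: float, minutes: int = 30) -> bool:
    """Watch the error rate for `minutes` before declaring the incident resolved.

    Returns False as soon as the rate regresses above the baseline, which
    should re-open the incident rather than closing it.
    """
    deadline = time.monotonic() + minutes * 60
    while time.monotonic() < deadline:
        if fetch_error_rate() > baseline_error_rate:
            return False  # recurrence detected
        time.sleep(60)
    return True
```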
### Step 6: Post-Mortem

Conduct a post-mortem within 48 hours:
# Incident Post-Mortem
**Date:** YYYY-MM-DD
**Duration:** [X hours]
**Severity:** P1
**Lead:** [Name]
## Summary
[1-2 sentence description]
## Timeline
| Time | Event |
|------|-------|
| 14:00 | Alert triggered |
| 14:05 | On-call acknowledged |
| 14:15 | Root cause identified |
| 14:30 | Fix deployed |
| 14:45 | Incident resolved |
## Root Cause
[Technical explanation of what went wrong]
## Impact
- [Users affected]
- [Revenue impact if any]
- [Data impact if any]
## What Went Well
- [Positive observations]
## What Went Wrong
- [Areas for improvement]
## Action Items
| Action | Owner | Deadline |
|--------|-------|----------|
| [Action] | [Name] | [Date] |
## Lessons Learned
[Key takeaways for the team]
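If responders kept the action log from Step 4, the timeline section can be generated rather than reconstructed from memory. A sketch, assuming the JSON-lines format used in that earlier snippet:

```python
import json


def timeline_from_action_log(path: str = "incident_actions.jsonl") -> str:
    """Render the incident action log as a markdown timeline table."""
    rows = ["| Time | Event |", "|------|-------|"]
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            hhmm = record["time"][11:16]  # ISO-8601 timestamp -> HH:MM
            rows.append(f"| {hhmm} | {record['actor']}: {record['action']} |")
    return "\n".join(rows)
```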
## On-Call Responsibilities

| Responsibility | Expectation |
|---|---|
| Acknowledge alerts | < 5 min |
| Initial triage | < 10 min |
| Escalate if needed | When blocked |
| Document actions | All changes logged |
| Hand off properly | Brief next person |
## Communication Templates

**Initial Alert:**
We're investigating reports of [issue]. More updates to follow.

**Identified:**
We've identified the issue as [cause]. Working on a fix.

**Monitoring:**
A fix has been deployed. Monitoring for stability.

**Resolved:**
The issue has been resolved. Services are operating normally.
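Keeping these templates in tooling helps updates stay consistent under pressure. A sketch; the dictionary keys and `render_update` helper are illustrative names:

```python
# Message templates keyed by incident state; placeholders use str.format syntax.
TEMPLATES = {
    "investigating": "We're investigating reports of {issue}. More updates to follow.",
    "identified": "We've identified the issue as {cause}. Working on a fix.",
    "monitoring": "A fix has been deployed. Monitoring for stability.",
    "resolved": "The issue has been resolved. Services are operating normally.",
}


def render_update(state: str, **details: str) -> str:
    """Fill in a template, e.g. render_update("identified", cause="a bad config push")."""
    return TEMPLATES[state].format(**details)
```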
## Next Steps
- Rollback Procedures - Execute emergency rollbacks
- Live Monitoring - CD Model Stage 11
- Release Quality Thresholds - Quality gate criteria
- Deployment Strategies - Deployment patterns and approaches