Incident Response
How to respond to production incidents effectively.
Severity Levels
| Level | Description | Response Time Target (example SLA) | Examples |
|---|---|---|---|
| P1 - Critical | Service down, data loss | < 15 min | Complete outage, security breach |
| P2 - High | Major feature broken | < 1 hour | Payment processing failing |
| P3 - Medium | Degraded performance | < 4 hours | Slow response times |
| P4 - Low | Minor issue | < 24 hours | UI glitch, typo |
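The response-time targets above can be encoded so that tooling can compute a deadline for each incident. The following is a minimal sketch, not a definitive implementation: the names `Severity`, `RESPONSE_TARGETS`, `response_deadline`, and `is_breached` are hypothetical, and only the durations come from the table.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class Severity(Enum):
    P1 = "P1 - Critical"
    P2 = "P2 - High"
    P3 = "P3 - Medium"
    P4 = "P4 - Low"


# Response-time targets copied from the severity table.
RESPONSE_TARGETS = {
    Severity.P1: timedelta(minutes=15),
    Severity.P2: timedelta(hours=1),
    Severity.P3: timedelta(hours=4),
    Severity.P4: timedelta(hours=24),
}


def response_deadline(severity: Severity, detected_at: datetime) -> datetime:
    """Return the time by which a responder should have engaged."""
    return detected_at + RESPONSE_TARGETS[severity]


def is_breached(severity: Severity, detected_at: datetime, now: datetime) -> bool:
    """The table uses strict '<' targets, so reaching the deadline counts as a breach."""
    return now >= response_deadline(severity, detected_at)


if __name__ == "__main__":
    detected = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(response_deadline(Severity.P1, detected).isoformat())
```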
Response Steps
1. Detect
- Automated: Monitoring alerts, health check failures, error rate thresholds
- Manual: User reports, support tickets, internal observation
2. Triage
| Question | Impact Assessment |
|---|---|
| How many users affected? | All / Many / Few |
| Is data at risk? | Yes / No |
| Are transactions failing? | Yes / No |
| Is there a workaround? | Yes / No |
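The triage questions can feed a first-pass severity suggestion. The sketch below is illustrative only: the mapping from answers to a level is an assumption and is not defined by this guide, so adjust the rules to your own policy. `TriageAnswers` and `suggest_severity` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriageAnswers:
    users_affected: str         # "all", "many", or "few"
    data_at_risk: bool
    transactions_failing: bool
    workaround_exists: bool


def suggest_severity(answers: TriageAnswers) -> str:
    """Suggest a severity level. The rules here are ASSUMED, not prescribed."""
    if answers.users_affected not in {"all", "many", "few"}:
        raise ValueError(f"unexpected users_affected: {answers.users_affected!r}")
    # Assumption: data risk or a full-population impact is always critical.
    if answers.data_at_risk or answers.users_affected == "all":
        return "P1"
    # Assumption: failing transactions are at least high severity.
    if answers.transactions_failing:
        return "P2"
    # Assumption: a workaround lowers the level by one step.
    if answers.users_affected == "many":
        return "P3" if answers.workaround_exists else "P2"
    return "P4" if answers.workaround_exists else "P3"
```

A human should still confirm the suggested level; the function only makes the questions explicit and repeatable.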
3. Communicate
Update the status page immediately for P1 and P2 incidents. Each update should include the current status, the impact, the start time, and the time of the next update.
4. Respond
| Action | When |
|---|---|
| Rollback | Deployment caused issue |
| Disable feature flag | Feature causing issue |
| Scale resources | Capacity issue |
| Restart services | Transient failure |
| Failover | Region/zone failure |
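The action table reads naturally as a lookup from suspected cause to first mitigation. The sketch below assumes hypothetical names (`Cause`, `MITIGATIONS`, `first_mitigation`); the pairs themselves are taken from the table.

```python
from enum import Enum


class Cause(Enum):
    DEPLOYMENT = "deployment caused issue"
    FEATURE = "feature causing issue"
    CAPACITY = "capacity issue"
    TRANSIENT = "transient failure"
    REGION_FAILURE = "region/zone failure"


# Cause -> action pairs copied from the table.
MITIGATIONS = {
    Cause.DEPLOYMENT: "Rollback",
    Cause.FEATURE: "Disable feature flag",
    Cause.CAPACITY: "Scale resources",
    Cause.TRANSIENT: "Restart services",
    Cause.REGION_FAILURE: "Failover",
}


def first_mitigation(cause: Cause) -> str:
    """Return the first action to try for a suspected cause."""
    return MITIGATIONS[cause]
```

Keeping the mapping in data rather than in branching logic makes it easy to review alongside the runbook.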
5. Resolve
- Verify: Confirm metrics have returned to normal
- Monitor: Watch for recurrence (30+ minutes)
- Update status: Mark the incident resolved
- Notify: Inform all stakeholders
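The verify and monitor steps can be expressed as a single check: enough time has passed since the fix, and every metric sample since then is within bounds. This is a minimal sketch. The function name, the `(timestamp, value)` sample shape, and the threshold are assumptions; only the 30-minute window comes from the list above.

```python
from datetime import datetime, timedelta
from typing import Sequence, Tuple

Sample = Tuple[datetime, float]  # (timestamp, metric value such as error rate)


def stable_since_fix(
    samples: Sequence[Sample],
    fix_time: datetime,
    now: datetime,
    threshold: float,
    window: timedelta = timedelta(minutes=30),
) -> bool:
    """True if the window has elapsed and all post-fix samples are within threshold."""
    if now - fix_time < window:
        return False  # still inside the monitoring window
    post_fix = [value for ts, value in samples if ts >= fix_time]
    # Require at least one sample: no data is not evidence of recovery.
    return bool(post_fix) and all(value <= threshold for value in post_fix)
```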
6. Post-Mortem
Within 48 hours, document: summary, timeline, root cause, impact, what went well, what went wrong, action items.
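A skeleton generator helps every post-mortem start with the same sections. In this sketch the function and constant names are hypothetical, the plain-text layout is illustrative, and the 48-hour window is assumed to start at resolution time, which the guide does not specify.

```python
from datetime import datetime, timedelta

# Section names taken from the post-mortem checklist.
SECTIONS = [
    "Summary",
    "Timeline",
    "Root cause",
    "Impact",
    "What went well",
    "What went wrong",
    "Action items",
]

# Assumption: the 48 hours are counted from when the incident was resolved.
POSTMORTEM_WINDOW = timedelta(hours=48)


def postmortem_skeleton(title: str, resolved_at: datetime) -> str:
    """Return an empty post-mortem document with a due date and all sections."""
    due = resolved_at + POSTMORTEM_WINDOW
    lines = [f"Post-mortem: {title}", f"Due: {due:%Y-%m-%d %H:%M}", ""]
    for section in SECTIONS:
        lines.extend([f"{section}:", "TODO", ""])
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    print(postmortem_skeleton("Checkout outage", datetime(2024, 1, 1, 12, 0)))
```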
On-Call Responsibilities
| Responsibility | Target |
|---|---|
| Acknowledge alerts | < 5 min |
| Initial triage | < 10 min |
| Escalate if blocked | Immediately |
| Document all actions | Real-time |
| Hand off properly | Brief next person |
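The two timed targets in the table can be checked after the fact from incident timestamps. This is a sketch with hypothetical names, and it assumes both targets are measured from the moment the alert fired, which the table leaves open.

```python
from datetime import datetime, timedelta
from typing import List

ACK_TARGET = timedelta(minutes=5)      # "Acknowledge alerts: < 5 min"
TRIAGE_TARGET = timedelta(minutes=10)  # "Initial triage: < 10 min"


def missed_targets(
    alerted_at: datetime, acknowledged_at: datetime, triaged_at: datetime
) -> List[str]:
    """Return the names of on-call targets that were missed.

    Assumption: both targets are measured from the time the alert fired.
    """
    missed = []
    if acknowledged_at - alerted_at >= ACK_TARGET:
        missed.append("acknowledge")
    if triaged_at - alerted_at >= TRIAGE_TARGET:
        missed.append("triage")
    return missed
```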
Communication Templates
- Investigating: "We're investigating reports of [issue]. More updates to follow."
- Identified: "We've identified the issue as [cause]. Working on a fix."
- Monitoring: "Fix has been deployed. Monitoring for stability."
- Resolved: "The issue has been resolved. Services are operating normally."
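The templates above can be stored as format strings so that updates stay consistent across incidents. A minimal sketch, assuming the hypothetical names `TEMPLATES` and `render_update`; the wording is copied from the list, with the bracketed placeholders turned into named fields.

```python
# Template text copied from the list; [issue] and [cause] become named fields.
TEMPLATES = {
    "investigating": "We're investigating reports of {issue}. More updates to follow.",
    "identified": "We've identified the issue as {cause}. Working on a fix.",
    "monitoring": "Fix has been deployed. Monitoring for stability.",
    "resolved": "The issue has been resolved. Services are operating normally.",
}


def render_update(stage: str, **fields: str) -> str:
    """Fill in the template for a stage; raises KeyError on a missing stage or field."""
    return TEMPLATES[stage].format(**fields)


if __name__ == "__main__":
    print(render_update("investigating", issue="failed logins"))
    print(render_update("identified", cause="an expired certificate"))
```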
Next Steps
- Rollback Procedures - Execute emergency rollbacks
- CD Model Stage 11 - Live monitoring
- Deployment Strategies - Deployment patterns