Rollback Procedures
How to plan and execute production rollbacks.
When to Roll Back
| Trigger | Action | Decision Time |
|---|---|---|
| Error rate > 1% | Immediate rollback | < 2 min |
| P95 latency > 2x baseline | Immediate rollback | < 2 min |
| Health checks failing | Immediate rollback | < 1 min |
| Critical bug discovered | Immediate rollback | < 5 min |
| Business metrics down 10%+ | Evaluate, likely rollback | < 15 min |
Pre-Deployment Checklist
Before every deployment, ensure:
- [ ] Previous version artifacts available
- [ ] Rollback procedure documented
- [ ] Database rollback plan ready (if schema changed)
- [ ] Monitoring dashboards visible
- [ ] On-call team notified
- [ ] Communication channels ready
Rollback Steps
Step 1: Detect Issue
# Check error rates
curl -s https://monitoring.example.com/api/errors | jq '.rate'
# Check health endpoints
curl -s https://api.example.com/health
Step 2: Decide to Roll Back
Decision criteria:
- Error rate exceeds threshold
- Latency exceeds threshold
- Health checks failing
- Critical user impact
Step 3: Execute Rollback
For Blue-Green:
# Switch traffic back to blue
kubectl patch service app \
-p '{"spec":{"selector":{"version":"v1.1.0"}}}'
For Rolling Deployment:
For Canary:
Step 4: Verify Rollback
# Check pods are running previous version
kubectl get pods -o jsonpath='{.items[*].spec.containers[0].image}'
# Verify health
curl -s https://api.example.com/health
# Check error rates returning to normal
watch 'curl -s https://monitoring.example.com/api/errors | jq .rate'
Step 5: Communicate
- Update status page
- Notify stakeholders
- Alert support team
- Document incident
Database Rollback Considerations
| Schema Change | Rollback Approach |
|---|---|
| Additive (new columns) | Safe - previous code ignores new columns |
| Destructive (drop columns) | Requires data restore |
| Transformative (rename) | Requires backward-compatible migration |
Best Practice: Use expand-contract migrations
- Expand: Add new column, keep old
- Deploy: New code uses both
- Migrate: Copy data from old to new
- Contract: Remove old column
Feature Flag Rollback
Fastest rollback when using feature flags:
# Disable feature instantly
curl -X PUT https://flags.example.com/api/flags/new-checkout \
-d '{"enabled": false}'
No redeployment required.
Post-Rollback Actions
- Stabilize - Confirm system is healthy
- Investigate - Root cause analysis
- Document - Record incident details
- Fix - Address root cause
- Retest - Validate fix in pre-production
- Redeploy - When ready and tested
Rollback Time Objectives
| Environment | Target RTO |
|---|---|
| Production (critical) | < 5 minutes |
| Production (standard) | < 15 minutes |
| Staging | < 30 minutes |
Practice Rollbacks
Schedule regular rollback drills:
- Monthly: Execute rollback in staging
- Quarterly: Execute rollback in production (during low-traffic)
- Document learnings and update procedures
Next Steps
- Deployment Strategies - Choose the right deployment approach
- Production Deployment - CD Model Stage 10
- Incident Response - Handle production incidents
- Feature Flags - Instant rollback via feature toggles
Tutorials | How-to Guides | Explanation | Reference
You are here: Explanation — understanding-oriented discussion that clarifies concepts.