# Incident Management
Comprehensive guide to incident response, root cause analysis, post-mortems, and building resilient incident management processes.
## Table of Contents
- [Incident Lifecycle](#incident-lifecycle)
- [Severity Classification](#severity-classification)
- [Incident Response Roles](#incident-response-roles)
- [Communication Protocols](#communication-protocols)
- [Root Cause Analysis](#root-cause-analysis)
- [Post-Incident Reviews](#post-incident-reviews)
- [On-Call Management](#on-call-management)
- [Incident Management Tools](#incident-management-tools)
- [Runbook Development](#runbook-development)
- [Metrics and Improvement](#metrics-and-improvement)
## Incident Lifecycle
### Incident States
```
DETECTED → ACKNOWLEDGED → INVESTIGATING → IDENTIFIED → RESOLVING → RESOLVED → CLOSED
Detection:
- Automated alert fired
- User report received
- Proactive monitoring identified anomaly
Acknowledgment:
- On-call engineer confirms receipt
- Target: < 5 minutes for P1
Investigation:
- Gather symptoms and evidence
- Check recent changes
- Review logs and metrics
- Identify affected components
Identification:
- Root cause hypothesis formed
- Scope of impact determined
- Fix or workaround identified
Resolution:
- Implement fix or workaround
- Validate service restoration
- Monitor for recurrence
Closure:
- Confirm no further impact
- Document resolution
- Schedule post-incident review
```
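The lifecycle above is effectively a small state machine, and some teams encode it in tooling so an incident record can only advance through valid stages. A minimal sketch of that idea; the `IncidentState` enum and `ALLOWED_TRANSITIONS` map below are illustrative and not part of any specific incident tool:
```python
# Minimal sketch: the incident lifecycle as an explicit state machine.
# States mirror the diagram above; the transition map enforces forward-only moves.
from enum import Enum


class IncidentState(Enum):
    DETECTED = "detected"
    ACKNOWLEDGED = "acknowledged"
    INVESTIGATING = "investigating"
    IDENTIFIED = "identified"
    RESOLVING = "resolving"
    RESOLVED = "resolved"
    CLOSED = "closed"


# Each state may only advance to the next stage of the lifecycle.
ALLOWED_TRANSITIONS = {
    IncidentState.DETECTED: {IncidentState.ACKNOWLEDGED},
    IncidentState.ACKNOWLEDGED: {IncidentState.INVESTIGATING},
    IncidentState.INVESTIGATING: {IncidentState.IDENTIFIED},
    IncidentState.IDENTIFIED: {IncidentState.RESOLVING},
    IncidentState.RESOLVING: {IncidentState.RESOLVED},
    IncidentState.RESOLVED: {IncidentState.CLOSED},
    IncidentState.CLOSED: set(),
}


def transition(current: IncidentState, new: IncidentState) -> IncidentState:
    """Return the new state, or raise if the move skips a lifecycle stage."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"Illegal transition: {current.value} -> {new.value}")
    return new


# Example: an alert fires and the on-call acknowledges it.
state = transition(IncidentState.DETECTED, IncidentState.ACKNOWLEDGED)
print(state)  # IncidentState.ACKNOWLEDGED
```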
### Incident Response Workflow
```yaml
Phase 1: Detection & Triage (0-5 minutes)
Actions:
- Alert fires or user reports issue
- On-call acknowledges within 5 minutes
- Initial severity assessment
- Create incident ticket
- Page additional responders if needed
Key Questions:
- What is the user-facing impact?
- How many users/customers affected?
- Is data at risk?
- Is this getting worse?
Phase 2: Investigation (5-30 minutes)
Actions:
- Join incident war room (Slack/Zoom)
- Assign Incident Commander role
- Review recent changes (deployments, configs)
- Check monitoring dashboards
- Query logs for errors
- Trace affected requests
- Form initial hypothesis
Key Questions:
- What changed recently?
- What do logs show?
- Are dependencies healthy?
- Can we reproduce the issue?
Phase 3: Mitigation (30-60 minutes)
Actions:
- Implement fix or workaround
- Roll back recent changes if needed
- Scale resources if capacity issue
- Enable feature flags to disable problematic code
- Communicate status to stakeholders
Decision Framework:
- Fast workaround vs proper fix
- Rollback vs roll forward
- Partial restoration vs full fix
Phase 4: Recovery (60+ minutes)
Actions:
- Validate service metrics returned to normal
- Confirm user-facing functionality restored
- Monitor for recurrence (30-60 min)
- Gradually remove temporary workarounds, if any were applied
- Clear alerts
Success Criteria:
- Error rate back to normal baseline
- Latency and availability within SLO
- No related alerts for 30-60 minutes
```
## Severity Classification
### Priority Levels
```yaml
Priority 1 (Critical):
Description: Complete outage or critical functionality unavailable
User Impact: All or nearly all users unable to use the service
Business Impact: Revenue loss > $10K/hour, major SLA breach
Data Risk: Active data loss or corruption
Response:
- Acknowledge: < 5 minutes
- Initial Response: Immediate (24/7)
- Communication: Every 30 minutes
- Escalation: Immediate to leadership
- All hands on deck: Yes
Examples:
- Website completely down (503 errors)
- Database unavailable
- Payment processing offline
- Security breach in progress
- Data corruption affecting production
Priority 2 (High):
Description: Major functionality degraded or affecting many users
User Impact: Significant portion of users impacted
Business Impact: Revenue loss $1K-$10K/hour, SLA at risk
Data Risk: Potential data issues if not resolved
Response:
- Acknowledge: < 15 minutes
- Initial Response: < 30 minutes
- Communication: Every hour
- Escalation: To leadership if not resolved in 2 hours
- All hands: If needed
Examples:
- Checkout flow broken for a subset of users
- Significant API latency degradation
- Elevated error rate (> 5%)
Priority 3 (Medium):
Description: Partial degradation or minor functionality impaired
User Impact: Some users affected, workaround available
Business Impact: Minimal revenue impact
Data Risk: No data at risk
Response:
- Acknowledge: < 4 hours (business hours)
- Initial Response: Same business day
- Communication: Daily updates
- Escalation: If not resolved in 1 business day
- All hands: No
Examples:
- Non-critical feature broken
- Performance degradation in secondary service
- Intermittent errors affecting < 1% of requests
- UI display issues
Priority 4 (Low):
Description: Minor issues, cosmetic problems, or enhancement requests
User Impact: Minimal or no user impact
Business Impact: No business impact
Data Risk: None
Response:
- Acknowledge: < 1 business day
- Initial Response: As prioritized in backlog
- Communication: On resolution
- Escalation: None
- All hands: No
Examples:
- Cosmetic UI issues
- Typos or documentation gaps
- Minor tooling or logging improvements
```
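The priority definitions above can also be approximated in tooling to suggest a starting severity during triage. A minimal sketch, assuming hypothetical impact inputs gathered in Phase 1 (share of users affected, estimated revenue loss per hour, data-risk flag); the numeric user-impact thresholds are illustrative interpretations of the table, and final classification should remain a human decision:
```python
# Hypothetical triage helper: suggests an initial priority from impact estimates.
# Revenue thresholds follow the P1-P4 definitions above; user-impact cutoffs are assumptions.
def suggest_priority(pct_users_affected: float,
                     revenue_loss_per_hour: float,
                     data_at_risk: bool) -> str:
    if data_at_risk or pct_users_affected >= 90 or revenue_loss_per_hour > 10_000:
        return "P1"  # complete outage, active data risk, or > $10K/hour
    if pct_users_affected >= 20 or revenue_loss_per_hour > 1_000:
        return "P2"  # major degradation, $1K-$10K/hour
    if pct_users_affected > 0:
        return "P3"  # partial degradation, workaround available
    return "P4"      # cosmetic or no user impact


# Example: every user affected, ~$15K/hour at risk, no data loss -> "P1"
print(suggest_priority(100, 15_000, False))
```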
## Root Cause Analysis
### Fault Tree Analysis
Fault tree analysis works backward from the user-visible failure to its contributing causes, combining branches with AND/OR gates until a root cause path emerges.
```
FAULT TREE: 2025-01-15 Website Outage

Website Unavailable
└─ Backend Servers Down
   └─ All Servers Unhealthy
      └─ Health Check Failing
         └─ Response Time > Timeout
            └─ Connection Pool Full  (OR)
               ├─ Timeout Too High ← Deploy Failed (config error)
               ├─ Slow Query
               └─ Database Connection Pool Size Too Small
ROOT CAUSE PATH (highlighted):
Website Unavailable ← Backend Servers Down ← All Servers Unhealthy
← Health Check Failing ← Response Time > Timeout ← Connection Pool Full
← Timeout Too High ← Deploy Failed (config error)
```
### Timeline Analysis
```yaml
Incident Timeline: 2025-01-15 API Outage
14:28:00 - Deployment started (v2.3.1)
14:30:00 - Deployment completed successfully
14:32:00 - First spike in error rate (5% → 20%)
14:32:30 - Error rate continues climbing (20% → 50%)
14:33:00 - Complete outage (100% errors)
14:33:15 - Automated alert fired: "High error rate"
14:33:45 - PagerDuty page sent to on-call
14:35:00 - On-call acknowledged alert [MTTA: 1m 45s]
14:36:00 - Incident declared P1, war room created
14:37:00 - Dashboard review shows all backend servers marked unhealthy
14:38:00 - Health check logs show timeouts
14:40:00 - Application logs show slow responses on /health endpoint
14:42:00 - Database connection pool exhaustion identified
14:45:00 - Recent deployment suspected
14:48:00 - Config diff reveals timeout change: 5s → 60s
14:50:00 - Decision made to rollback
14:52:00 - Rollback initiated
14:55:00 - Rollback completed
14:56:00 - Error rate dropping (100% → 10%)
14:58:00 - Error rate normal (< 1%)
```
## Post-Incident Reviews
### Post-Mortem Template
```markdown
# Post-Mortem: 2025-01-15 API Outage
## Summary
A configuration change shipped in v2.3.1 raised the health check timeout from 5s to 60s, which exhausted the database connection pool and caused every backend server to be marked unhealthy. The resulting outage lasted roughly 25 minutes (14:33-14:58) and was resolved by rolling back the deployment.
## Action Items
### Short-term (1 month)
- [ ] **Audit timeout configuration across the stack** [Owner: Backend Team] [Due: 2025-02-15]
- Ensure load balancer timeout > app timeout
- Document timeout hierarchy
- Automate timeout configuration
- [ ] **Implement gradual rollout** [Owner: Platform Team] [Due: 2025-02-28]
- Canary deployments (10% → 50% → 100%)
- Automatic rollback on error rate increase
- Deployment gates based on metrics
### Long-term (3 months)
- [ ] **Separate config deployment pipeline** [Owner: Architecture Team] [Due: 2025-04-15]
- Configuration changes reviewed by ops team
- Gradual rollout for config changes
- Different approval process than code
- [ ] **Implement synthetic monitoring** [Owner: Observability Team] [Due: 2025-04-15]
- Proactive health checks from external monitors
- Alert before complete outage
- Detect gradual degradation earlier
- [ ] **Capacity planning framework** [Owner: SRE Team] [Due: 2025-04-30]
- Document sizing guidelines for connection pools
- Load testing requirements
- Automated capacity recommendations
## Lessons Learned
1. **Configuration is Code**: Configuration changes should be treated with the same rigor as code changes, including review, testing, and validation.
2. **Test in Staging**: A production-like staging environment would have caught this issue before it reached production.
3. **Cascading Failures**: Small changes (timeout adjustment) can have large, unexpected effects. Better understanding of system interactions is needed.
4. **Alerting Gaps**: We alerted on symptoms (errors) but not leading indicators (connection pool saturation). Adding more proactive monitoring would enable earlier intervention.
5. **Response Worked Well**: Despite the outage, our incident response process performed admirably. Clear roles, good communication, and decisive action led to fast resolution.
## Appendix
### Supporting Data
- [Link to Grafana Dashboard during incident]
- [Link to error logs]
- [Link to deployment change log]
- [Link to war room Slack thread]
### Glossary
- **MTTA**: Mean Time to Acknowledge
- **MTTR**: Mean Time to Recovery
- **SLO**: Service Level Objective
### Related Incidents
- 2024-11-03: Database connection pool exhaustion (different cause)
- 2024-09-12: Health check timeout issues on Redis
### Review Attendees
- Jane Doe (Incident Commander)
- John Smith (Technical Lead)
- Alice Johnson (Engineering Manager)
- Bob Wilson (Product Manager)
- Carol Martinez (Customer Success)
```
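Action items in the template above follow a consistent pattern (`- [ ] **Title** [Owner: X] [Due: YYYY-MM-DD]`), which makes it easy to script an overdue check that feeds the action-item completion metric later in this guide. A minimal sketch, assuming post-mortems live as markdown files in a `postmortems/` directory; the directory name and regex are illustrative:
```python
# Hypothetical helper: scan post-mortem markdown files for open action items
# that are past their due date. The "[Owner: ...] [Due: YYYY-MM-DD]" format
# matches the template above; the file layout is an assumption.
import re
from datetime import date
from pathlib import Path

ACTION_ITEM = re.compile(
    r"- \[( |x)\] \*\*(?P<title>.+?)\*\* "
    r"\[Owner: (?P<owner>.+?)\] \[Due: (?P<due>\d{4}-\d{2}-\d{2})\]"
)


def overdue_items(postmortem_dir: str = "postmortems") -> list:
    today = date.today()
    overdue = []
    for path in Path(postmortem_dir).glob("*.md"):
        for match in ACTION_ITEM.finditer(path.read_text()):
            is_open = match.group(1) == " "
            if is_open and date.fromisoformat(match.group("due")) < today:
                overdue.append({
                    "file": path.name,
                    "title": match.group("title"),
                    "owner": match.group("owner"),
                    "due": match.group("due"),
                })
    return overdue


if __name__ == "__main__":
    for item in overdue_items():
        print(f"OVERDUE: {item['title']} (owner: {item['owner']}, due: {item['due']})")
```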
### Post-Incident Review Meeting Agenda
```yaml
Duration: 60 minutes
00:00-00:05 (5 min): Intro and Ground Rules
- Reminder: Blameless, focus on learning
- Goal: Improve systems and processes
- Everyone encouraged to participate
00:05-00:15 (10 min): Timeline Walkthrough
- Incident Commander presents timeline
- Highlight key events
- Clarifying questions only (no analysis yet)
00:15-00:25 (10 min): What Went Well
- Celebrate successes
- What should we keep doing?
- What practices helped us respond effectively?
00:25-00:40 (15 min): What Could Be Improved
- Open discussion
- What could we have done differently?
- What prevented faster detection/resolution?
- Surface systemic issues
00:40-00:55 (15 min): Action Items
- Brainstorm improvements
- Prioritize by impact and effort
- Assign owners and due dates
- Ensure items are specific and actionable
00:55-01:00 (5 min): Wrap-up
- Review action items
- Schedule follow-ups
- Thank participants
```
## On-Call Management
### On-Call Rotation Best Practices
```yaml
Rotation Schedule:
Primary On-Call:
Duration: 1 week (Monday-Monday)
Responsibilities: First responder for all alerts
Compensation: Stipend + time off
Secondary On-Call:
Duration: 1 week (Monday-Monday)
Responsibilities: Backup for primary, escalation target
Compensation: Stipend
Rotation Size:
Minimum: 4 engineers (2 weeks between shifts)
Recommended: 6-8 engineers (4-6 weeks between shifts)
Maximum: 12 engineers (risk of losing context)
On-Call Eligibility:
Requirements:
- Completed onboarding (30+ days)
- Shadowed 2+ on-call shifts
- Can access production systems
- Familiar with monitoring and alerting
- Completed incident response training
Opt-out Reasons:
- Vacation (blackout dates)
- Personal circumstances
- Heavy project deadlines (pre-approved)
Schedule Management:
Tool: PagerDuty, OpsGenie, or similar
Visibility: Published 6 weeks in advance
Changes: Self-service swap with approval
Coverage: 24/7 for P1/P2, business hours for P3/P4
Escalation Policy:
Level 1: Primary On-Call (0-5 min)
Level 2: Secondary On-Call (5-15 min)
Level 3: Team Lead (15-30 min)
Level 4: Engineering Manager (30-60 min)
Level 5: VP Engineering (60+ min, P1 only)
```
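Publishing the schedule six weeks in advance is easiest when it is generated from the team roster rather than maintained by hand. A minimal sketch, assuming Monday-to-Monday weekly shifts and a simple round-robin with primary and secondary offset by one week; the roster names are purely illustrative, and real schedules would come from PagerDuty/Opsgenie:
```python
# Illustrative round-robin generator for weekly Monday-to-Monday shifts.
# Primary and secondary are offset so nobody covers both roles in the same week.
from datetime import date, timedelta


def build_rotation(engineers: list, start_monday: date, weeks: int = 6) -> list:
    schedule = []
    for week in range(weeks):
        shift_start = start_monday + timedelta(weeks=week)
        schedule.append({
            "week_of": shift_start.isoformat(),
            "primary": engineers[week % len(engineers)],
            "secondary": engineers[(week + 1) % len(engineers)],
        })
    return schedule


if __name__ == "__main__":
    roster = ["alice", "bob", "carol", "dave", "erin", "frank"]  # 6 engineers
    # Assume the next rotation starts on a Monday, e.g. 2025-02-03.
    for shift in build_rotation(roster, date(2025, 2, 3)):
        print(shift)
```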
### On-Call Runbook
```markdown
# On-Call Engineer Guide
## Before Your Shift
### 48 Hours Before
- [ ] Review the on-call schedule
- [ ] Identify your backup (secondary on-call)
- [ ] Block calendar for any potential incident response
- [ ] Review recent incidents and ongoing issues
### 24 Hours Before
- [ ] Test laptop and VPN access
- [ ] Test PagerDuty app notifications
- [ ] Ensure mobile phone is charged
- [ ] Review monitoring dashboards
- [ ] Check for scheduled deployments or maintenance
### Start of Shift
- [ ] Post in #on-call channel: "Starting on-call shift"
- [ ] Review open incidents and alerts
- [ ] Check upcoming changes in deployment calendar
- [ ] Skim through recent post-mortems
- [ ] Verify access to all critical systems
## During Your Shift
### When Alert Fires
1. **Acknowledge** (within 5 minutes for P1/P2)
- Open PagerDuty alert
- Click "Acknowledge"
- Alert stops paging
2. **Initial Assessment** (first 5 minutes)
- Read alert description
- Check alert dashboard link
- Assess severity (is P1 correct?)
- Check recent changes (deployment, config)
3. **Decide Next Steps**
- If **clear fix**: Implement and monitor
- If **quick rollback**: Execute rollback
- If **unclear**: Declare incident and get help
### When to Declare Incident
Declare incident (create war room) if:
- You're unsure how to fix (need help)
- User impact is significant
- Will take > 30 minutes to resolve
- Multiple systems affected
### When to Escalate
Escalate to secondary on-call if:
- You're overwhelmed (multiple alerts)
- You need specific expertise
- You're stuck (30+ min no progress)
Escalate to manager if:
- P1 incident lasting > 1 hour
- User data at risk
- Security incident
- Need executive decision
### Alert Hygiene
After each alert:
- [ ] Update incident ticket with resolution
- [ ] Mark alert as "resolved" in PagerDuty
- [ ] If false positive: Create ticket to tune alert
- [ ] If new issue: Create ticket to fix root cause
- [ ] Update runbook if you learned something new
## End of Shift
### Handoff Checklist
- [ ] Post in #on-call: "Ending on-call shift"
- [ ] List open incidents and their status
- [ ] Note any ongoing issues or concerns
- [ ] Mention scheduled changes in next 24 hours
- [ ] Thank the incoming on-call as they take over
### Feedback and Improvement
- [ ] Log toil reduction opportunities
- [ ] Update runbooks based on what you learned
- [ ] File tickets for alert improvements
- [ ] Provide feedback on on-call process
## Common Scenarios
### Scenario: High Error Rate Alert
1. Check dashboard: Which service? Which endpoint?
2. Check recent deployments: Anything in last hour?
3. Check logs: What errors are users seeing?
4. If recent deployment: Consider rollback
5. If not recent: Investigate dependencies
### Scenario: High Latency Alert
1. Check dashboard: Which percentile? How high?
2. Check database: Slow queries? Connection pool full?
3. Check dependencies: External APIs slow?
4. Check traffic: Unusual spike in requests?
5. Consider scaling if capacity issue
### Scenario: Service Down Alert
1. Check monitoring: Complete outage or partial?
2. Check infrastructure: Servers running? Network okay?
3. Check recent changes: Deployment? Config change?
4. Restart if safe (stateless services)
5. Rollback if recent deployment
## Emergency Contacts
Primary Escalation:
- Secondary On-Call: [PagerDuty escalation]
- Team Lead: [Phone number]
- Engineering Manager: [Phone number]
SMEs (Subject Matter Experts):
- Database: [Name, phone]
- Networking: [Name, phone]
- Security: [Name, phone]
- Cloud Infrastructure: [Name, phone]
External:
- Cloud Provider Support: [Phone, ticket system]
- Third-party Vendor Support: [Phone, ticket system]
## Useful Links
Dashboards:
- [Overall System Health Dashboard]
- [Service-Specific Dashboards]
- [Infrastructure Dashboard]
Runbooks:
- [Runbook Index]
- [Common Incident Scenarios]
- [Rollback Procedures]
Tools:
- [PagerDuty Incidents]
- [Grafana Dashboards]
- [Log Aggregation (ELK/Splunk)]
- [Deployment Tool]
- [ChatOps (Slack)]
```
## Incident Management Tools
### Tool Comparison
| Feature | PagerDuty | Opsgenie | Splunk On-Call | Incident.io | FireHydrant |
|---------|-----------|----------|----------------|-------------|-------------|
| **Alerting** | ✓✓✓ | ✓✓✓ | ✓✓✓ | ✓✓ | ✓✓ |
| **On-Call Scheduling** | ✓✓✓ | ✓✓✓ | ✓✓✓ | ✓✓ | ✓✓ |
| **Incident Timeline** | ✓✓ | ✓✓ | ✓ | ✓✓✓ | ✓✓✓ |
| **Status Page Integration** | ✓✓✓ | ✓✓ | ✓✓ | ✓✓✓ | ✓✓✓ |
| **Post-Mortem Templates** | ✓ | ✓ | ✓ | ✓✓✓ | ✓✓✓ |
| **Slack Integration** | ✓✓✓ | ✓✓✓ | ✓✓✓ | ✓✓✓ | ✓✓✓ |
| **Pricing** | $$$$ | $$$ | $$$ | $$$$ | $$$ |
| **Best For** | Mature teams | Mid-size teams | Splunk users | Modern incident mgmt | Modern incident mgmt |
### PagerDuty Configuration Example
```python
# PagerDuty API - Create Incident
import requests
import json
PAGERDUTY_API_KEY = "YOUR_API_KEY"
PAGERDUTY_EMAIL = "[email protected]"
def create_incident(title, description, urgency="high", service_id="SERVICE_ID"):
    """Create a PagerDuty incident"""
    url = "https://api.pagerduty.com/incidents"
    headers = {
        "Authorization": f"Token token={PAGERDUTY_API_KEY}",
        "Content-Type": "application/json",
        "From": PAGERDUTY_EMAIL,
    }
    payload = {
        "incident": {
            "type": "incident",
            "title": title,
            "service": {
                "id": service_id,
                "type": "service_reference",
            },
            "urgency": urgency,
            "body": {
                "type": "incident_body",
                "details": description,
            },
        }
    }

    response = requests.post(url, headers=headers, json=payload)
    if response.status_code == 201:
        incident = response.json()["incident"]
        print(f"Incident created: {incident['html_url']}")
        return incident
    else:
        print(f"Error: {response.status_code}")
        print(response.text)
        return None


# Example: Create incident from monitoring alert
if __name__ == "__main__":
    create_incident(
        title="High Error Rate on API",
        description="Error rate exceeded 5% for 10 minutes. Dashboard: https://grafana.example.com/...",
        urgency="high",
    )
```
## Runbook Development
### Runbook Template
```markdown
# Runbook: [Service Name] - [Scenario]
## Service Overview
- **Service**: API Gateway
- **Team**: Backend Team
- **On-Call**: #backend-oncall
- **SME**: John Smith ([email protected])
## Purpose
This runbook covers troubleshooting and recovery procedures for the API Gateway service.
## Architecture
```
[Include architecture diagram or ASCII art]
External Clients → API Gateway → Backend Services → Database
                       ↓
                 Rate Limiter
                 Auth Service
```
## SLIs/SLOs
- **Availability**: 99.9% (43 minutes downtime/month)
- **Latency (p95)**: < 500ms
- **Error Rate**: < 1%
## Common Issues
### Issue 1: High Error Rate (5xx Responses)
**Symptoms**:
- Alert: "HighErrorRate" firing
- Dashboard shows error rate > 5%
- Users reporting "Service Unavailable" errors
**Possible Causes**:
1. Backend services down or unhealthy
2. Database connection issues
3. Recent deployment issue
4. Upstream dependency failure
**Diagnostic Steps**:
```bash
# 1. Check backend service health
kubectl get pods -n backend
kubectl describe pod <pod-name> -n backend
# 2. Check API Gateway logs
kubectl logs -f deployment/api-gateway -n gateway --tail=100
# 3. Check recent deployments
kubectl rollout history deployment/api-gateway -n gateway
# 4. Check database connections
# (Connect to app pod and run)
kubectl exec -it <pod-name> -n backend -- /bin/sh
netstat -an | grep 5432 | grep ESTABLISHED | wc -l
# 5. Check upstream dependencies
curl https://auth-service/health
curl https://payment-service/health
```
**Resolution Steps**:
If recent deployment (last 30 minutes):
```bash
# Rollback deployment
kubectl rollout undo deployment/api-gateway -n gateway
kubectl rollout status deployment/api-gateway -n gateway
# Verify error rate dropping
# Check dashboard: https://grafana.example.com/d/api-gateway
```
If database connection issue:
```bash
# Restart API Gateway pods (will reset connection pools)
kubectl rollout restart deployment/api-gateway -n gateway
# Monitor for improvement
watch kubectl get pods -n gateway
```
If upstream dependency down:
```bash
# Check status pages of dependencies
# Escalate to owning team
# Consider enabling fallback mode if available
```
**Escalation**:
- If not resolved in 15 minutes: Page secondary on-call
- If backend services issue: Page backend team
- If database issue: Page database team
### Issue 2: High Latency
**Symptoms**:
- Alert: "HighLatency" firing
- Dashboard shows p95 latency > 1000ms
- Users reporting slow responses
**Possible Causes**:
1. Database slow queries
2. High traffic / insufficient capacity
3. Downstream service latency
4. Memory/CPU saturation
**Diagnostic Steps**:
```bash
# 1. Check pod resources
kubectl top pods -n gateway
# 2. Check HPA status (auto-scaling)
kubectl get hpa -n gateway
# 3. Check slow queries
# (Connect to database)
SELECT pid, query, query_start, state
FROM pg_stat_activity
WHERE state != 'idle'
AND (now() - query_start) > interval '5 seconds'
ORDER BY query_start;
# 4. Check downstream services
curl -w "@curl-format.txt" https://service-a/health
curl -w "@curl-format.txt" https://service-b/health
```
**Resolution Steps**:
If capacity issue (CPU/memory high):
```bash
# Scale up deployment
kubectl scale deployment/api-gateway -n gateway --replicas=10
# Or wait for HPA to scale (if configured)
kubectl get hpa -n gateway -w
```
If slow database queries:
```sql
-- Kill long-running query (use with caution)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE pid = <pid>;
```
### Issue 3: Complete Service Outage
**Symptoms**:
- Alert: "ServiceDown" firing
- Dashboard shows 0 requests/sec
- All health checks failing
**Immediate Actions**:
1. Declare P1 incident
2. Create war room: #incident-YYYY-MM-DD-api-outage
3. Page backup on-call and team lead
**Diagnostic Steps**:
```bash
# 1. Check if pods are running
kubectl get pods -n gateway
# 2. Check deployment status
kubectl get deployment api-gateway -n gateway
# 3. Check recent changes
git log --since="1 hour ago" --oneline
# 4. Check infrastructure (nodes, network)
kubectl get nodes
kubectl describe node <node-name>
```
**Resolution Steps**:
[Detailed recovery steps based on cause]
## Related Runbooks
- [Database Troubleshooting Runbook](link)
- [Kubernetes Troubleshooting Runbook](link)
- [Rollback Procedures](link)
## Useful Dashboards
- [API Gateway Dashboard](https://grafana.example.com/d/api-gateway)
- [Backend Services Dashboard](https://grafana.example.com/d/backend)
- [Infrastructure Dashboard](https://grafana.example.com/d/infrastructure)
## Useful Commands
```bash
# Check logs
kubectl logs -f deployment/api-gateway -n gateway --tail=100
# Get shell in pod
kubectl exec -it <pod-name> -n gateway -- /bin/bash
# Port forward to local machine
kubectl port-forward deployment/api-gateway 8080:8080 -n gateway
# Describe resource
kubectl describe pod <pod-name> -n gateway
# Check events
kubectl get events -n gateway --sort-by='.lastTimestamp'
```
## Recent Changes
- 2025-01-10: Added HPA configuration (autoscaling)
- 2024-12-15: Increased connection pool size to 50
- 2024-11-20: Updated rollback procedure
## Document Info
- **Last Updated**: 2025-01-15
- **Owner**: Backend Team
- **Review Cycle**: Monthly
```
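The template's "Last Updated" field and monthly review cycle lend themselves to an automated staleness check, which also supports the runbook coverage metric tracked in the next section. A minimal sketch, assuming runbooks are markdown files containing a `**Last Updated**: YYYY-MM-DD` line; the directory name and 31-day threshold are assumptions:
```python
# Hypothetical audit script: flag runbooks whose "Last Updated" date is older
# than the monthly review cycle. File layout and threshold are assumptions.
import re
from datetime import date, timedelta
from pathlib import Path

LAST_UPDATED = re.compile(r"\*\*Last Updated\*\*: (\d{4}-\d{2}-\d{2})")
REVIEW_CYCLE = timedelta(days=31)


def stale_runbooks(runbook_dir: str = "runbooks") -> list:
    stale = []
    for path in Path(runbook_dir).glob("*.md"):
        match = LAST_UPDATED.search(path.read_text())
        if not match:
            stale.append((path.name, "missing Last Updated field"))
            continue
        updated = date.fromisoformat(match.group(1))
        if date.today() - updated > REVIEW_CYCLE:
            stale.append((path.name, f"last reviewed {updated.isoformat()}"))
    return stale


if __name__ == "__main__":
    for name, reason in stale_runbooks():
        print(f"NEEDS REVIEW: {name} ({reason})")
```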
## Metrics and Improvement
### Key Metrics to Track
```yaml
Incident Metrics:
MTTA (Mean Time to Acknowledge):
Definition: Time from alert to acknowledgment
Target: < 5 minutes for P1, < 15 minutes for P2
Calculation: Sum(ack_time - alert_time) / Count(incidents)
MTTI (Mean Time to Identify):
Definition: Time from acknowledgment to root cause identified
Target: < 30 minutes for P1, < 2 hours for P2
Calculation: Sum(identified_time - ack_time) / Count(incidents)
MTTR (Mean Time to Recovery):
Definition: Time from alert to resolution
Target: < 1 hour for P1, < 4 hours for P2
Calculation: Sum(resolved_time - alert_time) / Count(incidents)
MTBF (Mean Time Between Failures):
Definition: Average time between incidents for a given service
Target: > 720 hours (30 days)
Calculation: Total operational time / Count(incidents)
Quality Metrics:
Incident Recurrence Rate:
Definition: % of incidents that recur within 90 days
Target: < 10%
Calculation: Recurring incidents / Total incidents × 100
Action Item Completion Rate:
Definition: % of post-mortem action items completed by their due date
Target: > 90%
Calculation: Completed on time / Total action items × 100
Runbook Coverage:
Definition: % of services with up-to-date runbooks
Target: 100%
Calculation: Services with runbooks / Total services × 100
On-Call Metrics:
Alert Volume:
Definition: Number of alerts per on-call shift
Target: < 20 per week
Measurement: Count by week
False Positive Rate:
Definition: % of alerts that don't require action
Target: < 20%
Calculation: False positives / Total alerts × 100
After-Hours Pages:
Definition: Pages outside business hours
Target: < 2 per week
Review Cadence:
Weekly:
Alert Review:
- Review all alerts from the past week
- Identify noisy alerts (false positive rate > 20% = tune or remove)
- Update alert thresholds
- Create tickets for recurring issues
On-Call Feedback:
- Collect feedback from outgoing on-call
- Identify toil reduction opportunities
- Update runbooks based on learnings
Monthly:
Incident Retrospective:
- Review all incidents from past month
- Analyze trends (common causes, affected services)
- Track MTTA, MTTI, MTTR trends
- Review action item completion rate
Runbook Audit:
- Review and update all runbooks
- Test procedures
- Remove outdated information
Training:
- Onboard new on-call engineers
- Incident response drills/simulations
- Share learnings from recent incidents
Quarterly:
Metrics Review:
- Present incident metrics to leadership
- Track progress on reduction targets
- Identify systemic issues
- Celebrate improvements
Process Improvements:
- Review incident management process
- Gather team feedback
- Implement process changes
- Update documentation
```
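The MTTA and MTTR calculations above are simple averages over incident timestamps, so they are easy to compute from exported incident records. A minimal sketch; the field names (`alert_time`, `ack_time`, `resolved_time`) are illustrative and depend on your incident tool's export format:
```python
# Illustrative metric computation over exported incident records.
# Field names depend on the export format of your incident management tool.
from datetime import datetime
from statistics import mean

incidents = [
    {"alert_time": "2025-01-15T14:33:15", "ack_time": "2025-01-15T14:35:00",
     "resolved_time": "2025-01-15T14:58:00"},
    # ... more incidents
]


def minutes_between(start: str, end: str) -> float:
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 60


mtta = mean(minutes_between(i["alert_time"], i["ack_time"]) for i in incidents)
mttr = mean(minutes_between(i["alert_time"], i["resolved_time"]) for i in incidents)

print(f"MTTA: {mtta:.1f} minutes")  # ~1.8 min for the 2025-01-15 example above
print(f"MTTR: {mttr:.1f} minutes")  # time from alert to resolution
```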
This guide provides the core processes, templates, and tooling needed for effective incident response and continuous improvement.
Source: claude-code-templates (MIT).