Coordinate Incident Response Like a Pro — Incident Management
By the end of this page, you will understand how Incident Managers coordinate response, drive root cause analysis, and ensure preventive measures — and how AI agents can streamline incident coordination.
Incident Management — The 2-Minute Overview
Think about the last time you saw a fire department respond to an emergency. The fire captain doesn't fight the fire alone — they coordinate: assign teams to entry/ventilation/rescue, communicate with dispatch, make real-time decisions, and after the fire, lead the investigation into what happened and how to prevent it. That captain is the Incident Manager.
You Already Know Incident Management — You Just Don't Know It Yet
You've been an Incident Manager every time you handled a kitchen fire at home.
🔥 The Kitchen Fire Analogy
Step 1 — Coordinate: Turn off stove (stop the damage), open windows (reduce blast radius), call fire dept if needed (escalate).
🔗 IM Layer: ① COORDINATE — Assign roles, communicate status, contain the blast radius.
Step 2 — RCA: Why did it catch fire? Oil too hot? Left unattended? Burner malfunction?
🔗 IM Layer: ② ROOT CAUSE ANALYSIS — Drive the 5 Whys. Find the fundamental cause.
Step 3 — Prevent: Buy a fire extinguisher. Set timer when frying. Get burner inspected.
🔗 IM Layer: ③ PREVENTION — Ensure action items are tracked and implemented.
The Complete Mapping
| Kitchen Fire | Incident Management | Phase |
|---|---|---|
| Turn off stove, open windows | Contain blast radius, assign responders | ① Coordinate |
| "Oil too hot? Left unattended?" | Drive RCA: 5 Whys, timeline, evidence | ② Root Cause |
| Buy extinguisher, set timer | Track action items, verify implementation | ③ Prevent |
The 5 Pillars of Incident Management
1. Incident Coordination
The Incident Manager doesn't fix the system — they coordinate the people who do.
During an active incident: declare the incident (severity, scope), assign roles (incident commander, communications lead, technical lead), establish a war room (Slack channel, Zoom bridge), and provide regular status updates.
| Role | Responsibility | Who |
|---|---|---|
| Incident Commander | Makes decisions, prioritizes actions | Incident Manager |
| Technical Lead | Diagnoses and applies fixes | L2 / Senior Engineer |
| Communications Lead | Updates stakeholders and status page | Incident Manager or a designated delegate |
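The declaration and role assignment described above can be sketched as a small record. This is a minimal illustration, not a prescribed tool: the `Incident` class, its field names, the channel name, and the people's names are all hypothetical. Only the three role names come from the table.

```python
from dataclasses import dataclass, field

# Role names come from the table above; everything else is a hypothetical sketch.
REQUIRED_ROLES = ("incident_commander", "technical_lead", "communications_lead")


@dataclass
class Incident:
    title: str
    severity: str   # e.g. "P0" or "P1"
    war_room: str   # e.g. a Slack channel name
    roles: dict[str, str] = field(default_factory=dict)  # role -> person

    def assign(self, role: str, person: str) -> None:
        """Assign a person to one of the known roles."""
        if role not in REQUIRED_ROLES:
            raise ValueError(f"unknown role: {role}")
        self.roles[role] = person

    def missing_roles(self) -> list[str]:
        """Roles that still have nobody assigned."""
        return [r for r in REQUIRED_ROLES if r not in self.roles]


incident = Incident("Payment processing down", "P0", "#inc-payments")
incident.assign("incident_commander", "incident_manager_on_call")
incident.assign("technical_lead", "senior_engineer_on_call")
print(incident.missing_roles())  # ['communications_lead']
```

Checking `missing_roles()` right after declaring the incident is one way to make sure no role is silently left unfilled.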
2. Blameless Post-Mortem
A blame-ful post-mortem stops at "who." A blameless one asks "what about our system allowed this to happen?"
Conduct a post-mortem after every P0/P1 incident. Focus on systems and processes, not individuals. Document the timeline, root cause, impact, what went well, what went wrong, and action items.
| Section | Content | Purpose |
|---|---|---|
| Timeline | Minute-by-minute from detection to resolution | Understand the sequence |
| Root Cause | The fundamental system/process failure | Prevent recurrence |
| Impact | Users affected, revenue lost, SLO impact | Quantify the damage |
| Action Items | Specific, assigned, deadlined | Ensure follow-through |
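One way to keep a post-mortem from shipping with gaps is a simple completeness check over the sections listed above. The sketch below assumes the post-mortem is stored as a plain dictionary. That layout, the `missing_sections` helper, and the sample draft are hypothetical.

```python
# Section names follow the list above; the dict layout is a hypothetical choice.
REQUIRED_SECTIONS = (
    "timeline", "root_cause", "impact",
    "what_went_well", "what_went_wrong", "action_items",
)


def missing_sections(postmortem: dict) -> list[str]:
    """Return required sections that are absent or empty."""
    return [name for name in REQUIRED_SECTIONS if not postmortem.get(name)]


# Made-up draft used only to exercise the check.
draft = {
    "timeline": ["03:02 alert fired", "03:47 fix deployed"],
    "root_cause": "Connection pool exhausted",
    "impact": "All payments failed for 45 minutes",
    "action_items": [],
}
print(missing_sections(draft))
# ['what_went_well', 'what_went_wrong', 'action_items']
```

An empty `action_items` list counts as missing here, since a post-mortem without follow-up work is the failure this pillar warns about.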
3. Communication During Incidents
Silence during an incident is worse than bad news. Stakeholders need updates, even if the update is 'still investigating.'
Communicate: what's happening, who's affected, what we're doing, when the next update is. Cadence: every 15 minutes for P0, every 30 minutes for P1.
| Audience | Channel | Cadence |
|---|---|---|
| Engineering | War room (Slack/Zoom) | Real-time |
| Leadership | Email / Slack summary | Every 15 min (P0), every 30 min (P1) |
| Customers | Status page | Every 30 min |
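The cadence rule above is easy to encode so that a bot or checklist can flag a late update. This is a minimal sketch. The intervals come from the text (15 minutes for P0, 30 minutes for P1), while the function names and sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta

# Intervals from the text: every 15 minutes for P0, every 30 minutes for P1.
UPDATE_INTERVAL = {
    "P0": timedelta(minutes=15),
    "P1": timedelta(minutes=30),
}


def next_update_due(severity: str, last_update: datetime) -> datetime:
    """When the next stakeholder update should go out."""
    return last_update + UPDATE_INTERVAL[severity]


def is_overdue(severity: str, last_update: datetime, now: datetime) -> bool:
    """True if the promised update time has already passed."""
    return now > next_update_due(severity, last_update)


last = datetime(2024, 1, 1, 3, 0)
print(next_update_due("P0", last))                          # 2024-01-01 03:15:00
print(is_overdue("P0", last, datetime(2024, 1, 1, 3, 20)))  # True
print(is_overdue("P1", last, datetime(2024, 1, 1, 3, 20)))  # False
```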
4. Action Item Tracking
The post-mortem's value is zero if action items aren't tracked to completion.
Every action item is assigned to a person, has a deadline, is tracked in the backlog, and is verified as complete. Untracked action items lead to recurring incidents.
| Action Item Quality | Example | Outcome |
|---|---|---|
| Good | "Add connection pool monitoring by Sprint 23, assigned to @alice" | Tracked, completed, verified |
| Bad | "Improve monitoring" | Vague, unassigned, forgotten |
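The difference between the good and bad examples above can be checked mechanically. The sketch below is illustrative only: the `ActionItem` fields and the `problems` helper are hypothetical, and the "fewer than four words means too vague" rule is an assumed heuristic, not a standard.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None      # e.g. "@alice"
    deadline: Optional[str] = None   # e.g. "Sprint 23"
    verified: bool = False           # set once someone confirms the fix works


def problems(item: ActionItem) -> list[str]:
    """List the reasons an action item is likely to be forgotten."""
    issues = []
    if item.owner is None:
        issues.append("unassigned")
    if item.deadline is None:
        issues.append("no deadline")
    # Assumed heuristic: very short descriptions are usually too vague to act on.
    if len(item.description.split()) < 4:
        issues.append("too vague")
    return issues


good = ActionItem("Add connection pool monitoring", owner="@alice", deadline="Sprint 23")
bad = ActionItem("Improve monitoring")
print(problems(good))  # []
print(problems(bad))   # ['unassigned', 'no deadline', 'too vague']
```

The `verified` flag is kept separate from the checks on purpose: an item can be well specified and completed yet still not confirmed as effective.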
5. Incident Metrics
If you don't measure incident response, you can't improve it.
Track: Mean Time to Detect (MTTD), Mean Time to Resolve (MTTR), Mean Time Between Failures (MTBF), recurrence rate, and incident frequency by service.
| Metric | Measures | Target |
|---|---|---|
| MTTD | Time from failure to detection | < 5 minutes |
| MTTR | Time from detection to resolution | < 30 minutes (P0) |
| MTBF | Time between incidents | Increasing trend |
| Recurrence Rate | Same root cause appearing again | 0% (action items working) |
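These metrics can be derived from three timestamps per incident. The sketch below follows the definitions in the table: MTTD runs from failure to detection and MTTR from detection to resolution. It assumes MTBF is measured from one incident's resolution to the next incident's failure, which is one common convention. The record layout, function name, and sample data are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean


def _minutes(delta: timedelta) -> float:
    return delta.total_seconds() / 60


def incident_metrics(incidents: list[dict]) -> dict:
    """Compute MTTD, MTTR, and MTBF in minutes from incident timestamps."""
    ordered = sorted(incidents, key=lambda i: i["failed_at"])
    mttd = mean([_minutes(i["detected_at"] - i["failed_at"]) for i in ordered])
    mttr = mean([_minutes(i["resolved_at"] - i["detected_at"]) for i in ordered])
    # Assumption: gap = previous resolution to next failure.
    gaps = [
        _minutes(later["failed_at"] - earlier["resolved_at"])
        for earlier, later in zip(ordered, ordered[1:])
    ]
    mtbf = mean(gaps) if gaps else None  # undefined with fewer than two incidents
    return {"mttd_min": mttd, "mttr_min": mttr, "mtbf_min": mtbf}


# Made-up sample data for illustration.
history = [
    {"failed_at": datetime(2024, 1, 1, 10, 0),
     "detected_at": datetime(2024, 1, 1, 10, 4),
     "resolved_at": datetime(2024, 1, 1, 10, 30)},
    {"failed_at": datetime(2024, 1, 1, 14, 30),
     "detected_at": datetime(2024, 1, 1, 14, 32),
     "resolved_at": datetime(2024, 1, 1, 14, 52)},
]
print(incident_metrics(history))
# {'mttd_min': 3.0, 'mttr_min': 23.0, 'mtbf_min': 240.0}
```

In this sample both MTTD (3 minutes) and MTTR (23 minutes) meet the targets in the table.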
The Complete Mapping
| # | Pillar | What It Answers | Key Decision |
|---|---|---|---|
| ① | Coordination | Who does what during an incident? | Roles, war room, status cadence |
| ② | Post-Mortem | What happened and why? | Blameless, timeline, root cause |
| ③ | Communication | Who needs to know, and when? | Audience, channel, cadence |
| ④ | Action Tracking | Will we actually fix it? | Assigned, deadlined, verified |
| ⑤ | Metrics | Are we getting better? | MTTD, MTTR, MTBF, recurrence |
Try It Yourself — A Starter Prompt for Incident Management
You are an Incident Manager with experience coordinating P0/P1 incidents.
I need an incident management plan for:
{{PASTE YOUR SYSTEM AND TEAM CONTEXT}}
Cover these 5 areas:
1. COORDINATION — Define roles, war room setup, and decision-making structure during incidents.
2. POST-MORTEM — Design a blameless post-mortem template with required sections.
3. COMMUNICATION — Define communication cadence per severity level and per audience.
4. ACTION TRACKING — How will action items be tracked, assigned, and verified?
5. METRICS — Define the incident metrics to track and improvement targets.
For each area, provide: the plan and a brief justification.
What This Prompt Covers vs. What It Misses
| Skill | Lite Prompt (Free) | Full Prompt (Course) | Impact of Missing It |
|---|---|---|---|
| Coordination structure | ✅ Covered | ✅ Covered | — |
| Post-mortem template | ✅ Covered | ✅ Covered | — |
| Pre-written communication templates | ❌ Missing | ✅ "Status update: we are aware of [X], impact is [Y], next update at [Z]" | 15-minute update cadence but each update takes 10 minutes to draft. Communication becomes the bottleneck. |
| Incident severity auto-classification | ❌ Missing | ✅ AI agent classifies severity from alert data | Human triages severity manually and disagrees with L1 support. 10 minutes spent debating severity instead of fixing. |
| Post-mortem facilitation guide | ❌ Missing | ✅ Minute-by-minute facilitation of the post-mortem meeting | Post-mortem devolves into blame. Team stops sharing honestly. |
| Cross-incident trend analysis | ❌ Missing | ✅ "These 3 incidents share the same root cause pattern" | Same root cause, three separate post-mortems, three separate action items. Pattern not detected. |
The Lite Prompt gets you to ~60% quality. Good enough to coordinate. Not good enough to drive systematic incident prevention.
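To make the pre-written communication template from the table concrete, here is a minimal sketch that fills in the three placeholders. The template wording comes from the table. The function name, parameter names, the UTC assumption, and the sample values are hypothetical.

```python
from datetime import datetime, timedelta

# Wording taken from the table above, with [X], [Y], [Z] turned into named fields.
TEMPLATE = (
    "Status update: we are aware of {issue}, "
    "impact is {impact}, next update at {next_update}"
)


def status_update(issue: str, impact: str, now: datetime, interval_minutes: int) -> str:
    """Render a status update and commit to the next update time."""
    next_update = now + timedelta(minutes=interval_minutes)
    # Assumption: timestamps are already in UTC.
    return TEMPLATE.format(
        issue=issue,
        impact=impact,
        next_update=next_update.strftime("%H:%M UTC"),
    )


message = status_update(
    issue="payment processing failures",
    impact="all users unable to pay",
    now=datetime(2024, 1, 1, 3, 0),
    interval_minutes=15,
)
print(message)
# Status update: we are aware of payment processing failures, impact is all users unable to pay, next update at 03:15 UTC
```

With a template like this, drafting an update is a matter of filling in two fields rather than writing from scratch during the incident.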
Real-World Example: Managing a Payment Outage
The Requirement
"Manage a P0 incident: payment processing is down for all users. Duration: 45 minutes so far. Revenue impact: $50K/hour."
Lite Prompt Output
① Coordination: Declare P0, assign tech lead (Senior Engineer), set up Slack channel, 15-min updates.
② Post-Mortem: Timeline, root cause, impact, action items. Schedule within 48 hours.
③ Communication: Engineering — real-time in Slack. Leadership — every 15 min. Customers — status page.
④ Actions: Track in Jira, assign owners, deadline within 2 sprints.
⑤ Metrics: MTTD, MTTR, track monthly trend.
What a VP of Engineering Would Catch
| Area | Lite Says | What's Missing | Consequence |
|---|---|---|---|
| Coordination | "Assign tech lead" | No backup plan. What if the Senior Engineer is unavailable? | 3 a.m. Senior Engineer doesn't answer phone. 20 minutes finding a backup. Revenue: roughly $17K lost in those 20 minutes at $50K/hour. |
| Post-Mortem | "Schedule within 48 hours" | No pre-work. Attendees arrive unprepared. | Post-mortem becomes a 2-hour timeline reconstruction that should've been done beforehand. |
| Communication | "Status page" | No customer communication template. What exactly goes on the status page? | Status page says "investigating." No ETA, no impact scope, no workaround. Customers tweet frustration. PR crisis. |
| Actions | "Track in Jira" | No verification process. Who confirms the action was effective? | Action item completed: "Add monitoring." Monitoring added but alert threshold set too high. Same incident recurs. |
| Metrics | "MTTD, MTTR" | No business impact metric. MTTR was 45 min — but what was the revenue impact? | Engineering says "45 min MTTR — good." CFO says "$37.5K lost — unacceptable." Misaligned measurement. |
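The dollar figures in the table follow directly from the $50K/hour impact stated in the requirement. As a quick worked check (the function name is hypothetical):

```python
def revenue_lost(duration_minutes: float, revenue_per_hour: float) -> float:
    """Revenue lost during an outage, assuming a constant hourly rate."""
    return revenue_per_hour * duration_minutes / 60


print(revenue_lost(45, 50_000))          # 37500.0 (the 45-minute outage)
print(round(revenue_lost(20, 50_000)))   # 16667 (the 20-minute backup search)
```

Reporting this number next to MTTR gives engineering and finance the same view of the incident.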
Ready to Manage Incidents Like a Pro?
- ✅ The complete prompt with communication templates, facilitation guides, and AI severity classification
- ✅ An AI agent that drives RCA and tracks preventive measures
- ✅ Assessment + coding challenges to verify you can coordinate, not just describe
Go from "I understand incident management" to "I can coordinate a P0 and ensure it never recurs."