Be the First Responder — L1 Production Support

By the end of this page, you will understand how L1 Support triages production alerts, follows runbooks, and escalates effectively — and how AI agents can automate first-response workflows.

Production Support (First Response) — The 2-Minute Overview

[Cartoon: Chapter 16, "Have You Tried Refreshing?"]

Think about the last time you called a customer support hotline. The first person you spoke to didn't redesign the product — they triaged your issue, followed a script, and either resolved it or escalated to a specialist. That first responder is L1 Support. In production systems, L1 is the first human to see an alert, follow a runbook, and decide: "Can I fix this, or does this need L2?"

graph LR
    subgraph INPUT["Alert Inputs"]
        I1["Monitoring Alerts"]
        I2["User Reports"]
        I3["Automated Health Checks"]
    end
    subgraph L1["L1 Support"]
        L1A["Triage: Severity & Impact"]
        L1B["Runbook Execution"]
        L1C["Escalation Decision"]
    end
    subgraph OUTPUT["L1 Outputs"]
        O1["Resolved Incident"]
        O2["Escalation to L2"]
        O3["Incident Log"]
    end
    I1 --> L1A
    I2 --> L1A
    I3 --> L1A
    L1A --> L1B
    L1B -->|"Resolved"| O1
    L1B -->|"Can't resolve"| L1C
    L1C --> O2
    L1A --> O3
    style INPUT fill:#16213e,stroke:#0f3460,color:#fff
    style L1 fill:#1a1a2e,stroke:#e94560,color:#fff
    style OUTPUT fill:#006400,stroke:#00cc00,color:#fff

You Already Know L1 Support — You Just Don't Know It Yet

You've been doing L1 support every time you helped a family member with a tech problem.

📱 The Family Tech Support Analogy

Step 1 — Triage: "What's wrong?" → "My phone is slow." → "How slow? All apps or one app?"

🔗 L1 Layer: ① TRIAGE — Determine severity and scope. Is it one user or all users? One service or the whole system?

Step 2 — Runbook: "Have you tried restarting it? Is storage full? Is there an update available?"

🔗 L1 Layer: ② RUNBOOK EXECUTION — Follow known resolution steps. Restart service? Clear cache? Check config?

Step 3 — Escalate: "I can't figure this out. Let me call your phone company."

🔗 L1 Layer: ③ ESCALATION — If the runbook doesn't resolve it, escalate with full context to L2.

The Complete Mapping

| Family Tech Support | L1 Production Support | Phase |
|---|---|---|
| "What's wrong? How severe?" | Triage: severity, scope, impact | ① Triage |
| "Restart? Clear storage? Update?" | Follow runbook: restart, clear cache, check config | ② Runbook |
| "I'll call the phone company" | Escalate to L2 with context | ③ Escalate |

You just learned L1 Support without touching a terminal.


The 4 Pillars of L1 Support

1. Alert Triage

Not all alerts are equal. Triage determines what to fix now, what can wait, and what's noise.

Categorize by severity (P0-P3), scope (one user vs. all users), and blast radius (one service vs. cascading). The first 2 minutes of triage determine the next 2 hours.

| Severity | Criteria | Response |
|---|---|---|
| P0 | System down, data loss risk | Immediate, all hands |
| P1 | Major feature broken | Within 15 minutes |
| P2 | Minor degradation | Within 1 hour |
| P3 | Cosmetic / informational | Next business day |
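
The severity table above can be sketched as a small classifier. This is a minimal illustration, not a production triage engine: the `Alert` fields and the `classify` function are hypothetical names chosen to mirror the table's criteria column, and real alerts would need scope and blast radius as inputs too.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum
from typing import Dict, Optional


class Severity(Enum):
    P0 = "P0"
    P1 = "P1"
    P2 = "P2"
    P3 = "P3"


# Response windows taken from the table. "Immediate" is modelled as zero;
# "next business day" has no fixed duration, so it maps to None.
RESPONSE_WINDOW: Dict[Severity, Optional[timedelta]] = {
    Severity.P0: timedelta(0),
    Severity.P1: timedelta(minutes=15),
    Severity.P2: timedelta(hours=1),
    Severity.P3: None,
}


@dataclass
class Alert:
    # Hypothetical flags, one per criterion in the severity table.
    name: str
    system_down: bool = False
    data_loss_risk: bool = False
    major_feature_broken: bool = False
    degraded: bool = False


def classify(alert: Alert) -> Severity:
    """Return the highest severity whose criteria the alert meets."""
    if alert.system_down or alert.data_loss_risk:
        return Severity.P0
    if alert.major_feature_broken:
        return Severity.P1
    if alert.degraded:
        return Severity.P2
    return Severity.P3  # cosmetic or informational
```

Checking the most severe condition first means an alert that matches several criteria is never under-graded.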

2. Runbook Execution

A runbook turns a 3am panic into a 3am checklist.

Runbooks are step-by-step instructions for known failure modes. "If alert X fires: check Y, run Z, verify W." Good runbooks are specific and testable. Bad runbooks say "investigate" — that's not a step.

| Runbook Quality | Example | Outcome |
|---|---|---|
| Good | "Run kubectl get pods -n payments. If CrashLoopBackOff, run kubectl delete pod " | L1 resolves in 5 minutes |
| Bad | "Check the payments service and investigate" | L1 spends 30 minutes figuring out what "investigate" means |
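
One way to keep runbooks "specific and testable" is to model each step as a described action that reports success or failure. The sketch below assumes that shape; `Step`, `run_runbook`, and the lambda stand-ins are hypothetical, and a real step would wrap an actual check or command rather than return a fixed value.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    description: str             # an action the responder can carry out
    action: Callable[[], bool]   # returns True when the step succeeded


def run_runbook(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order and stop at the first failure.

    Returns (resolved, log), where log records each executed step and its outcome.
    """
    log: List[str] = []
    for number, step in enumerate(steps, start=1):
        ok = step.action()
        log.append(f"{number}. {step.description}: {'ok' if ok else 'FAILED'}")
        if not ok:
            return False, log
    return True, log


# Hypothetical stand-ins for the "check Y, run Z, verify W" pattern.
payments_runbook = [
    Step("Check pod status in the payments namespace", lambda: True),
    Step("Restart the crashing pod", lambda: True),
    Step("Verify the service responds to a health check", lambda: False),
]

resolved, log = run_runbook(payments_runbook)
print(resolved)
print("\n".join(log))
```

Because every step is logged as it runs, the same list doubles as the "what you tried" section of an escalation.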

3. Escalation Protocol

Escalation is not failure — it's routing the problem to the right expert.

Escalate when: the runbook doesn't resolve within 15 minutes, the issue is outside L1's scope (code bug, infrastructure failure), or severity is P0 (immediate escalation regardless). Always escalate with: what you tried, what you observed, and what you suspect.

| Escalation Info | What to Include | Why |
|---|---|---|
| What you tried | Steps executed from runbook | L2 doesn't repeat your work |
| What you observed | Logs, metrics, error messages | L2 starts from data, not zero |
| What you suspect | Your hypothesis (even if wrong) | L2 has a starting point |
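
The escalation rules and the three-part handoff can be sketched as follows. This is an illustrative reading of the protocol, not a fixed policy: `should_escalate` and `EscalationNote` are hypothetical names, and treating "within 15 minutes" as "escalate once 15 minutes have elapsed" is an assumption.

```python
from dataclasses import dataclass
from typing import List

ESCALATION_LIMIT_MINUTES = 15  # the 15-minute rule from the protocol


def should_escalate(severity: str, minutes_elapsed: float, in_l1_scope: bool) -> bool:
    """Apply the three escalation triggers in order of urgency."""
    if severity == "P0":
        return True   # P0 escalates immediately, regardless of progress
    if not in_l1_scope:
        return True   # code bugs and infrastructure failures go to L2
    return minutes_elapsed >= ESCALATION_LIMIT_MINUTES


@dataclass
class EscalationNote:
    tried: List[str]      # steps executed from the runbook
    observed: List[str]   # logs, metrics, error messages
    suspected: str        # the responder's hypothesis, even if wrong

    def render(self) -> str:
        lines = ["What I tried:"]
        lines += [f"  - {item}" for item in self.tried]
        lines.append("What I observed:")
        lines += [f"  - {item}" for item in self.observed]
        lines.append(f"What I suspect: {self.suspected}")
        return "\n".join(lines)
```

A note built this way cannot be sent without all three fields, which is the point of the protocol: L2 never receives a bare "it's broken".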

4. Incident Logging

If it's not logged, it didn't happen. Incident logs are the memory of your system's reliability.

Every alert, every action, every escalation — logged. This creates patterns: "This alert fires every Monday at 3am. Why?" Incident logs feed into postmortems and retrospectives.

| Log Entry | Content | Purpose |
|---|---|---|
| Alert Details | What fired, when, severity | Timeline reconstruction |
| Actions Taken | Steps from runbook | Audit trail |
| Resolution/Escalation | How it was resolved or who it was escalated to | Tracking and accountability |
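
A log only reveals patterns such as "every Monday at 3am" if entries share a structure. The sketch below assumes one possible shape: `IncidentLogEntry` mirrors the three rows of the table, and `recurring_slots` is a hypothetical helper that groups alerts by name, weekday, and hour.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from typing import List, Tuple


@dataclass
class IncidentLogEntry:
    alert_name: str           # alert details: what fired
    fired_at: datetime        # alert details: when
    severity: str             # alert details: severity
    actions_taken: List[str]  # steps from the runbook
    outcome: str              # how it was resolved, or who it was escalated to


def recurring_slots(
    entries: List[IncidentLogEntry], min_count: int = 2
) -> List[Tuple[Tuple[str, int, int], int]]:
    """Count alerts per (name, weekday, hour) and keep slots seen at least min_count times.

    Weekday follows datetime.weekday(): Monday is 0, Sunday is 6.
    """
    counts = Counter(
        (e.alert_name, e.fired_at.weekday(), e.fired_at.hour) for e in entries
    )
    return [(slot, n) for slot, n in counts.items() if n >= min_count]
```

With two entries for the same alert on consecutive Mondays in the 03:00 hour, the helper reports that slot with a count of 2, which is the cue to ask "why?" in the next retrospective.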

The Complete Mapping

| # | Pillar | What It Answers | Key Decision |
|---|---|---|---|
| 1 | Triage | How bad is it? | Severity + scope + blast radius |
| 2 | Runbook | How do I fix it? | Step-by-step known resolution |
| 3 | Escalation | When do I hand it off? | 15-min rule + context transfer |
| 4 | Logging | What happened? | Every action documented |


Try It Yourself — A Starter Prompt for L1 Support

You are an L1 Production Support engineer.

I need an incident response plan for:

{{PASTE YOUR SYSTEM DESCRIPTION AND COMMON ALERT TYPES}}

Cover these 4 areas:

1. TRIAGE FRAMEWORK — Define severity levels with criteria and response times.
2. RUNBOOKS — Write runbooks for the 3 most common alerts. Each must have specific, executable steps.
3. ESCALATION PROTOCOL — Define when to escalate and what context to include.
4. INCIDENT LOG TEMPLATE — Design a log template for tracking every alert and action.

For each area, provide: the plan and a brief justification.

What This Prompt Covers vs. What It Misses

| Skill | Lite Prompt (Free) | Full Prompt (Course) | Impact of Missing It |
|---|---|---|---|
| Severity framework | ✅ Covered | ✅ Covered | |
| Runbook structure | ✅ Covered | ✅ Covered | |
| Automated triage (AI-assisted alert classification) | ❌ Missing | ✅ Agent classifies alerts by pattern | L1 manually reads every alert: slow triage at scale |
| Runbook validation (test the runbook before production) | ❌ Missing | ✅ Runbook testing in staging | Runbook says "restart pod" but the command is wrong. Discovered at 3am. |
| Communication templates (stakeholder updates) | ❌ Missing | ✅ Status page update templates | Incident ongoing. No status update. Stakeholders escalate independently. Chaos. |

The Lite Prompt gets you to ~60% quality. Good enough to respond to alerts. Not good enough to resolve them consistently.


Real-World Example: L1 Support for an E-Commerce Platform

The Requirement

"Design L1 support for an e-commerce platform: order processing, payment gateway, and inventory service. Handle: slow checkout, payment failures, and inventory sync errors."

Lite Prompt Output

① Triage: Payment failure = P0. Slow checkout = P1. Inventory sync = P2.

② Runbooks: Payment failure: check Stripe status page, restart payment service. Slow checkout: check DB connections. Inventory sync: re-run sync job.

③ Escalation: If not resolved in 15 min, escalate to L2 with logs.

④ Log: Alert, time, severity, actions, resolution.


What an L1 Manager Would Catch

| Area | Lite Says | What's Missing | Consequence |
|---|---|---|---|
| Triage | "Payment failure = P0" | What if 1 user's payment fails vs. all payments fail? Same severity? | 1 user's card decline marked P0. All-hands-on-deck for a typo. Severity inflation. |
| Runbook | "Check Stripe status page" | What if Stripe is up but our integration is broken? Next step? | Stripe status: green. L1 closes ticket as "resolved." Payments still failing. Users churning. |
| Escalation | "Escalate with logs" | Which logs? From which service? How to access them? | L1 escalates: "Payments are broken. Here's the alert." L2: "Where are the logs?" 20 min lost. |
| Log | "Actions, resolution" | No customer impact tracking. How many users affected? | Postmortem: "How many customers were impacted?" L1: "We didn't track that." |
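
The triage gap in the first row comes down to grading by scope rather than by alert type alone. A minimal sketch of that idea follows; the function name and both thresholds (50% and 5%) are hypothetical values chosen for illustration, and a real platform would tune them against its own traffic.

```python
from typing import Optional


def payment_failure_severity(failed: int, attempted: int) -> Optional[str]:
    """Grade payment failures by the share of attempts affected.

    Returns None when nothing failed. Thresholds are illustrative assumptions.
    """
    if attempted <= 0:
        raise ValueError("attempted must be positive")
    failure_rate = failed / attempted
    if failure_rate >= 0.5:    # assumed: most payments failing, treat as an outage
        return "P0"
    if failure_rate >= 0.05:   # assumed: a noticeable share of payments failing
        return "P1"
    if failed > 0:
        return "P3"            # isolated failures, such as a single card decline
    return None
```

Recording `failed` and `attempted` alongside the severity also answers the postmortem question in the last row: how many customers were impacted.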


Ready to Be the First Responder?

Enroll in the Fresh Graduate AI SDLC Course →

Go from "I understand support" to "I can resolve production incidents at 3am."