Be the First Responder — L1 Production Support

By the end of this page, you will understand how L1 Support triages production alerts, follows runbooks, and escalates effectively — and how AI agents can automate first-response workflows.

Production Support (First Response) — The 2-Minute Overview

[Figure: Chapter 16 cartoon, "Have You Tried Refreshing?"]

Think about the last time you called a customer support hotline. The first person you spoke to didn't redesign the product — they triaged your issue, followed a script, and either resolved it or escalated to a specialist. That first responder is L1 Support. In production systems, L1 is the first human to see an alert, follow a runbook, and decide: "Can I fix this, or does this need L2?"

graph LR
    subgraph INPUT["Alert Inputs"]
        I1["Monitoring Alerts"]
        I2["User Reports"]
        I3["Automated Health Checks"]
    end
    subgraph L1["L1 Support"]
        L1A["Triage: Severity & Impact"]
        L1B["Runbook Execution"]
        L1C["Escalation Decision"]
    end
    subgraph OUTPUT["L1 Outputs"]
        O1["Resolved Incident"]
        O2["Escalation to L2"]
        O3["Incident Log"]
    end
    I1 --> L1A
    I2 --> L1A
    I3 --> L1A
    L1A --> L1B
    L1B -->|"Resolved"| O1
    L1B -->|"Can't resolve"| L1C
    L1C --> O2
    L1A --> O3
    style INPUT fill:#16213e,stroke:#0f3460,color:#fff
    style L1 fill:#1a1a2e,stroke:#e94560,color:#fff
    style OUTPUT fill:#006400,stroke:#00cc00,color:#fff

You Already Know L1 Support — You Just Don't Know It Yet

You've been doing L1 support every time you helped a family member with a tech problem.

📱 The Family Tech Support Analogy

Step 1 — Triage: "What's wrong?" → "My phone is slow." → "How slow? All apps or one app?"

🔗 L1 Layer: ① TRIAGE — Determine severity and scope. Is it one user or all users? One service or the whole system?

Step 2 — Runbook: "Have you tried restarting it? Is storage full? Is there an update available?"

🔗 L1 Layer: ② RUNBOOK EXECUTION — Follow known resolution steps. Restart service? Clear cache? Check config?

Step 3 — Escalate: "I can't figure this out. Let me call your phone company."

🔗 L1 Layer: ③ ESCALATION — If the runbook doesn't resolve it, escalate with full context to L2.

The Complete Mapping

Family Tech Support	L1 Production Support	Phase
"What's wrong? How severe?"	Triage: severity, scope, impact	① Triage
"Restart? Clear storage? Update?"	Follow runbook: restart, clear cache, check config	② Runbook
"I'll call the phone company"	Escalate to L2 with context	③ Escalate

You just learned L1 Support without touching a terminal.

The 4 Pillars of L1 Support

1. Alert Triage

Not all alerts are equal. Triage determines what to fix now, what can wait, and what's noise.

Categorize each alert by severity (P0 to P3), scope (one user vs. all users), and blast radius (one service vs. cascading). The first 2 minutes of triage often determine the next 2 hours.

Severity	Criteria	Response
P0	System down, data loss risk	Immediate — all hands
P1	Major feature broken	Within 15 minutes
P2	Minor degradation	Within 1 hour
P3	Cosmetic / informational	Next business day
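
The severity table above can be sketched as a small classifier. This is a minimal illustration, not a standard: the `Alert` fields (`system_down`, `data_loss_risk`, and so on) are hypothetical flags invented for this example, and "immediate" is modeled as a zero delay. "Next business day" has no fixed duration, so it is left as `None`.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Severity(Enum):
    P0 = "P0"
    P1 = "P1"
    P2 = "P2"
    P3 = "P3"


# Response targets taken from the severity table.
# P0 ("immediate") is modeled as zero delay; P3 ("next business day") as None.
RESPONSE_TARGET = {
    Severity.P0: timedelta(0),
    Severity.P1: timedelta(minutes=15),
    Severity.P2: timedelta(hours=1),
    Severity.P3: None,
}


@dataclass
class Alert:
    # Hypothetical fields; a real alert payload will look different.
    name: str
    system_down: bool = False
    data_loss_risk: bool = False
    major_feature_broken: bool = False
    degraded: bool = False


def classify(alert: Alert) -> Severity:
    """Return the highest severity whose criteria the alert meets."""
    if alert.system_down or alert.data_loss_risk:
        return Severity.P0
    if alert.major_feature_broken:
        return Severity.P1
    if alert.degraded:
        return Severity.P2
    return Severity.P3


def respond_by(alert: Alert, fired_at: datetime) -> Optional[datetime]:
    """Return the response deadline, or None when there is no fixed target."""
    target = RESPONSE_TARGET[classify(alert)]
    return None if target is None else fired_at + target
```

For example, `classify(Alert(name="checkout-latency", degraded=True))` returns `Severity.P2`, and `respond_by` puts its deadline one hour after the alert fired.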

2. Runbook Execution

A runbook turns a 3am panic into a 3am checklist.

Runbooks are step-by-step instructions for known failure modes. "If alert X fires: check Y, run Z, verify W." Good runbooks are specific and testable. Bad runbooks say "investigate" — that's not a step.

Runbook Quality	Example	Outcome
Good	"Run `kubectl get pods -n payments`. If a pod is in CrashLoopBackOff, delete it with `kubectl delete pod`."	L1 resolves in 5 minutes
Bad	"Check the payments service and investigate"	L1 spends 30 minutes figuring out what "investigate" means
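
One way to keep a runbook specific and testable is to write it as data: an ordered list of steps plus an explicit verification. The sketch below assumes a hypothetical `Runbook` structure and uses stand-in functions in place of real `kubectl` calls, so it runs without a cluster. In practice each action would wrap the actual command.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    description: str            # what the step does, in plain words
    action: Callable[[], str]   # runs the step and returns its output


@dataclass
class Runbook:
    alert_name: str
    steps: List[Step]
    verify: Callable[[], bool]  # True once the issue is confirmed fixed


def execute(runbook: Runbook) -> Tuple[bool, List[str]]:
    """Run every step in order, then verify. Returns (resolved, log)."""
    log: List[str] = []
    for number, step in enumerate(runbook.steps, start=1):
        output = step.action()
        log.append(f"{number}. {step.description}: {output}")
    resolved = runbook.verify()
    log.append("verify: resolved" if resolved else "verify: not resolved, escalate")
    return resolved, log


# Stand-in state so the sketch runs without a cluster (an assumption for this example).
pod_state = {"status": "CrashLoopBackOff"}


def check_pods() -> str:
    return pod_state["status"]


def delete_pod() -> str:
    pod_state["status"] = "Running"  # pretend the replacement pod comes up healthy
    return "pod deleted"


payments_runbook = Runbook(
    alert_name="payments-pod-crashloop",  # hypothetical alert name
    steps=[
        Step("Check pod status", check_pods),
        Step("Delete crashing pod", delete_pod),
    ],
    verify=lambda: pod_state["status"] == "Running",
)

resolved, log = execute(payments_runbook)
```

Because every step and the final check are explicit, the same structure can be exercised in staging before anyone depends on it at 3am, and the returned `log` doubles as the "what you tried" part of an escalation.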

3. Escalation Protocol

Escalation is not failure — it's routing the problem to the right expert.

Escalate when any of these holds: the runbook doesn't resolve the issue within 15 minutes, the issue is outside L1's scope (a code bug or an infrastructure failure), or the severity is P0 (escalate immediately, regardless of the other two). Always escalate with three things: what you tried, what you observed, and what you suspect.

Escalation Info	What to Include	Why
What you tried	Steps executed from runbook	L2 doesn't repeat your work
What you observed	Logs, metrics, error messages	L2 starts from data, not zero
What you suspect	Your hypothesis (even if wrong)	L2 has a starting point
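
The escalation rules and the three-part handoff can be sketched as follows. The function and class names (`should_escalate`, `EscalationNote`) are made up for this illustration. The 15-minute limit and the three required fields come from the text above.

```python
from dataclasses import dataclass, field
from datetime import timedelta
from typing import List

RUNBOOK_TIME_LIMIT = timedelta(minutes=15)  # the 15-minute rule


def should_escalate(severity: str, elapsed: timedelta, in_l1_scope: bool) -> bool:
    """Decide whether a still-unresolved incident should go to L2."""
    if severity == "P0":
        return True   # P0 escalates immediately, regardless
    if not in_l1_scope:
        return True   # code bugs and infrastructure failures are not L1's to fix
    return elapsed >= RUNBOOK_TIME_LIMIT


@dataclass
class EscalationNote:
    alert: str
    tried: List[str] = field(default_factory=list)     # runbook steps executed
    observed: List[str] = field(default_factory=list)  # logs, metrics, error messages
    suspected: str = ""                                # hypothesis, even if wrong

    def missing_fields(self) -> List[str]:
        """List the required parts that are still empty."""
        missing = []
        if not self.tried:
            missing.append("tried")
        if not self.observed:
            missing.append("observed")
        if not self.suspected.strip():
            missing.append("suspected")
        return missing

    def render(self) -> str:
        """Format the note as plain text for the L2 handoff."""
        lines = [f"Alert: {self.alert}", "What I tried:"]
        lines += [f"  - {item}" for item in self.tried]
        lines.append("What I observed:")
        lines += [f"  - {item}" for item in self.observed]
        lines.append(f"What I suspect: {self.suspected}")
        return "\n".join(lines)
```

Checking `missing_fields()` before sending is a cheap guard against the "Here's the alert, good luck" style of handoff.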

4. Incident Logging

If it's not logged, it didn't happen. Incident logs are the memory of your system's reliability.

Log every alert, every action, and every escalation. Over time the log reveals patterns: "This alert fires every Monday at 3am. Why?" Incident logs also feed into postmortems and retrospectives.

Log Entry	Content	Purpose
Alert Details	What fired, when, severity	Timeline reconstruction
Actions Taken	Steps from runbook	Audit trail
Resolution/Escalation	How it was resolved or who it was escalated to	Tracking and accountability
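
A log entry with the three parts from the table, plus a small helper that surfaces "fires every Monday at 3am" style patterns, might look like this. The field names and the `recurring_slots` helper are illustrative assumptions, not a prescribed schema.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from typing import List, Tuple

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


@dataclass
class IncidentLogEntry:
    alert_name: str            # what fired
    fired_at: datetime         # when it fired
    severity: str              # P0 to P3
    actions_taken: List[str]   # steps from the runbook
    outcome: str               # e.g. "resolved" or "escalated to L2"


def recurring_slots(
    entries: List[IncidentLogEntry], min_count: int = 2
) -> List[Tuple[str, str, int, int]]:
    """Return (alert_name, weekday, hour, count) for alerts that repeat
    in the same weekday-and-hour slot at least min_count times."""
    counts = Counter(
        (e.alert_name, DAYS[e.fired_at.weekday()], e.fired_at.hour) for e in entries
    )
    return [
        (name, day, hour, count)
        for (name, day, hour), count in counts.items()
        if count >= min_count
    ]
```

Grouping by weekday and hour is deliberately crude. It is enough to prompt the "Why?" question, which is the point of keeping the log.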

The Complete Mapping

#	Pillar	What It Answers	Key Decision
①	Triage	How bad is it?	Severity + scope + blast radius
②	Runbook	How do I fix it?	Step-by-step known resolution
③	Escalation	When do I hand it off?	15-min rule + context transfer
④	Logging	What happened?	Every action documented

Try It Yourself — A Starter Prompt for L1 Support

You are an L1 Production Support engineer.

I need an incident response plan for:

{{PASTE YOUR SYSTEM DESCRIPTION AND COMMON ALERT TYPES}}

Cover these 4 areas:

1. TRIAGE FRAMEWORK — Define severity levels with criteria and response times.
2. RUNBOOKS — Write runbooks for the 3 most common alerts. Each must have specific, executable steps.
3. ESCALATION PROTOCOL — Define when to escalate and what context to include.
4. INCIDENT LOG TEMPLATE — Design a log template for tracking every alert and action.

For each area, provide: the plan and a brief justification.

What This Prompt Covers vs. What It Misses

Skill	Lite Prompt (Free)	Full Prompt (Course)	Impact of Missing It
Severity framework	✅ Covered	✅ Covered	—
Runbook structure	✅ Covered	✅ Covered	—
Automated triage (AI-assisted alert classification)	❌ Missing	✅ Agent classifies alerts by pattern	L1 manually reads every alert — slow triage at scale
Runbook validation (test the runbook before production)	❌ Missing	✅ Runbook testing in staging	Runbook says "restart pod" but the command is wrong. Discovered at 3am.
Communication templates (stakeholder updates)	❌ Missing	✅ Status page update templates	Incident ongoing. No status update. Stakeholders escalate independently. Chaos.

The Lite Prompt gets you roughly 60% of the way there: good enough to respond to alerts, not good enough to resolve them consistently.

Real-World Example: L1 Support for an E-Commerce Platform

The Requirement

"Design L1 support for an e-commerce platform: order processing, payment gateway, and inventory service. Handle: slow checkout, payment failures, and inventory sync errors."

Lite Prompt Output

① Triage: Payment failure = P0. Slow checkout = P1. Inventory sync = P2.

② Runbooks: Payment failure: check Stripe status page, restart payment service. Slow checkout: check DB connections. Inventory sync: re-run sync job.

③ Escalation: If not resolved in 15 min, escalate to L2 with logs.

④ Log: Alert, time, severity, actions, resolution.

What an L1 Manager Would Catch

Area	Lite Says	What's Missing	Consequence
Triage	"Payment failure = P0"	What if 1 user's payment fails vs. all payments fail? Same severity?	One user's card decline gets marked P0: all hands on deck for a single declined card. That is severity inflation.
Runbook	"Check Stripe status page"	What if Stripe is up but our integration is broken? Next step?	Stripe status: green. L1 closes ticket as "resolved." Payments still failing. Users churning.
Escalation	"Escalate with logs"	Which logs? From which service? How to access them?	L1 escalates: "Payments are broken. Here's the alert." L2: "Where are the logs?" — 20 min lost.
Log	"Actions, resolution"	No customer impact tracking. How many users affected?	Postmortem: "How many customers were impacted?" L1: "We didn't track that."
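
The triage and logging gaps in the table above both come down to scope: how many payments are failing, and for how many users. A minimal sketch of scope-aware severity follows. The thresholds (50%, 5%, 1%) are placeholders chosen for illustration, not recommendations, and the function names are hypothetical.

```python
from typing import List, Tuple

# (minimum failure rate, severity), checked from most to least severe.
# These cutoffs are assumptions for the example; tune them to your own traffic.
THRESHOLDS: List[Tuple[float, str]] = [
    (0.50, "P0"),
    (0.05, "P1"),
    (0.01, "P2"),
]


def payment_severity(attempts: int, failures: int) -> str:
    """Pick a severity from the share of payment attempts that failed."""
    if attempts <= 0:
        return "P3"  # no traffic in the window, nothing to rate
    rate = failures / attempts
    for min_rate, severity in THRESHOLDS:
        if rate >= min_rate:
            return severity
    return "P3"


def impact_summary(attempts: int, failures: int, affected_users: int) -> str:
    """One line for the incident log, so the postmortem can answer 'how many?'."""
    rate = failures / attempts if attempts > 0 else 0.0
    return (
        f"{failures}/{attempts} payments failed ({rate:.1%}), "
        f"{affected_users} users affected"
    )
```

With these placeholder thresholds, one failed payment out of 1,000 stays at P3, while 1,000 out of 1,000 is a P0. That is the distinction the Lite Prompt output misses.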

Ready to Be the First Responder?

✅ The complete prompt with AI-assisted triage, tested runbooks, and communication templates
✅ An AI agent that triages alerts and follows runbooks automatically
✅ Assessment + coding challenges to verify you can respond, not just describe response

Enroll in the Fresh Graduate AI SDLC Course →
Go from "I understand support" to "I can resolve production incidents at 3am."
