Maintain 99.9% Uptime — Site Reliability Engineering
By the end of this page, you will understand how SREs define SLOs/SLIs, set up monitoring with Prometheus and Grafana, and build self-healing infrastructure — and how AI agents can automate operations.
Operations & Monitoring — The 2-Minute Overview
Think about the last time you stayed in a hotel. You never saw the building management system that was monitoring room temperatures, water pressure, elevator status, and fire alarms 24/7. You just had hot water and working Wi-Fi. But somebody had to define "acceptable" (water between 38 and 42°C), set up monitoring (sensors in every room), and build automatic responses (pressure drops → backup pump activates). That building management system is SRE.
You Already Know SRE — You Just Don't Know It Yet
You've been an SRE every time you've maintained an aquarium.
🐠 The Aquarium Analogy
Step 1 — Define SLOs: Water temperature: 24-26°C. pH: 7.0-7.5. Oxygen: >6 mg/L.
🔗 SRE Layer: ① SLOs/SLIs — Define what "healthy" means for each service.
Step 2 — Set up monitoring: Thermometer (always visible), pH test kit (weekly), oxygen meter (daily).
🔗 SRE Layer: ② MONITORING — Prometheus collects metrics, Grafana displays dashboards, alerts fire when thresholds are breached.
Step 3 — Self-healing: Heater automatically maintains temperature. Filter runs continuously. Auto-feeder dispenses food.
🔗 SRE Layer: ③ SELF-HEALING — Auto-scaling on high CPU, auto-restart on container crash, circuit breakers on dependency failure.
The Complete Mapping
| Aquarium | SRE | Phase |
|---|---|---|
| Temperature: 24-26°C, pH: 7.0-7.5 | SLOs: p99 < 200ms, uptime > 99.9% | ① Define SLOs |
| Thermometer, pH kit, oxygen meter | Prometheus metrics, Grafana dashboards, PagerDuty alerts | ② Monitor |
| Heater auto-adjusts, filter runs continuously | Auto-scaling, auto-restart, circuit breakers | ③ Self-Heal |
You just learned SRE without opening a terminal.
The 5 Pillars of Site Reliability Engineering
1. SLOs and SLIs
SLOs are promises to users. SLIs measure if you're keeping them. Error budgets tell you how much room you have.
Define an SLO for each critical user journey. Measure it with an SLI. Track your error budget — the amount of allowed unreliability.
| Concept | What It Means | Example |
|---|---|---|
| SLO | Target reliability level | 99.9% of login requests succeed in < 500ms |
| SLI | Actual measurement | This week: 99.7% succeeded in < 500ms |
| Error Budget | Allowed failures before action required | 0.1% of a 30-day window ≈ 43 minutes of downtime allowed |
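To make the last row concrete, here is a minimal sketch of the error budget arithmetic in Python. The SLO target and the 12 minutes of measured downtime are illustrative assumptions, not values from any real system:

```python
# Error budget arithmetic for an availability SLO.
SLO_TARGET = 0.999   # 99.9% availability target (illustrative)
WINDOW_DAYS = 30     # measurement window

window_minutes = WINDOW_DAYS * 24 * 60               # 43,200 minutes in the window
budget_minutes = window_minutes * (1 - SLO_TARGET)   # ~43.2 minutes of allowed downtime

# Compare the budget against what you have actually burned.
downtime_so_far = 12.0                               # hypothetical measurement, in minutes
remaining = budget_minutes - downtime_so_far

print(f"Budget: {budget_minutes:.1f} min | Burned: {downtime_so_far:.1f} min "
      f"| Remaining: {remaining:.1f} min ({downtime_so_far / budget_minutes:.0%} used)")
```

When the remaining budget approaches zero, an error budget policy (covered later on this page) decides what happens next, e.g. freezing feature releases.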
2. Monitoring with Prometheus & Grafana
If you're not measuring it, you're guessing. If you're measuring the wrong thing, you're confidently wrong.
Prometheus scrapes metrics (request count, latency, error rate, resource utilization). Grafana visualizes them in dashboards. Together, they answer: "Is the system healthy right now?" and "Is it trending toward unhealthy?"
| Component | What It Does | Use Case |
|---|---|---|
| Prometheus | Collects and stores time-series metrics | Latency, throughput, error rates |
| Grafana | Visualizes metrics in dashboards | Real-time health, trend analysis |
| Alertmanager | Routes alerts based on severity and team | Page on-call for P0, Slack for P2 |
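As a minimal sketch of what "Prometheus collects metrics" looks like from the application side, here is a toy service instrumented with the official prometheus_client Python library. The endpoint name, port, and simulated handler are assumptions for illustration:

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Metric names are illustrative; follow your own naming conventions.
REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests", ["endpoint", "status"]
)
LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency in seconds", ["endpoint"]
)

def handle_login():
    """Simulated request handler instrumented for Prometheus."""
    with LATENCY.labels(endpoint="/login").time():  # records duration on exit
        time.sleep(random.uniform(0.05, 0.3))       # stand-in for real work
    status = "200" if random.random() > 0.01 else "500"
    REQUESTS.labels(endpoint="/login", status=status).inc()

if __name__ == "__main__":
    start_http_server(8000)  # exposes http://localhost:8000/metrics for scraping
    while True:
        handle_login()
```

Prometheus scrapes the /metrics endpoint on a schedule, and Grafana visualizes PromQL queries over the stored series, e.g. `histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))` for p99 latency.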
3. Alerting Strategy
Every alert must be actionable. If an alert doesn't require human action, it's noise — remove it.
Good alerts: "p99 latency > 500ms for 5 minutes → investigate." Bad alerts: "CPU at 60%" — so what? Alert fatigue is real: too many alerts → humans ignore all of them → real incidents go unnoticed.
| Concept | What It Means | When It Applies |
|---|---|---|
| Actionable Alerts | Every alert has a clear human action | Every alert definition |
| Severity Levels | P0 (page immediately) vs. P2 (next business day) | Alert routing |
| Alert Fatigue | Too many alerts → humans ignore them | Alert review cadence |
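In Prometheus itself, "p99 > 500ms for 5 minutes" is an alerting rule with a `for: 5m` clause, routed through Alertmanager. The Python sketch below only illustrates the underlying "sustained breach" logic; `get_p99_latency_ms` and `page_on_call` are hypothetical stand-ins:

```python
import random
import time

THRESHOLD_MS = 500     # p99 latency threshold from the alert definition above
SUSTAIN_SECONDS = 300  # breach must persist for 5 minutes before paging
CHECK_INTERVAL = 15    # evaluation interval in seconds

def get_p99_latency_ms() -> float:
    # Stand-in for a real PromQL query such as:
    # histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
    return random.uniform(300, 700)

def page_on_call(message: str) -> None:
    # Stand-in for an Alertmanager/PagerDuty integration, which would
    # also deduplicate repeated firings of the same alert.
    print(f"PAGE: {message}")

def evaluate_forever() -> None:
    breach_started = None
    while True:
        if get_p99_latency_ms() > THRESHOLD_MS:
            breach_started = breach_started or time.monotonic()
            if time.monotonic() - breach_started >= SUSTAIN_SECONDS:
                page_on_call("p99 latency > 500ms for 5 min -> investigate")
        else:
            breach_started = None  # condition cleared: reset the timer
        time.sleep(CHECK_INTERVAL)

if __name__ == "__main__":
    evaluate_forever()
```

The `for:`-style delay is what separates an actionable alert from noise: a 10-second latency spike self-resolves; a 5-minute sustained breach needs a human.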
4. Self-Healing Infrastructure
The best incident is the one that resolves itself before a human notices.
Auto-scaling: add instances when CPU > 70%. Auto-restart: container crashes → orchestrator restarts within 30 seconds. Circuit breakers: dependency fails → serve cached/default response. Health checks: load balancer stops routing to unhealthy instances.
| Mechanism | Trigger | Action |
|---|---|---|
| Auto-Scaling | CPU > 70% for 5 min | Add 2 instances |
| Auto-Restart | Health check fails 3 times | Restart container |
| Circuit Breaker | Dependency error rate > 50% | Return fallback response |
| Load Balancer | Health check HTTP 503 | Route away from instance |
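To show how the circuit breaker row works in practice, here is a minimal sketch in Python. The 50% threshold mirrors the table; the sample-size and cooldown values are illustrative assumptions, and production systems typically reach for an existing library (e.g. pybreaker) rather than hand-rolling this:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after too many failures, stop calling the
    dependency and serve a fallback until a cooldown elapses."""

    def __init__(self, failure_threshold=0.5, min_calls=20, cooldown_s=30):
        self.failure_threshold = failure_threshold  # trip above 50% error rate
        self.min_calls = min_calls                  # don't trip on tiny samples
        self.cooldown_s = cooldown_s
        self.calls = 0
        self.failures = 0
        self.opened_at = None                       # None means circuit closed

    def call(self, dependency, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                return fallback()                   # open: fail fast, serve cached/default
            self.opened_at = None                   # half-open: retry the dependency
            self.calls = self.failures = 0
        try:
            result = dependency()
            self.calls += 1
            return result
        except Exception:
            self.calls += 1
            self.failures += 1
            if (self.calls >= self.min_calls
                    and self.failures / self.calls > self.failure_threshold):
                self.opened_at = time.monotonic()   # trip: open the circuit
            return fallback()
```

A caller wraps each dependency call, e.g. `breaker.call(lambda: fetch_profile(user_id), lambda: DEFAULT_PROFILE)` (both names hypothetical). Users get a degraded-but-fast response instead of piling up timeouts against a failing dependency.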
5. Incident Response & On-Call
On-call isn't about being awake at 3am. It's about having runbooks that make 3am incidents solvable in 10 minutes.
On-call rotation ensures someone is always responsible. Runbooks provide step-by-step instructions for known failure modes. Escalation paths define who to call when runbooks aren't enough.
| Concept | What It Means | When It Applies |
|---|---|---|
| On-Call Rotation | Always someone responsible | 24/7 coverage |
| Runbooks | Step-by-step instructions per alert | Every known failure mode |
| Escalation | When to call the next level | Runbook doesn't resolve within 15 min |
The 5 Pillars at a Glance
| # | Pillar | What It Answers | Key Decision |
|---|---|---|---|
| ① | SLOs/SLIs | What's "reliable enough"? | Targets + measurements + error budgets |
| ② | Monitoring | Is the system healthy? | Prometheus + Grafana + dashboards |
| ③ | Alerting | When should we act? | Actionable alerts, severity routing |
| ④ | Self-Healing | Can the system fix itself? | Auto-scale, restart, circuit break |
| ⑤ | Incident Response | What do we do when it breaks? | Rotation, runbooks, escalation |
Master these 5 pillars, master reliability.
Try It Yourself — A Starter Prompt for SRE
This prompt gives you a working starting point. For the complete prompt — with Prometheus configs, Grafana dashboard JSON, and runbook templates — see the full course chapter →.
You are a Senior SRE with experience in Prometheus, Grafana, and self-healing infrastructure.
I need an SRE plan for:
{{PASTE YOUR SYSTEM ARCHITECTURE AND RELIABILITY REQUIREMENTS}}
Cover these 5 areas:
1. SLOs — Define SLOs for the 3 most critical user journeys. Include error budgets.
2. MONITORING — Design the Prometheus metrics and Grafana dashboards needed.
3. ALERTING — Define alerts with severity, threshold, and required action.
4. SELF-HEALING — Design auto-recovery mechanisms for the 3 most likely failure modes.
5. ON-CALL — Define the on-call rotation structure and escalation paths.
For each area, provide: the design and a brief justification.
Format as a structured document with tables.
What This Prompt Covers vs. What It Misses
| Skill | Lite Prompt (Free) | Full Prompt (Course) | Impact of Missing It |
|---|---|---|---|
| SLO definitions | ✅ Covered | ✅ Covered | — |
| Monitoring design | ✅ Covered | ✅ Covered | — |
| Alert definitions | ✅ Covered | ✅ Covered | — |
| Prometheus config files | ❌ Missing | ✅ Ready-to-deploy prometheus.yml | Design exists but takes 3 hours to implement — config is the hard part |
| Grafana dashboard JSON | ❌ Missing | ✅ Import-ready dashboard definitions | Dashboard "designed" but never built — no visibility in production |
| Error budget policy | ❌ Missing | ✅ "If error budget < 20%, freeze feature releases" | SLO violated for 2 weeks. No policy triggers. Team keeps shipping features. Reliability degrades. |
| Runbook content | ⚠️ Surface-level | ✅ Step-by-step commands for each failure mode | On-call engineer at 3am: "Check the runbook." Runbook says "investigate." Not actionable. |
| Post-incident review process | ❌ Missing | ✅ Blameless postmortem template | Incident happens. Nobody learns. Same incident in 3 months. |
The Lite Prompt gets you to ~60% quality. Good enough to understand SRE. Not good enough to maintain 99.9% uptime.
Real-World Example: SRE for a Messaging Platform
The Requirement
"Design SRE for a messaging platform: real-time messages, online presence, and push notifications. Target: 99.95% uptime. p99 message delivery < 500ms."
Lite Prompt Output
① SLOs: Message delivery: 99.95% success, p99 < 500ms. Presence: 99.9% accuracy. Push: 99% delivered within 30s.
② Monitoring: Prometheus metrics for message latency, delivery rate, presence accuracy. Grafana dashboard per service.
③ Alerting: P0: delivery rate < 99% for 5 min. P1: p99 > 500ms for 10 min. P2: push delay > 60s.
④ Self-Healing: Auto-scale message service on high load. Auto-restart presence service on crash.
⑤ On-Call: Weekly rotation. Escalation: L1 → L2 → engineering lead.
What an SRE Manager Would Catch
| Area | Lite Output Says | What's Missing | Real-World Consequence |
|---|---|---|---|
| SLOs | "99.95% uptime" | No error budget calculation. 99.95% = 21.9 min/month downtime. Is that acceptable? | Marketing says "99.95% uptime" but incident takes 25 minutes. Is that an SLO breach? Nobody agreed on measurement window. |
| Monitoring | "Prometheus metrics per service" | No distributed tracing. Message goes through 3 services — where's the latency? | p99 is 600ms. Which service? All three look healthy individually. No end-to-end trace. |
| Alerting | "delivery rate < 99% for 5 min" | No auto-resolve notification. Alert fires and stays firing. Is it still ongoing? | Alert fires at 2am. On-call wakes up. By the time the laptop opens, the alert has auto-resolved. No "resolved" notification. Wasted wake-up. |
| Self-Healing | "Auto-scale on high load" | No pre-warming. Auto-scaling takes 3 minutes. Users experience degradation during scale-up. | Viral event at 8pm. Traffic 5x. Auto-scaling kicks in after 3 minutes. Those 3 minutes = degraded experience for 100K users. |
| On-Call | "Weekly rotation" | No handoff protocol. No on-call quality review. No compensation model. | Outgoing on-call had an ongoing incident. Incoming on-call doesn't know. Continuity lost. Incident drags. |
The pattern: The Lite Prompt asks "what's the SRE plan?" The full course asks "what's the plan, what's the error budget policy, and what happens at 3am?"
Ready to Maintain 99.9% Uptime?
- ✅ The complete prompt with Prometheus configs, Grafana dashboards, and runbook templates
- ✅ An AI agent that monitors and self-heals infrastructure
- ✅ Assessment + coding challenges to verify you can operate, not just design
Go from "I understand SRE" to "I can maintain 99.9% uptime for a production system."