Maintain 99.9% Uptime — Site Reliability Engineering

By the end of this page, you will understand how SREs define SLOs/SLIs, set up monitoring with Prometheus and Grafana, and build self-healing infrastructure — and how AI agents can automate operations.

Operations & Monitoring — The 2-Minute Overview

[Chapter 15 Cartoon — The 3 AM Alert]

Think about the last time you stayed in a hotel. You didn't see the building management system — monitoring room temperatures, water pressure, elevator status, and fire alarms 24/7. You just had hot water and working Wi-Fi. But somebody had to define "acceptable" (water between 38-42°C), set up monitoring (sensors in every room), and build automatic responses (pressure drops → backup pump activates). That building management system is SRE.

```mermaid
graph LR
    subgraph INPUT["SRE Inputs"]
        I1["System Architecture"]
        I2["Business Requirements"]
        I3["Incident History"]
    end
    subgraph SRE["SRE Operations"]
        S1["Define SLOs / SLIs"]
        S2["Set Up Monitoring & Alerting"]
        S3["Build Self-Healing Infrastructure"]
    end
    subgraph OUTPUT["SRE Outputs"]
        O1["Prometheus/Grafana Dashboards"]
        O2["Alert Runbooks"]
        O3["Auto-Recovery Systems"]
    end
    I1 --> S1
    I2 --> S1
    I3 --> S3
    S1 --> S2
    S2 --> S3
    S3 --> O1
    S3 --> O2
    S3 --> O3
    style INPUT fill:#16213e,stroke:#0f3460,color:#fff
    style SRE fill:#1a1a2e,stroke:#e94560,color:#fff
    style OUTPUT fill:#006400,stroke:#00cc00,color:#fff
```

You Already Know SRE — You Just Don't Know It Yet

You've been an SRE every time you maintained an aquarium.

🐠 The Aquarium Analogy

Step 1 — Define SLOs: Water temperature: 24-26°C. pH: 7.0-7.5. Oxygen: >6 mg/L.

🔗 SRE Layer: ① SLOs/SLIs — Define what "healthy" means for each service.

Step 2 — Set up monitoring: Thermometer (always visible), pH test kit (weekly), oxygen meter (daily).

🔗 SRE Layer: ② MONITORING — Prometheus collects metrics, Grafana displays dashboards, alerts fire when thresholds are breached.

Step 3 — Self-healing: Heater automatically maintains temperature. Filter runs continuously. Auto-feeder dispenses food.

🔗 SRE Layer: ③ SELF-HEALING — Auto-scaling on high CPU, auto-restart on container crash, circuit breakers on dependency failure.

The Complete Mapping

| Aquarium | SRE | Phase |
|---|---|---|
| Temperature: 24-26°C, pH: 7.0-7.5 | SLOs: p99 < 200ms, uptime > 99.9% | ① Define SLOs |
| Thermometer, pH kit, oxygen meter | Prometheus metrics, Grafana dashboards, PagerDuty alerts | ② Monitor |
| Heater auto-adjusts, filter runs continuously | Auto-scaling, auto-restart, circuit breakers | ③ Self-Heal |
You just learned SRE without opening a terminal.


The 5 Pillars of Site Reliability Engineering

1. SLOs and SLIs

SLOs are promises to users. SLIs measure if you're keeping them. Error budgets tell you how much room you have.

Define an SLO for each critical user journey. Measure it with an SLI. Track your error budget — the amount of allowed unreliability.

| Concept | What It Means | Example |
|---|---|---|
| SLO | Target reliability level | 99.9% of login requests succeed in < 500ms |
| SLI | Actual measurement | This week: 99.7% succeeded in < 500ms |
| Error Budget | Allowed failures before action required | 0.1% in 30 days = 43 minutes of downtime allowed |
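
To make the SLI concrete, here is a minimal sketch of the login-availability SLI as a Prometheus recording rule. It assumes the service exports a standard `http_requests_total` counter with `route` and `code` labels; the metric, label, and rule names are illustrative, not a prescribed schema.

```yaml
# sli-rules.yml -- illustrative sketch; assumes the service exposes
# an http_requests_total counter with `route` and `code` labels.
groups:
  - name: slo-login
    rules:
      # SLI: fraction of login requests over 30 days that did not fail (non-5xx).
      - record: sli:login_availability:ratio_30d
        expr: |
          sum(rate(http_requests_total{route="/login", code!~"5.."}[30d]))
          /
          sum(rate(http_requests_total{route="/login"}[30d]))
```

The error budget in the table is simple arithmetic: a 99.9% SLO leaves 0.1% of a 30-day window for failure, and 0.001 × 30 × 24 × 60 ≈ 43 minutes. (The latency half of the SLO would need a second rule over a latency histogram.)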

2. Monitoring with Prometheus & Grafana

If you're not measuring it, you're guessing. If you're measuring the wrong thing, you're confidently wrong.

Prometheus scrapes metrics (request count, latency, error rate, resource utilization). Grafana visualizes them in dashboards. Together, they answer: "Is the system healthy right now?" and "Is it trending toward unhealthy?"

| Component | What It Does | Use Case |
|---|---|---|
| Prometheus | Collects and stores time-series metrics | Latency, throughput, error rates |
| Grafana | Visualizes metrics in dashboards | Real-time health, trend analysis |
| Alertmanager | Routes alerts based on severity and team | Page on-call for P0, Slack for P2 |
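
As a sketch of how the pieces connect, a minimal `prometheus.yml` might look like the following. The job name, port, and rule-file names are placeholders, not the ready-to-deploy config from the full chapter.

```yaml
# prometheus.yml -- minimal sketch; targets and file names are placeholders.
global:
  scrape_interval: 15s       # how often Prometheus pulls metrics from targets
  evaluation_interval: 15s   # how often recording/alerting rules are evaluated

scrape_configs:
  - job_name: "login-service"
    metrics_path: /metrics
    static_configs:
      - targets: ["login-service:8080"]

rule_files:
  - "sli-rules.yml"     # recording rules (the SLIs above)
  - "alert-rules.yml"   # alerting rules (next pillar)
```

Grafana then sits on top: it queries Prometheus as a data source, so dashboards and alerts are built from the same metrics.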

3. Alerting Strategy

Every alert must be actionable. If an alert doesn't require human action, it's noise — remove it.

Good alerts: "p99 latency > 500ms for 5 minutes → investigate." Bad alerts: "CPU at 60%" — so what? Alert fatigue is real: too many alerts → humans ignore all of them → real incidents go unnoticed.

| Concept | What It Means | When It Applies |
|---|---|---|
| Actionable Alerts | Every alert has a clear human action | Every alert definition |
| Severity Levels | P0 (page immediately) vs. P2 (next business day) | Alert routing |
| Alert Fatigue | Too many alerts → humans ignore them | Alert review cadence |
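
The good alert above ("p99 latency > 500ms for 5 minutes") translates directly into a Prometheus alerting rule. This sketch assumes the service exports a standard `http_request_duration_seconds` histogram; all names are illustrative.

```yaml
# alert-rules.yml -- illustrative alert; the histogram name is an assumption.
groups:
  - name: latency-alerts
    rules:
      - alert: LoginP99LatencyHigh
        # p99 computed from histogram buckets over 5-minute windows
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(http_request_duration_seconds_bucket{route="/login"}[5m]))
          ) > 0.5
        for: 5m                # must hold for 5 minutes, filtering brief blips
        labels:
          severity: P1         # routed by severity (see Alertmanager, below)
        annotations:
          summary: "Login p99 latency above 500ms for 5 minutes"
          action: "Check recent deploys and dependency dashboards; follow the runbook"
```

Note the two ingredients that make it actionable: the `for:` clause suppresses noise, and the annotation names a concrete next step.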

4. Self-Healing Infrastructure

The best incident is the one that resolves itself before a human notices.

Auto-scaling: add instances when CPU > 70%. Auto-restart: container crashes → orchestrator restarts within 30 seconds. Circuit breakers: dependency fails → serve cached/default response. Health checks: load balancer stops routing to unhealthy instances.

| Mechanism | Trigger | Action |
|---|---|---|
| Auto-Scaling | CPU > 70% for 5 min | Add 2 instances |
| Auto-Restart | Health check fails 3 times | Restart container |
| Circuit Breaker | Dependency error rate > 50% | Return fallback response |
| Load Balancer | Health check HTTP 503 | Route away from instance |
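
If the workload runs on Kubernetes (an assumption; the mechanisms above are orchestrator-agnostic), the first row of the table maps to a standard primitive. A minimal sketch:

```yaml
# Sketch: auto-scaling via a HorizontalPodAutoscaler. Service name and
# replica counts are illustrative, mirroring the "CPU > 70%" trigger above.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: message-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: message-service
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average CPU exceeds 70%
```

Auto-restart is similarly declarative: a container liveness probe with `failureThreshold: 3` restarts the container after three consecutive failed health checks, matching the table's second row. Circuit breakers live in application code or a service mesh rather than in orchestrator config.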

5. Incident Response & On-Call

On-call isn't about being awake at 3am. It's about having runbooks that make 3am incidents solvable in 10 minutes.

On-call rotation ensures someone is always responsible. Runbooks provide step-by-step instructions for known failure modes. Escalation paths define who to call when runbooks aren't enough.

| Concept | What It Means | When It Applies |
|---|---|---|
| On-Call Rotation | Always someone responsible | 24/7 coverage |
| Runbooks | Step-by-step instructions per alert | Every known failure mode |
| Escalation | When to call the next level | Runbook doesn't resolve within 15 min |
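
A runbook earns its keep only if it is executable at 3am. One possible skeleton, sketched here as YAML for consistency (the format is a choice, not a standard; all content is illustrative). The test: every step is a command, a dashboard link, or a decision, never "investigate."

```yaml
# Hypothetical runbook skeleton -- one file per alert; content illustrative.
alert: LoginP99LatencyHigh
severity: P1
verify:
  - "Open the login-latency Grafana dashboard; confirm p99 is still > 500ms"
diagnose:
  - "Check the deploy log: did latency rise right after a rollout?"
  - "Check dependency dashboards (database, cache) for correlated spikes"
mitigate:
  - "If correlated with a deploy: roll it back"
  - "If CPU-bound: scale the service out manually while autoscaling catches up"
escalate: "Not resolved within 15 minutes -> page L2 per the escalation path"
```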

The Complete Mapping

| # | Pillar | What It Answers | Key Decision |
|---|---|---|---|
| 1 | SLOs/SLIs | What's "reliable enough"? | Targets + measurements + error budgets |
| 2 | Monitoring | Is the system healthy? | Prometheus + Grafana + dashboards |
| 3 | Alerting | When should we act? | Actionable alerts, severity routing |
| 4 | Self-Healing | Can the system fix itself? | Auto-scale, restart, circuit break |
| 5 | Incident Response | What do we do when it breaks? | Rotation, runbooks, escalation |
Master these 5 pillars, master reliability.


Try It Yourself — A Starter Prompt for SRE

This prompt gives you a working starting point. For the complete prompt — with Prometheus configs, Grafana dashboard JSON, and runbook templates — see the full course chapter →.
You are a Senior SRE with experience in Prometheus, Grafana, and self-healing infrastructure.

I need an SRE plan for:

{{PASTE YOUR SYSTEM ARCHITECTURE AND RELIABILITY REQUIREMENTS}}

Cover these 5 areas:

1. SLOs — Define SLOs for the 3 most critical user journeys. Include error budgets.
2. MONITORING — Design the Prometheus metrics and Grafana dashboards needed.
3. ALERTING — Define alerts with severity, threshold, and required action.
4. SELF-HEALING — Design auto-recovery mechanisms for the 3 most likely failure modes.
5. ON-CALL — Define the on-call rotation structure and escalation paths.

For each area, provide: the design and a brief justification.

Format as a structured document with tables.

What This Prompt Covers vs. What It Misses

| Skill | Lite Prompt (Free) | Full Prompt (Course) | Impact of Missing It |
|---|---|---|---|
| SLO definitions | ✅ Covered | ✅ Covered | |
| Monitoring design | ✅ Covered | ✅ Covered | |
| Alert definitions | ✅ Covered | ✅ Covered | |
| Prometheus config files | ❌ Missing | ✅ Ready-to-deploy prometheus.yml | Design exists but takes 3 hours to implement — config is the hard part |
| Grafana dashboard JSON | ❌ Missing | ✅ Import-ready dashboard definitions | Dashboard "designed" but never built — no visibility in production |
| Error budget policy | ❌ Missing | ✅ "If error budget < 20%, freeze feature releases" | SLO violated for 2 weeks. No policy triggers. Team keeps shipping features. Reliability degrades. |
| Runbook content | ⚠️ Surface-level | ✅ Step-by-step commands for each failure mode | On-call engineer at 3am: "Check the runbook." Runbook says "investigate." Not actionable. |
| Post-incident review process | ❌ Missing | ✅ Blameless postmortem template | Incident happens. Nobody learns. Same incident in 3 months. |
The Lite Prompt gets you to ~60% quality. Good enough to understand SRE. Not good enough to maintain 99.9% uptime.


Real-World Example: SRE for a Messaging Platform

The Requirement

"Design SRE for a messaging platform: real-time messages, online presence, and push notifications. Target: 99.95% uptime. p99 message delivery < 500ms."

Lite Prompt Output

① SLOs: Message delivery: 99.95% success, p99 < 500ms. Presence: 99.9% accuracy. Push: 99% delivered within 30s.

② Monitoring: Prometheus metrics for message latency, delivery rate, presence accuracy. Grafana dashboard per service.

③ Alerting: P0: delivery rate < 99% for 5 min. P1: p99 > 500ms for 10 min. P2: push delay > 60s.

④ Self-Healing: Auto-scale message service on high load. Auto-restart presence service on crash.

⑤ On-Call: Weekly rotation. Escalation: L1 → L2 → engineering lead.


What an SRE Manager Would Catch

| Area | Lite Output Says | What's Missing | Real-World Consequence |
|---|---|---|---|
| SLOs | "99.95% uptime" | No error budget calculation. 99.95% = 21.9 min/month downtime. Is that acceptable? | Marketing says "99.95% uptime" but incident takes 25 minutes. Is that an SLO breach? Nobody agreed on measurement window. |
| Monitoring | "Prometheus metrics per service" | No distributed tracing. Message goes through 3 services — where's the latency? | p99 is 600ms. Which service? All three look healthy individually. No end-to-end trace. |
| Alerting | "delivery rate < 99% for 5 min" | No auto-resolved notification. Alert fires and stays firing. Is it still ongoing? | Alert fires at 2am. On-call wakes up. By the time laptop opens, it auto-resolved. No "resolved" notification. Wasted wake-up. |
| Self-Healing | "Auto-scale on high load" | No pre-warming. Auto-scaling takes 3 minutes. Users experience degradation during scale-up. | Viral event at 8pm. Traffic 5x. Auto-scaling kicks in after 3 minutes. Those 3 minutes = degraded experience for 100K users. |
| On-Call | "Weekly rotation" | No handoff protocol. No on-call quality review. No compensation model. | Outgoing on-call had an ongoing incident. Incoming on-call doesn't know. Continuity lost. Incident drags. |
The pattern: The Lite Prompt asks "what's the SRE plan?" The full course asks "what's the plan, what's the error budget policy, and what happens at 3am?"
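
Two of those gaps are pure configuration. A minimal Alertmanager sketch showing severity routing (from the monitoring table) plus resolved notifications, which closes the 2am wasted-wake-up gap; receiver names, keys, and URLs are placeholders:

```yaml
# alertmanager.yml -- illustrative sketch; keys, URLs, and names are placeholders.
route:
  receiver: slack-default          # everything not matched below goes to Slack
  routes:
    - matchers: [ 'severity="P0"' ]
      receiver: pagerduty-oncall   # P0 pages the on-call immediately

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: "<pagerduty-routing-key>"
        send_resolved: true        # notify when the alert clears, too
  - name: slack-default
    slack_configs:
      - api_url: "<slack-webhook-url>"
        channel: "#alerts"
        send_resolved: true        # no more guessing whether it's still firing
```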


Ready to Maintain 99.9% Uptime?

Enroll in the Fresh Graduate AI SDLC Course →

Go from "I understand SRE" to "I can maintain 99.9% uptime for a production system."