Maintain 99.9% Uptime — Site Reliability Engineering
By the end of this page, you will understand how SREs define SLOs/SLIs, set up monitoring with Prometheus and Grafana, and build self-healing infrastructure — and how AI agents can automate operations.
Operations & Monitoring — The 2-Minute Overview
Think about the last time you stayed in a hotel. You never saw the building management system that was monitoring room temperatures, water pressure, elevator status, and fire alarms 24/7. You just had hot water and working Wi-Fi. But somebody had to define "acceptable" (water between 38 and 42°C), set up monitoring (sensors in every room), and build automatic responses (pressure drops → backup pump activates). That building management system is SRE.
You Already Know SRE — You Just Don't Know It Yet
You've been an SRE every time you've maintained an aquarium.
🐠 The Aquarium Analogy
Step 1 — Define SLOs: Water temperature: 24-26°C. pH: 7.0-7.5. Oxygen: >6 mg/L.
🔗 SRE Layer: ① SLOs/SLIs — Define what "healthy" means for each service.
Step 2 — Set up monitoring: Thermometer (always visible), pH test kit (weekly), oxygen meter (daily).
🔗 SRE Layer: ② MONITORING — Prometheus collects metrics, Grafana displays dashboards, alerts fire when thresholds are breached.
Step 3 — Self-healing: Heater automatically maintains temperature. Filter runs continuously. Auto-feeder dispenses food.
🔗 SRE Layer: ③ SELF-HEALING — Auto-scaling on high CPU, auto-restart on container crash, circuit breakers on dependency failure.
The Complete Mapping
| Aquarium | SRE | Phase |
|---|---|---|
| Temperature: 24-26°C, pH: 7.0-7.5 | SLOs: p99 < 200ms, uptime > 99.9% | ① Define SLOs |
| Thermometer, pH kit, oxygen meter | Prometheus metrics, Grafana dashboards, PagerDuty alerts | ② Monitor |
| Heater auto-adjusts, filter runs continuously | Auto-scaling, auto-restart, circuit breakers | ③ Self-Heal |
You just learned SRE without opening a terminal.
The 5 Pillars of Site Reliability Engineering
1. SLOs and SLIs
SLOs are promises to users. SLIs measure if you're keeping them. Error budgets tell you how much room you have.
Define an SLO for each critical user journey. Measure it with an SLI. Track your error budget — the amount of allowed unreliability.
| Concept | What It Means | Example |
|---|---|---|
| SLO | Target reliability level | 99.9% of login requests succeed in < 500ms |
| SLI | Actual measurement | This week: 99.7% succeeded in < 500ms |
| Error Budget | Allowed failures before action required | 0.1% of a 30-day window ≈ 43 minutes of downtime allowed |
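To make the last row concrete, here is a minimal sketch of the error budget arithmetic in Python. The SLO target and the 12 minutes of measured downtime are illustrative assumptions, not values from any real system:

```python
# Error budget arithmetic for an availability SLO.
SLO_TARGET = 0.999   # 99.9% availability target (illustrative)
WINDOW_DAYS = 30     # measurement window

window_minutes = WINDOW_DAYS * 24 * 60               # 43,200 minutes in the window
budget_minutes = window_minutes * (1 - SLO_TARGET)   # ~43.2 minutes of allowed downtime

# Compare the budget against what you have actually burned.
downtime_so_far = 12.0                               # hypothetical measurement, in minutes
remaining = budget_minutes - downtime_so_far

print(f"Budget: {budget_minutes:.1f} min | Burned: {downtime_so_far:.1f} min "
      f"| Remaining: {remaining:.1f} min ({downtime_so_far / budget_minutes:.0%} used)")
```

When the remaining budget approaches zero, an error budget policy (covered later on this page) decides what happens next, e.g. freezing feature releases.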
2. Monitoring with Prometheus & Grafana
If you're not measuring it, you're guessing. If you're measuring the wrong thing, you're confidently wrong.
Prometheus scrapes metrics (request count, latency, error rate, resource utilization). Grafana visualizes them in dashboards. Together, they answer: "Is the system healthy right now?" and "Is it trending toward unhealthy?"
| Component | What It Does | Use Case |
|---|---|---|
| Prometheus | Collects and stores time-series metrics | Latency, throughput, error rates |
| Grafana | Visualizes metrics in dashboards | Real-time health, trend analysis |
| Alertmanager | Routes alerts based on severity and team | Page on-call for P0, Slack for P2 |
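As a minimal sketch of what "Prometheus collects metrics" looks like from the application side, here is a toy service instrumented with the official prometheus_client Python library. The endpoint name, port, and simulated handler are assumptions for illustration:

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Metric names are illustrative; follow your own naming conventions.
REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests", ["endpoint", "status"]
)
LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency in seconds", ["endpoint"]
)

def handle_login():
    """Simulated request handler instrumented for Prometheus."""
    with LATENCY.labels(endpoint="/login").time():  # records duration on exit
        time.sleep(random.uniform(0.05, 0.3))       # stand-in for real work
    status = "200" if random.random() > 0.01 else "500"
    REQUESTS.labels(endpoint="/login", status=status).inc()

if __name__ == "__main__":
    start_http_server(8000)  # exposes http://localhost:8000/metrics for scraping
    while True:
        handle_login()
```

Prometheus scrapes the /metrics endpoint on a schedule, and Grafana visualizes PromQL queries over the stored series, e.g. `histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))` for p99 latency.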
3. Alerting Strategy
Every alert must be actionable. If an alert doesn't require human action, it's noise — remove it.
Good alerts: "p99 latency > 500ms for 5 minutes → investigate." Bad alerts: "CPU at 60%" — so what? Alert fatigue is real: too many alerts → humans ignore all of them → real incidents go unnoticed.
| Concept | What It Means | When It Applies |
|---|---|---|
| Actionable Alerts | Every alert has a clear human action | Every alert definition |
| Severity Levels | P0 (page immediately) vs. P2 (next business day) | Alert routing |
| Alert Fatigue | Too many alerts → humans ignore them | Alert review cadence |
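In Prometheus itself, "p99 > 500ms for 5 minutes" is an alerting rule with a `for: 5m` clause, routed through Alertmanager. The Python sketch below only illustrates the underlying "sustained breach" logic; `get_p99_latency_ms` and `page_on_call` are hypothetical stand-ins:

```python
import random
import time

THRESHOLD_MS = 500     # p99 latency threshold from the alert definition above
SUSTAIN_SECONDS = 300  # breach must persist for 5 minutes before paging
CHECK_INTERVAL = 15    # evaluation interval in seconds

def get_p99_latency_ms() -> float:
    # Stand-in for a real PromQL query such as:
    # histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
    return random.uniform(300, 700)

def page_on_call(message: str) -> None:
    # Stand-in for an Alertmanager/PagerDuty integration, which would
    # also deduplicate repeated firings of the same alert.
    print(f"PAGE: {message}")

def evaluate_forever() -> None:
    breach_started = None
    while True:
        if get_p99_latency_ms() > THRESHOLD_MS:
            breach_started = breach_started or time.monotonic()
            if time.monotonic() - breach_started >= SUSTAIN_SECONDS:
                page_on_call("p99 latency > 500ms for 5 min -> investigate")
        else:
            breach_started = None  # condition cleared: reset the timer
        time.sleep(CHECK_INTERVAL)

if __name__ == "__main__":
    evaluate_forever()
```

The `for:`-style delay is what separates an actionable alert from noise: a 10-second latency spike self-resolves; a 5-minute sustained breach needs a human.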
4. Self-Healing Infrastructure
The best incident is the one that resolves itself before a human notices.
Auto-scaling: add instances when CPU > 70%. Auto-restart: container crashes → orchestrator restarts within 30 seconds. Circuit breakers: dependency fails → serve cached/default response. Health checks: load balancer stops routing to unhealthy instances.
| Mechanism | Trigger | Action |
|---|---|---|
| Auto-Scaling | CPU > 70% for 5 min | Add 2 instances |
| Auto-Restart | Health check fails 3 times | Restart container |
| Circuit Breaker | Dependency error rate > 50% | Return fallback response |
| Load Balancer | Health check HTTP 503 | Route away from instance |
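To show how the circuit breaker row works in practice, here is a minimal sketch in Python. The 50% threshold mirrors the table; the sample-size and cooldown values are illustrative assumptions, and production systems typically reach for an existing library (e.g. pybreaker) rather than hand-rolling this:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after too many failures, stop calling the
    dependency and serve a fallback until a cooldown elapses."""

    def __init__(self, failure_threshold=0.5, min_calls=20, cooldown_s=30):
        self.failure_threshold = failure_threshold  # trip above 50% error rate
        self.min_calls = min_calls                  # don't trip on tiny samples
        self.cooldown_s = cooldown_s
        self.calls = 0
        self.failures = 0
        self.opened_at = None                       # None means circuit closed

    def call(self, dependency, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                return fallback()                   # open: fail fast, serve cached/default
            self.opened_at = None                   # half-open: retry the dependency
            self.calls = self.failures = 0
        try:
            result = dependency()
            self.calls += 1
            return result
        except Exception:
            self.calls += 1
            self.failures += 1
            if (self.calls >= self.min_calls
                    and self.failures / self.calls > self.failure_threshold):
                self.opened_at = time.monotonic()   # trip: open the circuit
            return fallback()
```

A caller wraps each dependency call, e.g. `breaker.call(lambda: fetch_profile(user_id), lambda: DEFAULT_PROFILE)` (both names hypothetical). Users get a degraded-but-fast response instead of piling up timeouts against a failing dependency.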
5. Incident Response & On-Call
On-call isn't about being awake at 3am. It's about having runbooks that make 3am incidents solvable in 10 minutes.
On-call rotation ensures someone is always responsible. Runbooks provide step-by-step instructions for known failure modes. Escalation paths define who to call when runbooks aren't enough.
| Concept | What It Means | When It Applies |
|---|---|---|
| On-Call Rotation | Always someone responsible | 24/7 coverage |
| Runbooks | Step-by-step instructions per alert | Every known failure mode |
| Escalation | When to call the next level | Runbook doesn't resolve within 15 min |
The 5 Pillars at a Glance
| # | Pillar | What It Answers | Key Decision |
|---|---|---|---|
| ① | SLOs/SLIs | What's "reliable enough"? | Targets + measurements + error budgets |
| ② | Monitoring | Is the system healthy? | Prometheus + Grafana + dashboards |
| ③ | Alerting | When should we act? | Actionable alerts, severity routing |
| ④ | Self-Healing | Can the system fix itself? | Auto-scale, restart, circuit break |
| ⑤ | Incident Response | What do we do when it breaks? | Rotation, runbooks, escalation |
Master these 5 pillars, master reliability.
Try It Yourself — A Starter Prompt for SRE
This prompt gives you a working starting point. For the complete prompt — with Prometheus configs, Grafana dashboard JSON, and runbook templates — see the full course chapter →.
You are a Senior SRE with experience in Prometheus, Grafana, and self-healing infrastructure.
I need an SRE plan for:
{{PASTE YOUR SYSTEM ARCHITECTURE AND RELIABILITY REQUIREMENTS}}
Cover these 5 areas:
1. SLOs — Define SLOs for the 3 most critical user journeys. Include error budgets.
2. MONITORING — Design the Prometheus metrics and Grafana dashboards needed.
3. ALERTING — Define alerts with severity, threshold, and required action.
4. SELF-HEALING — Design auto-recovery mechanisms for the 3 most likely failure modes.
5. ON-CALL — Define the on-call rotation structure and escalation paths.
For each area, provide: the design and a brief justification.
Format as a structured document with tables.
What This Prompt Covers vs. What It Misses
| Skill | Lite Prompt (Free) | Full Prompt (Course) | Impact of Missing It |
|---|---|---|---|
| SLO definitions | ✅ Covered | ✅ Covered | — |
| Monitoring design | ✅ Covered | ✅ Covered | — |
| Alert definitions | ✅ Covered | ✅ Covered | — |
| Prometheus config files | ❌ Missing | ✅ Ready-to-deploy prometheus.yml | Design exists but takes 3 hours to implement — config is the hard part |
| Grafana dashboard JSON | ❌ Missing | ✅ Import-ready dashboard definitions | Dashboard "designed" but never built — no visibility in production |
| Error budget policy | ❌ Missing | ✅ "If error budget < 20%, freeze feature releases" | SLO violated for 2 weeks. No policy triggers. Team keeps shipping features. Reliability degrades. |
| Runbook content | ⚠️ Surface-level | ✅ Step-by-step commands for each failure mode | On-call engineer at 3am: "Check the runbook." Runbook says "investigate." Not actionable. |
| Post-incident review process | ❌ Missing | ✅ Blameless postmortem template | Incident happens. Nobody learns. Same incident in 3 months. |
The Lite Prompt gets you to ~60% quality. Good enough to understand SRE. Not good enough to maintain 99.9% uptime.
Real-World Example: SRE for a Messaging Platform
The Requirement
"Design SRE for a messaging platform: real-time messages, online presence, and push notifications. Target: 99.95% uptime. p99 message delivery < 500ms."
Lite Prompt Output
① SLOs: Message delivery: 99.95% success, p99 < 500ms. Presence: 99.9% accuracy. Push: 99% delivered within 30s.
② Monitoring: Prometheus metrics for message latency, delivery rate, presence accuracy. Grafana dashboard per service.
③ Alerting: P0: delivery rate < 99% for 5 min. P1: p99 > 500ms for 10 min. P2: push delay > 60s.
④ Self-Healing: Auto-scale message service on high load. Auto-restart presence service on crash.
⑤ On-Call: Weekly rotation. Escalation: L1 → L2 → engineering lead.
What an SRE Manager Would Catch
| Area | Lite Output Says | What's Missing | Real-World Consequence |
|---|---|---|---|
| SLOs | "99.95% uptime" | No error budget calculation. 99.95% = 21.9 min/month downtime. Is that acceptable? | Marketing says "99.95% uptime" but incident takes 25 minutes. Is that an SLO breach? Nobody agreed on measurement window. |
| Monitoring | "Prometheus metrics per service" | No distributed tracing. Message goes through 3 services — where's the latency? | p99 is 600ms. Which service? All three look healthy individually. No end-to-end trace. |
| Alerting | "delivery rate < 99% for 5 min" | No auto-resolve notification. Alert fires and stays firing. Is it still ongoing? | Alert fires at 2am. On-call wakes up. By the time the laptop opens, the alert has auto-resolved. No "resolved" notification. Wasted wake-up. |
| Self-Healing | "Auto-scale on high load" | No pre-warming. Auto-scaling takes 3 minutes. Users experience degradation during scale-up. | Viral event at 8pm. Traffic 5x. Auto-scaling kicks in after 3 minutes. Those 3 minutes = degraded experience for 100K users. |
| On-Call | "Weekly rotation" | No handoff protocol. No on-call quality review. No compensation model. | Outgoing on-call had an ongoing incident. Incoming on-call doesn't know. Continuity lost. Incident drags. |
The pattern: The Lite Prompt asks "what's the SRE plan?" The full course asks "what's the plan, what's the error budget policy, and what happens at 3am?"
Ready to Maintain 99.9% Uptime?
- ✅ The complete prompt with Prometheus configs, Grafana dashboards, and runbook templates
- ✅ An AI agent that monitors and self-heals infrastructure
- ✅ Assessment + coding challenges to verify you can operate, not just design
Go from "I understand SRE" to "I can maintain 99.9% uptime for a production system."