Prove Your System Survives Failure — Chaos Engineering

By the end of this page, you will understand how Chaos Testers inject failures — database blackouts, latency spikes, and zombie containers — to verify system resilience before production disasters do it for you.

Chaos Engineering — The 2-Minute Overview

[Chapter 13 cartoon: "We Call It Chaos Testing Now"]

Think about the last time your city ran a fire drill in a large building. Nobody waits for a real fire to test the evacuation plan. They simulate the fire — kill the elevators, trigger the alarms, time the evacuation, and identify where people get stuck. That's Chaos Engineering for buildings. For software, we do the same: kill a database, inject latency, stop a container — and verify the system recovers.

```mermaid
graph LR
    subgraph INPUT["Chaos Inputs"]
        I1["System Architecture"]
        I2["CLAUDE.md Resilience Standards"]
        I3["SLOs / SLIs"]
    end
    subgraph CHAOS["Chaos Engineering"]
        C1["Database Blackout — Kill the DB"]
        C2["Latency Spike — Inject delays"]
        C3["Zombie Container — Stop the API"]
    end
    subgraph OUTPUT["Chaos Outputs"]
        O1["Resilience Report"]
        O2["Fail Fast Compliance"]
        O3["Recovery Time Measurements"]
    end
    I1 --> C1
    I2 --> C2
    I3 --> C3
    C1 --> O1
    C2 --> O2
    C3 --> O3
    style INPUT fill:#16213e,stroke:#0f3460,color:#fff
    style CHAOS fill:#8b0000,stroke:#ff4444,color:#fff
    style OUTPUT fill:#006400,stroke:#00cc00,color:#fff
```

You Already Know Chaos Engineering — You Just Don't Know It Yet

You've been doing chaos engineering every time you tested a backup generator. Let's prove it.

🔌 The Power Outage Analogy

Step 1 — Kill the main power (Database Blackout). Does the backup generator start automatically?

🔗 Chaos Layer: ① DATABASE BLACKOUT — Kill the database during a critical operation. Does the system return a helpful error? Does it retry? Does data stay consistent?

Step 2 — Dim the lights gradually (Latency Spike). Does the building still function when power is at 50%?

🔗 Chaos Layer: ② LATENCY SPIKE — Introduce 5-second delays on network calls. Does the system timeout gracefully? Does the UI show a loading state or freeze?

Step 3 — One floor's circuit breaker trips (Zombie Container). Does the rest of the building keep running?

🔗 Chaos Layer: ③ ZOMBIE CONTAINER — Stop the API container. Does the load balancer route to healthy instances? Does the container auto-restart?

The Complete Mapping

| Power Outage | Chaos Engineering | Type |
|---|---|---|
| Main power killed → backup starts? | Database killed → system degrades gracefully? | ① Database Blackout |
| Power at 50% → building functions? | 5-sec latency injected → system times out gracefully? | ② Latency Spike |
| One floor's circuit trips → other floors work? | API container stopped → traffic routes to healthy instances? | ③ Zombie Container |
You just learned Chaos Engineering without breaking a single system.


The 5 Pillars of Chaos Engineering

1. Database Blackout

The database is the most critical dependency. When it dies, does your system die too?

Kill the database connection during a write operation. Verify: Does the application return a meaningful error? Does it retry with exponential backoff? Is data consistent when the DB comes back? No lost writes? No duplicate entries?

| Scenario | Expected Behavior | Fail Criteria |
|---|---|---|
| DB killed during read | Return cached data or graceful error | Unhandled exception / blank screen |
| DB killed during write | Retry or queue for later, no data loss | Silent data loss or partial write |
| DB recovers after 60s | System auto-reconnects, no manual restart | Requires pod restart to recover |
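
The "retry with exponential backoff" behavior in the table above can be sketched in a few lines. This is a minimal illustration, not a production client: `TransientDBError` and `write_with_backoff` are hypothetical names, and the delays are placeholders you would tune to your SLOs.

```python
import random
import time

class TransientDBError(Exception):
    """Hypothetical error type for 'database is temporarily unreachable'."""

def write_with_backoff(write_fn, max_attempts=4, base_delay=0.1):
    """Retry a write with exponential backoff and jitter; give up after max_attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return write_fn()
        except TransientDBError:
            if attempt == max_attempts:
                raise  # bounded retry: propagate the error instead of looping forever
            # exponential backoff with jitter: ~0.1s, ~0.2s, ~0.4s, ...
            time.sleep(base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5))
```

A chaos experiment would kill the DB, then assert that writes either succeed after reconnect or fail loudly after the retry budget is exhausted, never silently.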

2. Latency Spike

Latency isn't binary (working/broken). It's a spectrum — and your system must handle every point on it.

Inject 2s, 5s, and 10s delays on specific network calls. Verify: Are timeouts configured? Does the UI show loading indicators? Do downstream services queue rather than cascade-fail?

| Delay | Expected Behavior | Fail Criteria |
|---|---|---|
| 2-second delay | Slightly slower response, UI shows loading | No loading indicator, user re-clicks |
| 5-second delay | Timeout, return cached/default response | Hangs indefinitely |
| 10-second delay | Circuit breaker triggers, return fallback | Entire system cascades and crashes |
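
The "timeout, return cached/default response" row can be exercised with a hard deadline around the dependency call. A minimal sketch, assuming a thread-pool-based timeout (the function name `call_with_timeout` is hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=4)

def call_with_timeout(fn, timeout_s, fallback):
    """Run a dependency call with a hard deadline; return a fallback instead of hanging."""
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The slow call is abandoned; the caller gets a degraded but fast answer.
        return fallback
```

In a latency-spike experiment, you inject the delay into `fn` (e.g. a wrapped `time.sleep`) and assert that the caller receives the fallback within the deadline rather than hanging.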

3. Zombie Container

Containers die. The question is: does anyone notice, and does the system recover?

Stop the API container. Verify: Does the load balancer detect the unhealthy instance? Does traffic route to healthy instances? Does the container orchestrator restart the zombie? What's the recovery time?

| Scenario | Expected Behavior | Fail Criteria |
|---|---|---|
| 1 of 3 instances killed | Traffic routes to remaining 2 | Requests fail until manual restart |
| Health check fails | Load balancer stops routing to it | Load balancer keeps sending traffic |
| Container restart | Orchestrator restarts within 30s | Manual intervention required |
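
The load-balancer behavior in the table, route around instances whose health check fails, reduces to a small selection loop. This is an illustrative sketch under simplified assumptions (a boolean `healthy` flag standing in for a real health probe; `pick_instance` is a hypothetical name):

```python
class Instance:
    """Toy stand-in for a container behind a load balancer."""
    def __init__(self, name):
        self.name = name
        self.healthy = True  # in reality, set by a periodic health check

def pick_instance(instances, start=0):
    """Round-robin over instances, skipping any that fail their health check."""
    n = len(instances)
    for i in range(n):
        candidate = instances[(start + i) % n]
        if candidate.healthy:
            return candidate
    raise RuntimeError("no healthy instances")  # fail loudly, not silently
```

A zombie-container experiment marks one instance unhealthy (by stopping it) and asserts that no request is ever routed to it while it is down.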

4. Fail Fast Compliance

"Fail fast" means: detect failure immediately, report it clearly, and don't make it worse.

Verify compliance with CLAUDE.md's "fail fast" standards: errors are detected at the boundary where they occur, errors are propagated with context (not swallowed), and retry logic includes backoff and limits.

| Principle | What It Means | Verification |
|---|---|---|
| Detect at Boundary | Error caught where it occurs, not 5 layers up | Check error origin vs. catch location |
| Propagate with Context | Error message includes what, where, and why | Verify error payloads |
| Bounded Retry | Retry with backoff, not infinite loops | Inject failure, count retries |
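
"Detect at boundary" and "propagate with context" look like this in practice: catch the low-level error exactly where it occurs and re-raise a domain error that carries what, where, and why, chained to the original. A minimal sketch (`PaymentError` and `charge` are hypothetical names, not from any real library):

```python
class PaymentError(Exception):
    """Domain error that carries what/where/why context."""
    def __init__(self, what, where, why):
        super().__init__(f"{what} failed at {where}: {why}")
        self.what, self.where, self.why = what, where, why

def charge(payment_id, gateway_call):
    """Detect failure at the gateway boundary and re-raise with context, never swallow."""
    try:
        return gateway_call(payment_id)
    except ConnectionError as exc:
        # 'from exc' preserves the original traceback for debugging
        raise PaymentError(f"payment {payment_id}", "gateway boundary", str(exc)) from exc
```

Verification is then mechanical: inject a gateway failure and assert the surfaced error names the payment and chains the root cause.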

5. Recovery and Steady State

The system must not only survive failure — it must return to normal without human intervention.

After each chaos experiment, verify: Does the system return to its pre-experiment state? How long does recovery take? Is any data lost or corrupted? Define "steady state" (SLO-compliant performance) and measure time-to-steady-state after each experiment.

| Metric | Definition | Target |
|---|---|---|
| Time to Detection | How fast is failure detected? | < 30 seconds |
| Time to Mitigation | How fast is impact reduced? | < 2 minutes |
| Time to Recovery | How fast is normal state restored? | < 5 minutes |
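
Time-to-recovery can be measured the same way after every experiment: poll a steady-state predicate (e.g. "error rate and latency are within SLO") until it holds, and record how long that took. A sketch, assuming you supply the `is_steady` check (`time_to_recovery` is a hypothetical helper name):

```python
import time

def time_to_recovery(is_steady, poll_interval_s=1.0, deadline_s=300.0):
    """Poll a steady-state predicate after an experiment; return seconds until it holds."""
    start = time.monotonic()
    while time.monotonic() - start < deadline_s:
        if is_steady():
            return time.monotonic() - start
        time.sleep(poll_interval_s)
    # Failing to recover within the deadline is itself a failed experiment.
    raise TimeoutError("system did not return to steady state within deadline")
```

This pins down the ambiguity the course warns about later: "recovered" means the steady-state predicate holds, not merely that the container restarted.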

The Complete Mapping

| # | Pillar | What It Answers | Key Decision |
|---|---|---|---|
| 1 | Database Blackout | What happens when the DB dies? | Retry, degrade, or crash? |
| 2 | Latency Spike | What happens when things get slow? | Timeout, circuit break, or cascade? |
| 3 | Zombie Container | What happens when a server dies? | Auto-recover or manual restart? |
| 4 | Fail Fast | Do we detect and report errors correctly? | Boundary detection, context propagation |
| 5 | Recovery | Do we return to normal automatically? | TTD + TTM + TTR targets |
Master these 5 pillars, master resilience.


Try It Yourself — A Starter Prompt for Chaos Testing

This prompt gives you a working starting point. For the complete prompt — with steady-state definitions, blast radius controls, and automated rollback verification — see the full course chapter →.
You are a Chaos Engineer with experience in resilience testing for distributed systems.

I need a chaos test plan for:

{{PASTE YOUR SYSTEM ARCHITECTURE AND SLOs}}

Cover these 5 areas:

1. DATABASE BLACKOUT — Define the experiment: what to kill, during what operation, what to verify.
2. LATENCY SPIKE — Define delay injection points, delay durations, and expected behavior.
3. ZOMBIE CONTAINER — Define which container to kill and container recovery expectations.
4. FAIL FAST VERIFICATION — How will you verify "fail fast" compliance per CLAUDE.md?
5. RECOVERY METRICS — Define time-to-detection, time-to-mitigation, and time-to-recovery targets.

For each area, provide: the experiment and pass/fail criteria.

Format as a structured document with tables.

What This Prompt Covers vs. What It Misses

| Skill | Lite Prompt (Free) | Full Prompt (Course) | Impact of Missing It |
|---|---|---|---|
| Three chaos experiment types | ✅ Covered | ✅ Covered | — |
| Pass/fail criteria | ✅ Covered | ✅ Covered | — |
| Blast radius control | ❌ Missing | ✅ "Limit experiment to 10% of traffic" | Chaos experiment takes down production. "We were testing resilience." Peak irony. |
| Steady-state hypothesis | ❌ Missing | ✅ "Before: system meets SLOs. After: system returns to SLOs" | No baseline — you can't tell whether the system recovered because you don't know what "normal" looks like |
| Automated rollback | ❌ Missing | ✅ "If error rate > 5%, auto-abort experiment" | Experiment runs too long; real users are affected for 30 minutes |
| Cascading failure detection | ❌ Missing | ✅ "Verify failure in Service A doesn't take down Service B" | Database blackout also crashes the cache, which also crashes the API — an undetected cascade |
The Lite Prompt gets you to ~60% quality. Good enough to know what to test. Not good enough to run chaos experiments safely.


Real-World Example: Chaos Testing for an E-Commerce Checkout

The Requirement

"Chaos test the checkout service: payment processing, inventory reservation, and order confirmation. Verify the system survives: a DB blackout during payment, 5s of latency on the inventory service, and a payment container crash."

Lite Prompt Output

① DB Blackout: Kill PostgreSQL during payment write. Verify: payment doesn't go through partially.

② Latency: Inject 5s delay on inventory check. Verify: checkout shows loading state.

③ Zombie: Kill 1 of 2 payment containers. Verify: traffic routes to survivor.

④ Fail Fast: Errors should include payment ID and failure reason.

⑤ Recovery: System recovers within 5 minutes.


What a Resilience Lead Would Catch

| Area | Lite Output Says | What's Missing | Real-World Consequence |
|---|---|---|---|
| DB Blackout | "Payment doesn't go through partially" | What about the inventory reservation? If payment fails, is inventory released? | User's payment fails. Inventory remains reserved. Item shows "out of stock" for other users. Ghost reservation. |
| Latency | "Checkout shows loading state" | No timeout threshold. What if the user clicks "Pay" again during loading? | User double-clicks. Two payment charges. Two orders. Refund required. |
| Zombie | "Traffic routes to survivor" | No cold-start test. When the killed container restarts, does it handle traffic immediately? | Container restarts with a cold cache. First 100 requests hit the DB directly. DB overloaded. Second outage. |
| Fail Fast | "Errors should include payment ID" | No verification method. How do you check this automatically? | "We'll check the logs manually." Nobody checks. Errors lack payment IDs. Support can't trace failures. |
| Recovery | "5 minutes" | No measurement methodology. From when? Container restart or full SLO compliance? | Container restarts in 30 seconds, but the error rate takes 10 minutes to normalize. "Recovered" when? |
The pattern: The Lite Prompt says "inject failure, check behavior." The full course says "inject failure, control the blast radius, verify cascading effects, and measure recovery to SLO compliance."


Ready to Prove Your System Survives?

Enroll in the Fresh Graduate AI SDLC Course →

Go from "I understand chaos engineering" to "I can prove my system survives failure."