Prove Your System Survives Failure — Chaos Engineering

By the end of this page, you will understand how Chaos Testers inject failures — database blackouts, latency spikes, and zombie containers — to verify system resilience before production disasters do it for you.

Chaos Engineering — The 2-Minute Overview

[Chapter 13 cartoon: "We Call It Chaos Testing Now"]

Think about the last time your city ran a fire drill in a large building. Nobody waits for a real fire to test the evacuation plan. They simulate the fire — kill the elevators, trigger the alarms, time the evacuation, and identify where people get stuck. That's Chaos Engineering for buildings. For software, we do the same: kill a database, inject latency, stop a container — and verify the system recovers.

```mermaid
graph LR
    subgraph INPUT["Chaos Inputs"]
        I1["System Architecture"]
        I2["CLAUDE.md Resilience Standards"]
        I3["SLOs / SLIs"]
    end
    subgraph CHAOS["Chaos Engineering"]
        C1["Database Blackout — Kill the DB"]
        C2["Latency Spike — Inject delays"]
        C3["Zombie Container — Stop the API"]
    end
    subgraph OUTPUT["Chaos Outputs"]
        O1["Resilience Report"]
        O2["Fail Fast Compliance"]
        O3["Recovery Time Measurements"]
    end
    I1 --> C1
    I2 --> C2
    I3 --> C3
    C1 --> O1
    C2 --> O2
    C3 --> O3
    style INPUT fill:#16213e,stroke:#0f3460,color:#fff
    style CHAOS fill:#8b0000,stroke:#ff4444,color:#fff
    style OUTPUT fill:#006400,stroke:#00cc00,color:#fff
```

You Already Know Chaos Engineering — You Just Don't Know It Yet

You've been doing chaos engineering every time you tested a backup generator. Let's prove it.

🔌 The Power Outage Analogy

Step 1 — Kill the main power (Database Blackout). Does the backup generator start automatically?

🔗 Chaos Layer: ① DATABASE BLACKOUT — Kill the database during a critical operation. Does the system return a helpful error? Does it retry? Does data stay consistent?

Step 2 — Dim the lights gradually (Latency Spike). Does the building still function when power is at 50%?

🔗 Chaos Layer: ② LATENCY SPIKE — Introduce 5-second delays on network calls. Does the system timeout gracefully? Does the UI show a loading state or freeze?

Step 3 — One floor's circuit breaker trips (Zombie Container). Does the rest of the building keep running?

🔗 Chaos Layer: ③ ZOMBIE CONTAINER — Stop the API container. Does the load balancer route to healthy instances? Does the container auto-restart?

The Complete Mapping

| Power Outage | Chaos Engineering | Type |
|---|---|---|
| Main power killed → backup starts? | Database killed → system degrades gracefully? | ① Database Blackout |
| Power at 50% → building functions? | 5-sec latency injected → system times out gracefully? | ② Latency Spike |
| One floor's circuit trips → other floors work? | API container stopped → traffic routes to healthy instances? | ③ Zombie Container |
You just learned Chaos Engineering without breaking a single system.


The 5 Pillars of Chaos Engineering

1. Database Blackout

The database is the most critical dependency. When it dies, does your system die too?

Kill the database connection during a write operation. Verify: Does the application return a meaningful error? Does it retry with exponential backoff? Is data consistent when the DB comes back? No lost writes? No duplicate entries?

| Scenario | Expected Behavior | Fail Criteria |
|---|---|---|
| DB killed during read | Return cached data or graceful error | Unhandled exception / blank screen |
| DB killed during write | Retry or queue for later, no data loss | Silent data loss or partial write |
| DB recovers after 60s | System auto-reconnects, no manual restart | Requires pod restart to recover |
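
The "retry with exponential backoff" behavior in the table above can be sketched in a few lines. This is a minimal illustration, not a production client: `TransientDBError` and `write_with_backoff` are hypothetical names, and the delays are placeholders you would tune to your SLOs.

```python
import random
import time

class TransientDBError(Exception):
    """Hypothetical error type for 'database is temporarily unreachable'."""

def write_with_backoff(write_fn, max_attempts=4, base_delay=0.1):
    """Retry a write with exponential backoff and jitter; give up after max_attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return write_fn()
        except TransientDBError:
            if attempt == max_attempts:
                raise  # bounded retry: propagate the error instead of looping forever
            # exponential backoff with jitter: ~0.1s, ~0.2s, ~0.4s, ...
            time.sleep(base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5))
```

A chaos experiment would kill the DB, then assert that writes either succeed after reconnect or fail loudly after the retry budget is exhausted, never silently.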

2. Latency Spike

Latency isn't binary (working/broken). It's a spectrum — and your system must handle every point on it.

Inject 2s, 5s, and 10s delays on specific network calls. Verify: Are timeouts configured? Does the UI show loading indicators? Do downstream services queue rather than cascade-fail?

| Delay | Expected Behavior | Fail Criteria |
|---|---|---|
| 2-second delay | Slightly slower response, UI shows loading | No loading indicator, user re-clicks |
| 5-second delay | Timeout, return cached/default response | Hangs indefinitely |
| 10-second delay | Circuit breaker triggers, return fallback | Entire system cascades and crashes |
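
The "timeout, return cached/default response" row can be exercised with a hard deadline around the dependency call. A minimal sketch, assuming a thread-pool-based timeout (the function name `call_with_timeout` is hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=4)

def call_with_timeout(fn, timeout_s, fallback):
    """Run a dependency call with a hard deadline; return a fallback instead of hanging."""
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The slow call is abandoned; the caller gets a degraded but fast answer.
        return fallback
```

In a latency-spike experiment, you inject the delay into `fn` (e.g. a wrapped `time.sleep`) and assert that the caller receives the fallback within the deadline rather than hanging.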

3. Zombie Container

Containers die. The question is: does anyone notice, and does the system recover?

Stop the API container. Verify: Does the load balancer detect the unhealthy instance? Does traffic route to healthy instances? Does the container orchestrator restart the zombie? What's the recovery time?

| Scenario | Expected Behavior | Fail Criteria |
|---|---|---|
| 1 of 3 instances killed | Traffic routes to remaining 2 | Requests fail until manual restart |
| Health check fails | Load balancer stops routing to it | Load balancer keeps sending traffic |
| Container restart | Orchestrator restarts within 30s | Manual intervention required |
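
The load-balancer behavior in the table, route around instances whose health check fails, reduces to a small selection loop. This is an illustrative sketch under simplified assumptions (a boolean `healthy` flag standing in for a real health probe; `pick_instance` is a hypothetical name):

```python
class Instance:
    """Toy stand-in for a container behind a load balancer."""
    def __init__(self, name):
        self.name = name
        self.healthy = True  # in reality, set by a periodic health check

def pick_instance(instances, start=0):
    """Round-robin over instances, skipping any that fail their health check."""
    n = len(instances)
    for i in range(n):
        candidate = instances[(start + i) % n]
        if candidate.healthy:
            return candidate
    raise RuntimeError("no healthy instances")  # fail loudly, not silently
```

A zombie-container experiment marks one instance unhealthy (by stopping it) and asserts that no request is ever routed to it while it is down.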

4. Fail Fast Compliance

"Fail fast" means: detect failure immediately, report it clearly, and don't make it worse.

Verify compliance with CLAUDE.md's "fail fast" standards: errors are detected at the boundary where they occur, errors are propagated with context (not swallowed), and retry logic includes backoff and limits.

| Principle | What It Means | Verification |
|---|---|---|
| Detect at Boundary | Error caught where it occurs, not 5 layers up | Check error origin vs. catch location |
| Propagate with Context | Error message includes what, where, and why | Verify error payloads |
| Bounded Retry | Retry with backoff, not infinite loops | Inject failure, count retries |
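
"Detect at boundary" and "propagate with context" look like this in practice: catch the low-level error exactly where it occurs and re-raise a domain error that carries what, where, and why, chained to the original. A minimal sketch (`PaymentError` and `charge` are hypothetical names, not from any real library):

```python
class PaymentError(Exception):
    """Domain error that carries what/where/why context."""
    def __init__(self, what, where, why):
        super().__init__(f"{what} failed at {where}: {why}")
        self.what, self.where, self.why = what, where, why

def charge(payment_id, gateway_call):
    """Detect failure at the gateway boundary and re-raise with context, never swallow."""
    try:
        return gateway_call(payment_id)
    except ConnectionError as exc:
        # 'from exc' preserves the original traceback for debugging
        raise PaymentError(f"payment {payment_id}", "gateway boundary", str(exc)) from exc
```

Verification is then mechanical: inject a gateway failure and assert the surfaced error names the payment and chains the root cause.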

5. Recovery and Steady State

The system must not only survive failure — it must return to normal without human intervention.

After each chaos experiment, verify: Does the system return to its pre-experiment state? How long does recovery take? Is any data lost or corrupted? Define "steady state" (SLO-compliant performance) and measure time-to-steady-state after each experiment.

| Metric | Definition | Target |
|---|---|---|
| Time to Detection | How fast is failure detected? | < 30 seconds |
| Time to Mitigation | How fast is impact reduced? | < 2 minutes |
| Time to Recovery | How fast is normal state restored? | < 5 minutes |
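
Time-to-recovery can be measured the same way after every experiment: poll a steady-state predicate (e.g. "error rate and latency are within SLO") until it holds, and record how long that took. A sketch, assuming you supply the `is_steady` check (`time_to_recovery` is a hypothetical helper name):

```python
import time

def time_to_recovery(is_steady, poll_interval_s=1.0, deadline_s=300.0):
    """Poll a steady-state predicate after an experiment; return seconds until it holds."""
    start = time.monotonic()
    while time.monotonic() - start < deadline_s:
        if is_steady():
            return time.monotonic() - start
        time.sleep(poll_interval_s)
    # Failing to recover within the deadline is itself a failed experiment.
    raise TimeoutError("system did not return to steady state within deadline")
```

This pins down the ambiguity the course warns about later: "recovered" means the steady-state predicate holds, not merely that the container restarted.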

The Complete Mapping

| # | Pillar | What It Answers | Key Decision |
|---|---|---|---|
| 1 | Database Blackout | What happens when the DB dies? | Retry, degrade, or crash? |
| 2 | Latency Spike | What happens when things get slow? | Timeout, circuit break, or cascade? |
| 3 | Zombie Container | What happens when a server dies? | Auto-recover or manual restart? |
| 4 | Fail Fast | Do we detect and report errors correctly? | Boundary detection, context propagation |
| 5 | Recovery | Do we return to normal automatically? | TTD + TTM + TTR targets |
Master these 5 pillars, master resilience.


Try It Yourself — A Starter Prompt for Chaos Testing

This prompt gives you a working starting point. For the complete prompt — with steady-state definitions, blast radius controls, and automated rollback verification — see the full course chapter →.
You are a Chaos Engineer with experience in resilience testing for distributed systems.

I need a chaos test plan for:

{{PASTE YOUR SYSTEM ARCHITECTURE AND SLOs}}

Cover these 5 areas:

1. DATABASE BLACKOUT — Define the experiment: what to kill, during what operation, what to verify.
2. LATENCY SPIKE — Define delay injection points, delay durations, and expected behavior.
3. ZOMBIE CONTAINER — Define which container to kill and container recovery expectations.
4. FAIL FAST VERIFICATION — How will you verify "fail fast" compliance per CLAUDE.md?
5. RECOVERY METRICS — Define time-to-detection, time-to-mitigation, and time-to-recovery targets.

For each area, provide: the experiment and pass/fail criteria.

Format as a structured document with tables.

What This Prompt Covers vs. What It Misses

| Skill | Lite Prompt (Free) | Full Prompt (Course) | Impact of Missing It |
|---|---|---|---|
| Three chaos experiment types | ✅ Covered | ✅ Covered | — |
| Pass/fail criteria | ✅ Covered | ✅ Covered | — |
| Blast radius control | ❌ Missing | ✅ "Limit experiment to 10% of traffic" | Chaos experiment takes down production. "We were testing resilience." Peak irony. |
| Steady-state hypothesis | ❌ Missing | ✅ "Before: system meets SLOs. After: system returns to SLOs" | No baseline — you can't tell whether the system recovered because you don't know what "normal" looks like |
| Automated rollback | ❌ Missing | ✅ "If error rate > 5%, auto-abort experiment" | Experiment runs too long; real users are affected for 30 minutes |
| Cascading failure detection | ❌ Missing | ✅ "Verify failure in Service A doesn't take down Service B" | Database blackout also crashes the cache, which also crashes the API — an undetected cascade |
The Lite Prompt gets you to ~60% quality. Good enough to know what to test. Not good enough to run chaos experiments safely.


Real-World Example: Chaos Testing for an E-Commerce Checkout

The Requirement

"Chaos test the checkout service: payment processing, inventory reservation, and order confirmation. Verify the system survives: a DB blackout during payment, 5s of latency on the inventory service, and a payment container crash."

Lite Prompt Output

① DB Blackout: Kill PostgreSQL during payment write. Verify: payment doesn't go through partially.

② Latency: Inject 5s delay on inventory check. Verify: checkout shows loading state.

③ Zombie: Kill 1 of 2 payment containers. Verify: traffic routes to survivor.

④ Fail Fast: Errors should include payment ID and failure reason.

⑤ Recovery: System recovers within 5 minutes.


What a Resilience Lead Would Catch

| Area | Lite Output Says | What's Missing | Real-World Consequence |
|---|---|---|---|
| DB Blackout | "Payment doesn't go through partially" | What about the inventory reservation? If payment fails, is inventory released? | User's payment fails. Inventory remains reserved. Item shows "out of stock" for other users. Ghost reservation. |
| Latency | "Checkout shows loading state" | No timeout threshold. What if the user clicks "Pay" again during loading? | User double-clicks. Two payment charges. Two orders. Refund required. |
| Zombie | "Traffic routes to survivor" | No cold-start test. When the killed container restarts, does it handle traffic immediately? | Container restarts with a cold cache. First 100 requests hit the DB directly. DB overloaded. Second outage. |
| Fail Fast | "Errors should include payment ID" | No verification method. How do you check this automatically? | "We'll check the logs manually." Nobody checks. Errors lack payment IDs. Support can't trace failures. |
| Recovery | "5 minutes" | No measurement methodology. From when? Container restart or full SLO compliance? | Container restarts in 30 seconds, but the error rate takes 10 minutes to normalize. "Recovered" when? |
The pattern: The Lite Prompt says "inject failure, check behavior." The full course says "inject failure, control the blast radius, verify cascading effects, and measure recovery to SLO compliance."


Ready to Prove Your System Survives?

Enroll in the Fresh Graduate AI SDLC Course →

Go from "I understand chaos engineering" to "I can prove my system survives failure."