September 12, 2018 Marie H.

Chaos Engineering: Breaking Things on Purpose

There's a question I ask every team I work with when we start talking about reliability: "How do you know your system is resilient?" The answer I get most often is some version of "we think it is" or "we have monitoring." That's not good enough. Chaos engineering is the discipline of finding out for real.

The Netflix Origin

Chaos Monkey came out of Netflix in 2011. The premise was simple and deliberately confrontational: randomly terminate EC2 instances in production during business hours. If you know your instances can die at any time, you build your systems to handle it. If the first time you find out your system can't handle an instance failure is at 2am during a real outage, that's worse.

That one tool evolved into a full discipline. Netflix eventually published the Principles of Chaos Engineering, which codifies the approach:

  • Build a hypothesis around steady-state behavior
  • Vary real-world events (instance failures, network latency, disk pressure)
  • Run experiments in production
  • Automate experiments to run continuously
  • Minimize blast radius

The last point is important. This isn't reckless. It's the opposite of reckless — you're choosing the time, scope, and impact of your failures instead of letting production choose for you.

The Structure of a Chaos Experiment

Good chaos experiments have a clear structure. You don't just randomly break things and see what happens — at least not at first.

1. Define steady state. What does healthy look like? This needs to be measurable: error rate below 0.5%, p99 latency under 300ms, all pods Running, ASG at desired capacity. Vague health definitions make it impossible to know if an experiment revealed anything.

2. Form a hypothesis. "I believe that if I terminate one pod in the API deployment, the remaining pods will absorb the traffic within 30 seconds and the error rate will stay below 1%." Now you have something to test.

3. Define blast radius. What's the worst case if the hypothesis is wrong? Can you limit the scope so that "wrong" means a degraded experience for some users rather than a full outage? Scope the experiment to match your confidence level.

4. Introduce the variable. Terminate the pod. Inject the latency. Fill the disk. Do the thing.

5. Observe. Was your hypothesis correct? Watch your monitoring during the experiment. If the system behaved as expected, your confidence in that failure mode increases. If it didn't, you found a real weakness before production found it for you.

6. Learn and repeat. If you found a weakness, fix it and run the experiment again. Verify the fix actually works.

The hypothesis is what separates chaos engineering from chaos. Articulate your assumptions before you test them.
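To make step 1 concrete, the steady-state check can be a small script the experiment runs before and after injecting the fault. This is a sketch; in a real setup the error and request counts would come from your monitoring system rather than command-line arguments, and the threshold is whatever your steady-state definition says:

```shell
#!/bin/sh
# Steady-state check: is the error rate below 0.5%?
# errors/total default to example values; pass real numbers as arguments.
errors=${1:-42}
total=${2:-10000}
threshold_bp=50   # 0.5% expressed in basis points

# Integer math in basis points avoids floating point in plain sh
rate_bp=$(( errors * 10000 / total ))

if [ "$rate_bp" -lt "$threshold_bp" ]; then
  echo "steady: error rate ${rate_bp}bp is below ${threshold_bp}bp"
else
  echo "not steady: error rate ${rate_bp}bp is at or above ${threshold_bp}bp"
  exit 1
fi
```

Run it before the experiment to confirm the baseline, and during or after to test the hypothesis; the non-zero exit code gives you something an automated experiment can use to abort.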

Tooling

Gremlin is the commercial option and it's polished. You install an agent on your hosts or in your cluster, then use the Gremlin UI or API to apply attacks: CPU, memory, disk, network latency, packet loss, process kills. The blast radius controls are good — you can target by tag, by service, by percentage. It has a rollback button that actually works. If you're in an org that needs audit trails and approval workflows for chaos experiments, Gremlin is the pragmatic choice.

# Install Gremlin agent on Linux
curl https://rpm.gremlin.com/gremlin.repo -o /etc/yum.repos.d/gremlin.repo
yum install -y gremlin gremlind

# Authenticate
gremlin init

# Terminate a random container in a pod (via CLI)
gremlin attack container --length 60 --target random --impact shutdown

Chaos Monkey is the original open-source tool from Netflix, now part of the Spinnaker ecosystem. It's designed specifically for randomly terminating instances in Auto Scaling Groups on a schedule. If you're not running Spinnaker, setup friction is real — the current version is built to run against Spinnaker's deployment model rather than standalone. Worth knowing about, but Gremlin or rolling your own is more practical for most teams.

chaoskube is my preferred Kubernetes-specific tool for pod killing. It's a simple deployment that watches your cluster and randomly deletes pods on a configurable schedule. No agent installation, no external dependencies, runs entirely in the cluster.

# Install chaoskube via Helm (Helm 2 syntax, which requires --name)
helm install --name chaoskube stable/chaoskube \
  --set interval=10m \
  --set namespaces=production \
  --set labels="app=api" \
  --set dryRun=false

# Or delete one random matching pod manually with kubectl
# (-o name emits pod/<name>, which kubectl delete accepts directly)
kubectl delete -n production \
  $(kubectl get pods -n production -l app=api \
      --field-selector=status.phase=Running -o name | shuf -n 1)

The labels filter on chaoskube is key — you can target a specific service rather than deleting pods at random across the whole cluster.
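chaoskube also has flags for restricting when it runs, which is another form of blast-radius control. A sketch of the container args (flag names taken from chaoskube's README; verify them against the version you deploy):

```yaml
# Illustrative chaoskube args: scope what gets killed, and when
args:
  - --interval=10m
  - --namespaces=production
  - --labels=app=api
  - --excluded-weekdays=Sat,Sun   # only kill during the work week
  - --timezone=UTC
  - --minimum-age=6h              # leave freshly scheduled pods alone
  - --no-dry-run                  # chaoskube defaults to dry-run mode
```

Killing pods only when engineers are at their desks is the same reasoning Netflix applied with business-hours-only instance termination.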

What Happens When You Kill Half Your API Pods

Let me walk through a real experiment I ran on a client's staging cluster. They had a 6-pod API deployment behind a Kubernetes Service and an ALB ingress. Hypothesis: killing 3 pods should cause a brief spike in errors while Kubernetes reschedules them, but the remaining 3 should absorb traffic within 60 seconds and the overall error rate should stay below 5%.

# Start a load test in another terminal first
hey -n 10000 -c 50 -q 100 https://api.staging.example.com/health

# Kill half the pods (kubectl delete accepts the pod/<name> form
# emitted by -o name, so no sed cleanup is needed)
kubectl delete -n production \
  $(kubectl get pods -n production -l app=api -o name | shuf -n 3 | tr '\n' ' ')

What we observed: error rate hit 12% for about 45 seconds, then came back down to 0%. Recovery time was acceptable, but the error spike was higher than hypothesized. We dug into why.

The problem was the default terminationGracePeriodSeconds of 30 seconds combined with the ALB not draining connections before the pod received SIGTERM. Connections were being dropped mid-request. The fix was adding a preStop hook with a sleep to give the load balancer time to deregister the pod before it stopped accepting connections:

lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 15"]

Re-ran the experiment. Error rate peak dropped to 0.3%. That's what a chaos experiment is supposed to produce: a real configuration problem that no amount of code review would have caught.
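For reference, the fix sits in the pod spec next to the grace period. A sketch of the relevant Deployment fragment (field names are standard Kubernetes; the container name is illustrative) — the key detail is that terminationGracePeriodSeconds has to cover the preStop sleep plus the application's own shutdown time, or the kubelet will SIGKILL the container mid-drain:

```yaml
spec:
  template:
    spec:
      # Must cover the preStop sleep plus the app's own shutdown time
      terminationGracePeriodSeconds: 45
      containers:
        - name: api   # illustrative container name
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 15"]
```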

Blast Radius and Starting in Non-Prod

I'll say this plainly: do not start chaos experiments in production.

Start in staging. Learn what breaks. Fix it. Run the experiment again. Once you have confidence that the experiment produces the expected result consistently, and you understand what "unexpected" looks like, then you can graduate it to production with a tightly scoped blast radius.

When you do run in production, limit scope aggressively. Kill one pod, not three. Inject 50ms of latency on 5% of requests, not all of them. Have a kill switch ready — know exactly how to stop the experiment immediately. For chaoskube, that's scaling its deployment to 0 replicas. For Gremlin, it's the halt button. Make sure someone is watching the monitoring dashboards for the entire time the experiment runs.

Game Days

The highest-leverage chaos engineering practice isn't automated experiments — it's game days. A game day is a scheduled event where you deliberately introduce failures in production (or as-close-to-production-as-you-can-get) and observe both how the system responds and how the team responds.

The system part you can test with tooling. The team part — can your on-call engineer find the problem? Is the runbook accurate? Do your alerts fire? Does the escalation path work? — you can only test with real humans in a real scenario.

Run a game day every quarter at minimum. Document what you find. Fix the gaps. Game days are uncomfortable the first time you run one. That discomfort is valuable information.

Where to Start

You don't need Gremlin and a sophisticated chaos platform to start. You need a hypothesis, a measurable steady state, and the willingness to break something in staging.

Start with the most basic experiment: kill one instance, one pod, one process. Watch what happens. I guarantee you'll find something you didn't expect. That's not a failure — that's the point.
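A minimal version of that first experiment, sketched as a script with illustrative names (the deployment is assumed to be called api). It defaults to dry-run mode, which only prints the mutating commands, so you have to opt in before it touches a cluster:

```shell
#!/bin/sh
# Kill one random pod matching app=api and watch the deployment recover.
# DRY_RUN=1 (the default) prints the mutating commands instead of running them.
NAMESPACE=${NAMESPACE:-staging}
SELECTOR=${SELECTOR:-app=api}
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

if [ "$DRY_RUN" = "1" ]; then
  victim="pod/api-example-pod"   # placeholder victim in dry-run mode
else
  victim=$(kubectl get pods -n "$NAMESPACE" -l "$SELECTOR" -o name | shuf -n 1)
fi

run kubectl delete "$victim" -n "$NAMESPACE"
run kubectl rollout status deployment/api -n "$NAMESPACE" --timeout=120s
```

Run it once in dry-run mode to see what it would do, then set DRY_RUN=0 with your load test and dashboards already up.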

The teams that are best at reliability aren't the ones who've never had an outage. They're the ones who practice failing so that when something actually breaks, it's nothing they haven't seen before.