Introduction
Chaos Monkey is a resiliency tool from Netflix that randomly terminates instances in production. It was born from the Simian Army philosophy: if your infrastructure cannot handle random failure gracefully, it is better to discover that during business hours than at 3 a.m.
What Chaos Monkey Does
- Randomly terminates virtual machine instances or containers in a target environment
- Integrates with Spinnaker to discover applications and their server groups
- Allows per-app opt-in/opt-out through Spinnaker application properties
- Uses a configurable schedule and probability to control termination frequency
- Tracks termination events in a MySQL database for auditing
Architecture Overview
Chaos Monkey is a standalone Go binary that talks to Spinnaker's REST API to enumerate applications, clusters, and server groups. On each scheduled run it selects eligible instances based on a configurable mean time between kills and a grouping strategy, then calls Spinnaker to terminate each chosen instance. State and scheduling data persist in a MySQL backend.
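The selection step described above can be sketched in Go. This is a minimal illustration, not code from the Chaos Monkey source: it assumes one independent biased coin flip per group per work day, with probability 1/n for a mean time between kills of n work days. The names Instance, newRNG, killProbability, and pickVictim are hypothetical.

```go
package main

import (
	"math/rand"
)

// Instance is a hypothetical, minimal view of a running instance.
type Instance struct {
	ID string
}

// newRNG returns a seeded random source so that runs are reproducible.
func newRNG(seed int64) *rand.Rand {
	return rand.New(rand.NewSource(seed))
}

// killProbability converts a mean time between kills (in work days) into a
// per-day probability, assuming one independent coin flip per work day.
// This rule is an assumption made for the sketch.
func killProbability(meanWorkDaysBetweenKills int) float64 {
	if meanWorkDaysBetweenKills <= 0 {
		return 0 // treat a non-positive setting as "never terminate"
	}
	return 1.0 / float64(meanWorkDaysBetweenKills)
}

// pickVictim flips the biased coin for one group and, on success, returns a
// uniformly chosen instance from that group. The bool is false when no
// instance is selected.
func pickVictim(rng *rand.Rand, group []Instance, meanWorkDaysBetweenKills int) (Instance, bool) {
	if len(group) == 0 {
		return Instance{}, false
	}
	if rng.Float64() >= killProbability(meanWorkDaysBetweenKills) {
		return Instance{}, false
	}
	return group[rng.Intn(len(group))], true
}
```

Calling pickVictim(newRNG(42), group, 5) would select an instance on roughly one work day in five under these assumptions. The actual termination would then go through Spinnaker, which this sketch leaves out.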
Self-Hosting & Configuration
- Requires a running Spinnaker deployment with Chaos Monkey support enabled
- MySQL 5.6+ stores termination schedules and event history
- Configuration lives in a TOML file specifying Spinnaker endpoint, database DSN, and schedule
- The chaosmonkey migrate command initializes or upgrades the database schema
- Environment-level toggles let you disable terminations without redeploying
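To make the configuration items above concrete, here is a hedged Go sketch of settings that such a TOML file might map onto. The Settings struct and its field names are hypothetical and do not reproduce Chaos Monkey's real configuration keys. The DSN layout follows the user:password@tcp(host:port)/dbname form used by the common go-sql-driver/mysql driver, which is assumed here rather than confirmed by the source.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// Settings is a hypothetical in-memory view of the configuration file.
type Settings struct {
	Enabled           bool   // global toggle: false disables all terminations
	SpinnakerEndpoint string // base URL of the Spinnaker API
	DBUser            string
	DBPassword        string
	DBHost            string
	DBPort            int
	DBName            string
	StartHour         int // first hour of the day (0-23) when kills may happen
	EndHour           int // hour of the day (1-24) when kills stop
}

// Validate performs basic sanity checks before the settings are used.
func (s Settings) Validate() error {
	u, err := url.Parse(s.SpinnakerEndpoint)
	if err != nil || u.Scheme == "" || u.Host == "" {
		return errors.New("spinnaker endpoint must be an absolute URL")
	}
	if s.DBHost == "" || s.DBName == "" {
		return errors.New("database host and name are required")
	}
	if s.StartHour < 0 || s.EndHour > 24 || s.StartHour >= s.EndHour {
		return errors.New("start hour must be before end hour, both within a day")
	}
	return nil
}

// DSN builds a MySQL data source name in the assumed driver format.
func (s Settings) DSN() string {
	return fmt.Sprintf("%s:%s@tcp(%s:%d)/%s",
		s.DBUser, s.DBPassword, s.DBHost, s.DBPort, s.DBName)
}

// TerminationsAllowed reports whether a kill may run at the given hour.
// Flipping Enabled to false blocks every termination without a redeploy.
func (s Settings) TerminationsAllowed(hour int) bool {
	return s.Enabled && hour >= s.StartHour && hour < s.EndHour
}
```

Keeping the toggle and the time window in one predicate means the scheduler has a single place to ask before acting, which is one plausible way to implement the environment-level switch mentioned above.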
Key Features
- Battle-tested at Netflix scale across thousands of microservices
- Configurable mean-time-between-kills per application or cluster
- Grouping strategies: app, stack, or cluster granularity
- Dry-run mode logs what would be terminated without acting
- MySQL-backed audit trail of every termination event
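The grouping strategies and dry-run mode listed above could be modeled roughly as follows. This is an illustrative Go sketch under stated assumptions: the types GroupedInstance, Grouping, Terminator, and DryRunTerminator, and the key format, are hypothetical and not taken from Chaos Monkey itself.

```go
package main

// GroupedInstance is a hypothetical record carrying the Spinnaker-style
// coordinates of one instance.
type GroupedInstance struct {
	App, Stack, Cluster, ID string
}

// Grouping selects the granularity at which instances are pooled before
// one victim is chosen per pool.
type Grouping int

const (
	GroupByApp Grouping = iota
	GroupByStack
	GroupByCluster
)

// groupKey returns the pool name for an instance under a grouping strategy.
func groupKey(g Grouping, in GroupedInstance) string {
	switch g {
	case GroupByStack:
		return in.App + "/" + in.Stack
	case GroupByCluster:
		return in.App + "/" + in.Cluster
	default:
		return in.App
	}
}

// partition splits instances into pools keyed by groupKey.
func partition(g Grouping, instances []GroupedInstance) map[string][]GroupedInstance {
	pools := make(map[string][]GroupedInstance)
	for _, in := range instances {
		key := groupKey(g, in)
		pools[key] = append(pools[key], in)
	}
	return pools
}

// Terminator abstracts the act of terminating an instance so that a
// dry-run implementation can stand in for the real one.
type Terminator interface {
	Terminate(id string) error
}

// DryRunTerminator records what would be terminated without acting.
type DryRunTerminator struct {
	Log []string
}

// Terminate appends a log line instead of calling any external API.
func (d *DryRunTerminator) Terminate(id string) error {
	d.Log = append(d.Log, "would terminate "+id)
	return nil
}
```

With app grouping, at most one instance per application is a candidate on a given run; cluster grouping yields one candidate per cluster, so finer granularity means more potential terminations. Swapping in DryRunTerminator lets the same selection logic run safely before it is pointed at a real environment.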
Comparison with Similar Tools
- Chaos Mesh: Kubernetes-native with network, I/O, and pod chaos; broader fault types but more complex setup
- Litmus: CNCF project offering experiment-as-code with a ChaosHub marketplace
- ChaosBlade: Alibaba's toolkit covering JVM, container, and network faults
- Gremlin: Commercial SaaS with managed chaos experiments and GameDay support
- Pumba: Docker-focused tool for container kill, pause, and network emulation
FAQ
Q: Does Chaos Monkey only work with AWS? A: No. It works with any backend that Spinnaker supports, including AWS, GCP, Azure, and Kubernetes.
Q: Can I limit which apps are affected? A: Yes. Each Spinnaker application opts in or out, and you can set per-app probability and grouping.
Q: Is a Spinnaker deployment mandatory? A: The open-source version requires Spinnaker for instance discovery and termination. Alternatives like Chaos Mesh or Litmus work without it.
Q: How is Chaos Monkey different from the broader Simian Army? A: Simian Army was a collection of tools (Latency Monkey, Conformity Monkey, etc.). Chaos Monkey is the instance-termination component, and the only one Netflix open-sourced in Go.