# Chaos Monkey: Random Instance Failure Injection by Netflix

> Chaos Monkey randomly terminates virtual machine instances and containers in production to encourage engineers to build resilient services that tolerate unexpected failures.

## Quick Use

```bash
# Clone and build
git clone https://github.com/Netflix/chaosmonkey.git
cd chaosmonkey
go build -o chaosmonkey ./cmd/chaosmonkey

# Enable terminations (the Spinnaker endpoint and database settings go in the TOML config file)
export CHAOSMONKEY_ENABLED=true

./chaosmonkey migrate    # set up the MySQL schema
./chaosmonkey schedule   # generate today's termination schedule
```

## Introduction

Chaos Monkey is a resiliency tool from Netflix that randomly terminates instances in production. It is the original member of Netflix's Simian Army and embodies its philosophy: if your infrastructure cannot handle random failure gracefully, it is better to discover that during business hours than at 3 a.m.

## What Chaos Monkey Does

- Randomly terminates virtual machine instances or containers in a target environment
- Integrates with Spinnaker to discover applications and their server groups
- Allows per-app opt-in/opt-out through Spinnaker application properties
- Uses a configurable schedule and probability to control termination frequency
- Tracks termination events in a MySQL database for auditing

## Architecture Overview

Chaos Monkey is a standalone Go binary that talks to Spinnaker's REST API to enumerate applications, clusters, and server groups. On each scheduled run it selects eligible instances according to a configurable mean time between kills and a grouping strategy, then calls Spinnaker to terminate the chosen instance. State and scheduling data persist in a MySQL backend.

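
The selection step described above can be sketched in a few lines. This is an illustrative model, not Chaos Monkey's actual Go code: it assumes the daily kill probability for a group is the reciprocal of the mean time between kills (in work days), and that at most one instance per group is chosen per day. The function names and the `app`/`stack`/`cluster` dictionary fields are hypothetical.

```python
import random
from collections import defaultdict


def daily_kill_probability(mean_work_days_between_kills):
    """Chance that a group loses an instance on a given work day.

    Assumption: probability = 1 / mean time between kills.
    """
    if mean_work_days_between_kills < 1:
        raise ValueError("mean time between kills must be at least 1 work day")
    return 1.0 / mean_work_days_between_kills


def group_key(instance, grouping):
    """Map an instance to its group under the chosen grouping strategy."""
    if grouping == "app":
        return (instance["app"],)
    if grouping == "stack":
        return (instance["app"], instance["stack"])
    if grouping == "cluster":
        return (instance["app"], instance["stack"], instance["cluster"])
    raise ValueError(f"unknown grouping: {grouping!r}")


def pick_victims(instances, grouping, mean_work_days_between_kills, rng=random):
    """Flip a weighted coin per group; on success, pick one instance at random."""
    groups = defaultdict(list)
    for inst in instances:
        groups[group_key(inst, grouping)].append(inst)

    p = daily_kill_probability(mean_work_days_between_kills)
    victims = []
    for key in sorted(groups):  # sorted for a deterministic iteration order
        if rng.random() < p:
            victims.append(rng.choice(groups[key]))
    return victims
```

Under this model, a coarser grouping (`app`) produces fewer terminations per day than a finer one (`cluster`) for the same mean time between kills, because the coin is flipped once per group rather than once per instance.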
## Self-Hosting & Configuration

- Requires a running Spinnaker deployment with Chaos Monkey support enabled
- MySQL 5.6+ stores termination schedules and event history
- Configuration lives in a TOML file specifying the Spinnaker endpoint, database connection settings, and schedule
- `chaosmonkey migrate` initializes or upgrades the database schema
- Environment-level toggles let you disable terminations without redeploying

## Key Features

- Battle-tested at Netflix scale across thousands of microservices
- Configurable mean time between kills per application or cluster
- Grouping strategies: app, stack, or cluster granularity
- Dry-run mode logs what would be terminated without acting
- MySQL-backed audit trail of every termination event

## Comparison with Similar Tools

- **Chaos Mesh**: Kubernetes-native with network, I/O, and pod chaos; broader fault types but more complex setup
- **Litmus**: CNCF project offering experiment-as-code with a ChaosHub marketplace
- **ChaosBlade**: Alibaba's toolkit covering JVM, container, and network faults
- **Gremlin**: Commercial SaaS with managed chaos experiments and GameDay support
- **Pumba**: Docker-focused tool for container kill, pause, and network emulation

## FAQ

**Q: Does Chaos Monkey only work with AWS?**
A: No. It should work with any backend that Spinnaker supports, including AWS, GCP, Azure, and Kubernetes.

**Q: Can I limit which apps are affected?**
A: Yes. Each Spinnaker application opts in or out, and you can set the mean time between kills and the grouping per app.

**Q: Is a Spinnaker deployment mandatory?**
A: Yes. The open-source version requires Spinnaker for instance discovery and termination. Alternatives like Chaos Mesh or Litmus work without it.

**Q: How is Chaos Monkey different from the broader Simian Army?**
A: Simian Army was a collection of tools (Latency Monkey, Conformity Monkey, etc.). Chaos Monkey is the instance-termination component, and the only one Netflix released as a standalone Go project.

## Sources

- https://github.com/Netflix/chaosmonkey
- https://netflix.github.io/chaosmonkey/

---

Source: https://tokrepo.com/en/workflows/asset-0041af30
Author: Script Depot