Scripts · May 15, 2026 · 3 min read

Chaos Monkey — Random Instance Failure Injection by Netflix

Chaos Monkey randomly terminates virtual machine instances and containers in production to encourage engineers to build resilient services that tolerate unexpected failures.

Introduction

Chaos Monkey is a resiliency tool from Netflix that randomly terminates instances in production. It was the first member of Netflix's Simian Army and rests on a simple premise: if your infrastructure cannot handle random failure gracefully, it is better to discover that during business hours, with engineers on hand, than at 3 a.m.

What Chaos Monkey Does

  • Randomly terminates virtual machine instances or containers in a target environment
  • Integrates with Spinnaker to discover applications and their server groups
  • Allows per-app opt-in/opt-out through Spinnaker application properties
  • Uses a configurable schedule and probability to control termination frequency
  • Tracks termination events in a MySQL database for auditing

Architecture Overview

Chaos Monkey is a standalone Go binary that talks to Spinnaker's REST API to enumerate applications, clusters, and server groups. On each scheduled run it selects eligible instances based on configurable mean-time-between-kills and grouping strategy, then calls Spinnaker to terminate the chosen instance. State and scheduling data persist in a MySQL backend.
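The selection step described above can be sketched in a few lines of Go. This is an illustrative model, not Chaos Monkey's actual source: it assumes one biased coin flip per group per workday, with probability 1/n for a mean time between kills of n workdays, followed by a uniform pick within the group. The names Instance, NewRNG, DailyKillProbability, and PickVictim are hypothetical.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Instance is a hypothetical, simplified view of a running instance.
type Instance struct {
	ID      string
	Cluster string
}

// NewRNG returns a seeded random source so runs can be reproduced.
func NewRNG(seed int64) *rand.Rand {
	return rand.New(rand.NewSource(seed))
}

// DailyKillProbability converts a mean time between kills (in workdays)
// into a per-day probability. Assumption: one independent trial per group
// per workday, so the expected gap between kills is 1/p days.
// Values below 1 are treated as "never terminate" in this sketch.
func DailyKillProbability(meanWorkdays int) float64 {
	if meanWorkdays < 1 {
		return 0
	}
	return 1.0 / float64(meanWorkdays)
}

// PickVictim flips the biased coin for one group and, if it lands on
// "kill", returns a uniformly chosen instance from that group.
func PickVictim(rng *rand.Rand, group []Instance, p float64) (Instance, bool) {
	if len(group) == 0 || rng.Float64() >= p {
		return Instance{}, false
	}
	return group[rng.Intn(len(group))], true
}

func main() {
	rng := NewRNG(42)
	group := []Instance{
		{ID: "i-aaa", Cluster: "api-prod"},
		{ID: "i-bbb", Cluster: "api-prod"},
	}
	p := DailyKillProbability(5) // mean of one kill every 5 workdays
	if victim, ok := PickVictim(rng, group, p); ok {
		fmt.Println("would terminate", victim.ID)
	} else {
		fmt.Println("no termination scheduled today")
	}
}
```

In a real deployment the chosen instance would be handed to Spinnaker for termination; here the result is only printed.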

Self-Hosting & Configuration

  • Requires a running Spinnaker deployment with Chaos Monkey support enabled
  • MySQL 5.6+ stores termination schedules and event history
  • Configuration lives in a TOML file specifying Spinnaker endpoint, database DSN, and schedule
  • chaosmonkey migrate initializes or upgrades the database schema
  • Environment-level toggles let you disable terminations without redeploying
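As a rough illustration of the settings listed above, the following Go sketch models them as a struct with basic validation and a single gate that decides whether terminations may run. The field names are hypothetical and do not correspond to Chaos Monkey's actual TOML keys; consult the project's own configuration reference for those.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// Config is a hypothetical in-memory form of the settings described above.
// Field names are illustrative, not Chaos Monkey's real configuration keys.
type Config struct {
	Enabled           bool   // master toggle: false disables all terminations
	SpinnakerEndpoint string // base URL of the Spinnaker API
	DatabaseDSN       string // MySQL connection string
	StartHour         int    // first hour (0-23) in which terminations may run
	EndHour           int    // hour (0-23) at which terminations stop
}

// Validate checks that the endpoint, DSN, and schedule window are usable.
func (c Config) Validate() error {
	u, err := url.Parse(c.SpinnakerEndpoint)
	if err != nil || u.Scheme == "" || u.Host == "" {
		return errors.New("spinnaker endpoint must be an absolute URL")
	}
	if c.DatabaseDSN == "" {
		return errors.New("database DSN must not be empty")
	}
	if c.StartHour < 0 || c.EndHour > 23 || c.StartHour >= c.EndHour {
		return errors.New("schedule window must satisfy 0 <= start < end <= 23")
	}
	return nil
}

// MayTerminate reports whether a termination is allowed at the given hour:
// the config must be valid, the toggle on, and the hour inside the window.
func (c Config) MayTerminate(hour int) bool {
	return c.Validate() == nil && c.Enabled &&
		hour >= c.StartHour && hour < c.EndHour
}

func main() {
	cfg := Config{
		Enabled:           true,
		SpinnakerEndpoint: "http://spinnaker.example.com:8084",
		DatabaseDSN:       "chaosmonkey:secret@tcp(db.example.com:3306)/chaosmonkey",
		StartHour:         9,
		EndHour:           15,
	}
	fmt.Println("config error:", cfg.Validate())
	fmt.Println("may terminate at 10:00:", cfg.MayTerminate(10))
	cfg.Enabled = false // flipping the toggle stops terminations immediately
	fmt.Println("may terminate when disabled:", cfg.MayTerminate(10))
}
```

Keeping the toggle in configuration rather than in code is what lets operators pause terminations without shipping a new build.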

Key Features

  • Battle-tested at Netflix scale across thousands of microservices
  • Configurable mean-time-between-kills per application or cluster
  • Grouping strategies: app, stack, or cluster granularity
  • Dry-run mode logs what would be terminated without acting
  • MySQL-backed audit trail of every termination event
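To make the grouping strategies and dry-run mode concrete, here is a minimal Go sketch. It assumes that each group is a pool from which at most one instance is chosen per run, and that dry-run mode replaces the termination call with a log message. The types and functions (Grouping, GroupKey, GroupInstances, Terminate) are hypothetical and not taken from Chaos Monkey's code.

```go
package main

import (
	"fmt"
	"sort"
)

// Grouping selects how instances are pooled before one is chosen.
type Grouping int

const (
	ByApp Grouping = iota
	ByStack
	ByCluster
)

// Instance is a hypothetical, simplified instance record.
type Instance struct {
	ID, App, Stack, Cluster string
}

// GroupKey builds the pool key for an instance under a grouping strategy.
func GroupKey(inst Instance, g Grouping) string {
	switch g {
	case ByStack:
		return inst.App + "/" + inst.Stack
	case ByCluster:
		return inst.App + "/" + inst.Stack + "/" + inst.Cluster
	default: // ByApp
		return inst.App
	}
}

// GroupInstances pools instance IDs by the chosen strategy.
func GroupInstances(instances []Instance, g Grouping) map[string][]string {
	groups := make(map[string][]string)
	for _, inst := range instances {
		key := GroupKey(inst, g)
		groups[key] = append(groups[key], inst.ID)
	}
	return groups
}

// Terminate calls kill for the instance, or only reports what would
// happen when dryRun is set.
func Terminate(id string, dryRun bool, kill func(string) error) (string, error) {
	if dryRun {
		return "dry-run: would terminate " + id, nil
	}
	if err := kill(id); err != nil {
		return "", err
	}
	return "terminated " + id, nil
}

func main() {
	fleet := []Instance{
		{ID: "i-aaa", App: "api", Stack: "prod", Cluster: "main"},
		{ID: "i-bbb", App: "api", Stack: "prod", Cluster: "main"},
		{ID: "i-ccc", App: "api", Stack: "prod", Cluster: "canary"},
	}
	groups := GroupInstances(fleet, ByCluster)
	keys := make([]string, 0, len(groups))
	for k := range groups {
		keys = append(keys, k)
	}
	sort.Strings(keys) // map order is random; sort for stable output
	for _, k := range keys {
		fmt.Println(k, groups[k])
	}
	msg, _ := Terminate("i-aaa", true, func(string) error { return nil })
	fmt.Println(msg)
}
```

Coarser grouping (app) means fewer terminations overall, while cluster grouping exercises every cluster independently.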

Comparison with Similar Tools

  • Chaos Mesh — Kubernetes-native with network, I/O, and pod chaos; broader fault types but more complex setup
  • Litmus — CNCF project offering experiment-as-code with a ChaosHub marketplace
  • ChaosBlade — Alibaba's toolkit covering JVM, container, and network faults
  • Gremlin — Commercial SaaS with managed chaos experiments and GameDay support
  • Pumba — Docker-focused tool for container kill, pause, and network emulation

FAQ

Q: Does Chaos Monkey only work with AWS? A: No. It should work with any backend that Spinnaker supports, such as AWS, Google Compute Engine, Azure, and Kubernetes.

Q: Can I limit which apps are affected? A: Yes. Each Spinnaker application opts in or out, and you can set the mean time between kills and the grouping strategy per app.

Q: Is a Spinnaker deployment mandatory? A: The open-source version requires Spinnaker for instance discovery and termination. Alternatives like Chaos Mesh or Litmus work without it.

Q: How is Chaos Monkey different from the broader Simian Army? A: The Simian Army was a collection of tools (Latency Monkey, Conformity Monkey, etc.). Chaos Monkey is its instance-termination component, and the current open-source version is a standalone Go rewrite.
