Introduction
Chaos Mesh is a CNCF incubating project that lets platform teams run controlled, reproducible failure experiments against a live Kubernetes cluster. By expressing faults as CRDs, teams can version, schedule, and gate experiments in CI the same way they manage any other Kubernetes resource — essential for proving resilience claims.
What Chaos Mesh Does
- Injects pod failures (kill, container-kill, pod-failure)
- Simulates network chaos: latency, packet loss, bandwidth throttle, partition
- Injects disk IO faults: read/write latency, errors, fill-up
- Skews clocks and injects DNS, HTTP, and kernel faults (kernel chaos via BPF)
- Orchestrates multi-step Workflows for complex game-day scenarios
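To make the fault types above concrete, here is a minimal sketch of a network-latency experiment built as a plain Python dict and printed as JSON, which kubectl accepts alongside YAML. The field names follow the chaos-mesh.org/v1alpha1 NetworkChaos schema as commonly documented; check them against the CRD version installed in your cluster. The experiment name web-latency matches the FAQ example, while the staging namespace and the app=web label are assumptions chosen for illustration.

```python
import json


def network_delay_manifest(name, namespace, app_label, latency="100ms", duration="60s"):
    """Build a NetworkChaos manifest that delays traffic for pods labelled app=<app_label>."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "NetworkChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "delay",   # one of the network actions: delay, loss, partition, ...
            "mode": "all",       # apply to every pod the selector matches
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": {"app": app_label},
            },
            "delay": {"latency": latency},
            "duration": duration,  # the fault is removed when this elapses
        },
    }


if __name__ == "__main__":
    # Namespace and label are illustrative assumptions, not values from this document.
    print(json.dumps(network_delay_manifest("web-latency", "staging", "web"), indent=2))
```

Saved as a script (the file name is up to you), the output can be piped straight into kubectl apply -f -, so the same file can be reviewed and versioned like any other manifest generator.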
Architecture Overview
Chaos Mesh ships a controller manager, a per-node chaos-daemon DaemonSet (which uses nsenter/iptables/tc/BPF for kernel-level injection), and a React dashboard. CRDs declare the experiment; the controller resolves target pods, instructs the daemons, and records status transitions. Experiments are cleaned up automatically at duration end or when the CR is deleted.
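The target-resolution step described above can be pictured with a toy model. This is not Chaos Mesh's controller code: it is a simplified sketch that filters an in-memory list of pods by namespace and labels, then narrows the set by mode. The mode names (all, one, fixed, fixed-percent) mirror values the CRDs accept, but the pod dict shape, the function name, and the rounding rule for percentages are assumptions made for this example.

```python
import math
import random


def select_targets(pods, namespaces, label_selectors, mode="all", value=None, rng=None):
    """Filter pods by namespace and labels, then narrow the set according to mode.

    Each pod is a dict with "name", "namespace", and "labels" keys (a hypothetical shape).
    """
    rng = rng or random.Random()
    matched = [
        p for p in pods
        if p["namespace"] in namespaces
        and all(p["labels"].get(k) == v for k, v in label_selectors.items())
    ]
    if mode == "all":
        return matched
    if mode == "one":
        return [rng.choice(matched)] if matched else []
    if mode == "fixed":
        # Never ask for more pods than matched.
        return rng.sample(matched, min(int(value), len(matched)))
    if mode == "fixed-percent":
        # Rounding down is an assumption of this sketch.
        count = math.floor(len(matched) * int(value) / 100)
        return rng.sample(matched, count)
    raise ValueError(f"unsupported mode: {mode}")
```

Keeping the selector narrow and the mode conservative (for example, one instead of all) is the simplest way to limit blast radius before an experiment ever reaches the daemons.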
Self-Hosting & Configuration
- Helm chart or Operator; chaos-mesh and chaos-daemon run cluster-wide
- RBAC: restrict namespaces via chaos-mesh.org/inject: enabled annotations
- Dashboard with Google/GitHub/OIDC SSO; chaosctl CLI for scripting
- Integrations: Argo Workflows, GitHub Actions, Litmus via CRD
- Observability: Prometheus metrics, experiment events shipped to OpenTelemetry
Key Features
- Pure CRD interface — GitOps and code review friendly
- Rich fault taxonomy (pod, net, IO, DNS, HTTP, kernel, time)
- Schedule + Workflow resources for recurring and multi-step drills
- Safety switches: dry-run, blast-radius labels, auto-recovery on CR deletion
- CNCF incubating project with active PingCAP + community maintainers
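As an illustration of the Schedule and Workflow resources listed above, the sketch below builds one of each as Python dicts. Field names (schedule, concurrencyPolicy, historyLimit, type, entry, templates, templateType, deadline, children) follow the chaos-mesh.org/v1alpha1 API as commonly documented and should be verified against your installed CRDs. The function names, resource names, cron expression, and deadlines are assumptions for the example.

```python
import json

API_VERSION = "chaos-mesh.org/v1alpha1"


def weekly_pod_kill_schedule(name, namespace, app_label, cron="0 3 * * 1"):
    """Schedule that kills one matching pod on each cron tick (default: Mondays at 03:00)."""
    return {
        "apiVersion": API_VERSION,
        "kind": "Schedule",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "schedule": cron,
            "concurrencyPolicy": "Forbid",  # skip a tick if the previous run is still active
            "historyLimit": 5,
            "type": "PodChaos",
            "podChaos": {
                "action": "pod-kill",
                "mode": "one",
                "selector": {
                    "namespaces": [namespace],
                    "labelSelectors": {"app": app_label},
                },
            },
        },
    }


def serial_drill_workflow(name, namespace, app_label):
    """Workflow that runs a network delay, then a pod failure, one after the other."""
    selector = {"namespaces": [namespace], "labelSelectors": {"app": app_label}}
    return {
        "apiVersion": API_VERSION,
        "kind": "Workflow",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "entry": "drill",
            "templates": [
                # The Serial template runs its children in order.
                {"name": "drill", "templateType": "Serial", "deadline": "300s",
                 "children": ["slow-network", "fail-pod"]},
                {"name": "slow-network", "templateType": "NetworkChaos", "deadline": "120s",
                 "networkChaos": {"action": "delay", "mode": "all", "selector": selector,
                                  "delay": {"latency": "100ms"}}},
                {"name": "fail-pod", "templateType": "PodChaos", "deadline": "60s",
                 "podChaos": {"action": "pod-failure", "mode": "one", "selector": selector}},
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(weekly_pod_kill_schedule("weekly-pod-kill", "staging", "web"), indent=2))
    print(json.dumps(serial_drill_workflow("game-day", "staging", "web"), indent=2))
```

Inside a Workflow, each step's lifetime comes from the template's deadline rather than a duration field on the embedded experiment, which is why the two chaos templates above carry no duration.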
Comparison with Similar Tools
- LitmusChaos — similar CNCF project; experiment-hub workflow, different CR model
- Gremlin — commercial SaaS; richer UI, paid per-target
- Chaos Monkey (Netflix) — original, EC2-only, limited to instance termination
- AWS Fault Injection Simulator — AWS-native; tightly coupled to AWS APIs
- Powerful Seal — older, less active; mostly pod-kill scope
FAQ
Q: Is it safe for production?
A: With proper namespace selectors, blast-radius labels, and approvals, many teams run Chaos Mesh in prod game-days. Start in staging.
Q: Does it need privileged pods?
A: Yes. chaos-daemon needs host network + capabilities for iptables/tc injection.
Q: Can I run experiments in CI?
A: Yes. chaosctl or raw kubectl in GitHub Actions; assert recovery via Prometheus queries.
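A recovery assertion of the kind mentioned in this answer might look like the sketch below. It uses only the standard library and the Prometheus HTTP API's instant-query endpoint (/api/v1/query), whose JSON response carries each sample as a [timestamp, "value"] pair. The function names, the Prometheus URL, the PromQL expression, and the threshold are all hypothetical; substitute whatever signal defines "recovered" for your service.

```python
import json
import urllib.parse
import urllib.request


def query_prometheus(base_url, promql, timeout=10):
    """Run an instant query against the Prometheus HTTP API and return the parsed JSON."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def max_sample(response):
    """Return the largest sample value in an instant-vector response."""
    if response.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {response}")
    data = response["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected an instant vector, got {data['resultType']}")
    if not data["result"]:
        raise ValueError("query returned no series")
    # Each series carries "value": [unix_timestamp, "stringified float"].
    return max(float(series["value"][1]) for series in data["result"])


def has_recovered(response, threshold):
    """True when every series in the response is at or below the threshold."""
    return max_sample(response) <= threshold


if __name__ == "__main__":
    # Hypothetical endpoint and query; run this after the experiment has been removed.
    resp = query_prometheus(
        "http://prometheus.monitoring:9090",
        'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{app="web"}[1m])) by (le))',
    )
    assert has_recovered(resp, threshold=0.25), "p99 latency did not return to baseline"
```

Treating an empty result as an error rather than as success is deliberate here: a missing metric after a chaos run usually means the target is still down, not that it has recovered.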
Q: How do I stop a rogue experiment?
A: kubectl delete networkchaos web-latency triggers automatic cleanup within seconds.