Chaos Mesh — Cloud-Native Chaos Engineering for Kubernetes
CNCF chaos-engineering platform that injects pod, network, IO, DNS, and kernel faults into Kubernetes clusters via CRDs.
What it is
Chaos Mesh is a CNCF incubating project that provides a chaos engineering platform for Kubernetes. It lets platform and SRE teams run controlled, reproducible failure experiments against live clusters by expressing faults as Kubernetes custom resources, whose schemas are defined by Custom Resource Definitions (CRDs). Because each experiment is an ordinary Kubernetes manifest, it can be versioned, scheduled, and gated in CI like any other Kubernetes resource.
Chaos Mesh is suited for DevOps engineers, SREs, and platform teams who need to validate resilience claims before incidents happen in production.
How it saves time or tokens
Without Chaos Mesh, teams write ad-hoc bash scripts or manually kill pods to test resilience. Chaos Mesh replaces that fragile approach with declarative manifests that can be applied, reverted, and automated in CI pipelines. A network-latency experiment that would take an hour to set up manually can be defined in a single YAML manifest and applied in seconds. The built-in Dashboard provides a visual workflow editor that further reduces setup time for complex multi-step game-day scenarios.
How to use
- Install Chaos Mesh via Helm into your cluster:
    helm repo add chaos-mesh https://charts.chaos-mesh.org
    helm install chaos-mesh chaos-mesh/chaos-mesh \
      --namespace=chaos-mesh --create-namespace --version 2.6.3
- Define a chaos experiment as a custom resource and save it as latency.yaml. For example, inject 500ms of network latency into a service:
    apiVersion: chaos-mesh.org/v1alpha1
    kind: NetworkChaos
    metadata:
      name: web-latency
      namespace: default
    spec:
      action: delay
      mode: all
      selector:
        labelSelectors:
          app: web
      delay:
        latency: '500ms'
        jitter: '50ms'
      duration: '2m'
- Apply the experiment with kubectl apply -f latency.yaml and observe your service behavior in the Chaos Mesh Dashboard or your existing monitoring stack.
Example
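A multi-step workflow that kills a database pod and then injects a network partition. The manifest itself contains only these two steps, so recovery is checked separately in the Dashboard or your monitoring stack: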
    apiVersion: chaos-mesh.org/v1alpha1
    kind: Workflow
    metadata:
      name: db-resilience-test
    spec:
      entry: serial-steps
      templates:
        - name: serial-steps
          templateType: Serial
          children:
            - kill-db
            - network-partition
        - name: kill-db
          templateType: PodChaos
          podChaos:
            action: pod-kill
            mode: one
            selector:
              labelSelectors:
                app: postgres
        - name: network-partition
          templateType: NetworkChaos
          networkChaos:
            action: partition
            mode: all
            selector:
              labelSelectors:
                app: web
            direction: both
            duration: '60s'
Related on TokRepo
- DevOps automation tools — Browse more infrastructure and operations tools
- Self-hosted tools — Explore self-hosted software for your infrastructure
Common pitfalls
- Running chaos experiments in production without proper blast-radius controls. Always use label selectors and namespace scoping to limit the impact.
- Forgetting to set a duration field, which can leave faults running indefinitely and cause real outages.
- Not integrating experiments into CI/CD. One-off manual chaos runs provide limited value compared to automated regression chaos tests.
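The first two pitfalls can be illustrated with a minimal sketch. The resource name, the staging namespace, and the app: web label are hypothetical placeholders, not values from a real cluster. The manifest scopes a pod-failure fault to one namespace, limits it to a fraction of the matching pods, and bounds it with a duration:

```yaml
# Hypothetical example: name, namespace, and labels are placeholders.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: web-pod-failure
  namespace: staging
spec:
  action: pod-failure
  # Affect only a fraction of the matching pods instead of all of them.
  mode: fixed-percent
  value: '25'
  selector:
    # Restrict targets to a single namespace and label set.
    namespaces:
      - staging
    labelSelectors:
      app: web
  # Bound the fault so it is reverted automatically when the time elapses.
  duration: '2m'
```

Starting with a small percentage in a non-production namespace and widening the scope later keeps the blast radius explicit in the manifest itself.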
Frequently Asked Questions
What types of faults can Chaos Mesh inject?
Chaos Mesh supports pod faults (kill, failure, container-kill), network chaos (latency, packet loss, partition, bandwidth throttle), IO faults (read/write latency, errors), DNS chaos, HTTP chaos, clock skew, and kernel-level faults via eBPF. Each fault category has its own CRD kind, such as PodChaos, NetworkChaos, or IOChaos.
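As a rough illustration of how a different fault category looks, here is a sketch of a clock-skew experiment using the TimeChaos kind. The resource name and the app: web label are assumptions made for this example, and the offset value is arbitrary:

```yaml
# Hypothetical example: name and labels are placeholders.
apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
  name: web-clock-skew
  namespace: default
spec:
  # Apply the fault to a single randomly chosen matching pod.
  mode: one
  selector:
    labelSelectors:
      app: web
  # Shift the clock seen by the target's processes ten minutes into the past.
  timeOffset: '-10m'
  # Revert automatically after one minute.
  duration: '1m'
```

The overall shape (mode, selector, duration) matches the NetworkChaos example earlier; only the kind and the fault-specific fields change.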
Does Chaos Mesh require kernel modifications?
Most fault types work without kernel modifications. Kernel-level chaos (like injecting syscall faults) uses eBPF and requires a Linux kernel 4.18 or later. Pod and network chaos work on any standard Kubernetes cluster.
Can chaos experiments be scheduled or automated?
Yes. Chaos Mesh provides a Schedule CRD that runs experiments on a cron-like schedule. You can also trigger experiments from CI pipelines by applying manifests with kubectl, making it easy to run chaos tests on every deployment.
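A minimal sketch of such a Schedule, assuming the same app: web target as the latency example. The resource name, the cron expression, and the policy values are illustrative choices, not recommendations:

```yaml
# Hypothetical example: name, cron expression, and labels are placeholders.
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: nightly-web-latency
  namespace: default
spec:
  # Standard cron syntax: run at 02:00 every day.
  schedule: '0 2 * * *'
  # Keep only the last few finished experiment objects.
  historyLimit: 3
  # Skip a run if the previous experiment is still active.
  concurrencyPolicy: Forbid
  # The kind of experiment to create on each run, followed by its spec.
  type: NetworkChaos
  networkChaos:
    action: delay
    mode: all
    selector:
      labelSelectors:
        app: web
    delay:
      latency: '500ms'
      jitter: '50ms'
    duration: '2m'
```

Keeping the fault duration well below the schedule interval avoids overlapping runs even if the concurrency policy is later relaxed.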
How does Chaos Mesh compare to Litmus?
Both are CNCF chaos engineering projects for Kubernetes. Chaos Mesh uses CRDs natively and includes a built-in Dashboard with a visual workflow editor. Litmus uses a hub-based model with pre-built experiment charts. The choice depends on whether you prefer CRD-native workflows or a marketplace of pre-built experiments.
Is Chaos Mesh safe to use in production?
Chaos Mesh includes safety mechanisms: namespace-scoped permissions, label selectors for targeting, duration fields that bound how long a fault runs, and RBAC integration. However, any chaos tool can cause real impact if misconfigured. Start in staging environments and gradually expand to production with proper guardrails.
Citations
- Chaos Mesh GitHub: Chaos Mesh is a CNCF incubating project
- Chaos Mesh Documentation: CRD-based fault injection for Kubernetes
- CNCF Landscape: CNCF Landscape listing for Chaos Mesh