Scripts · April 15, 2026 · 1 min read

Chaos Mesh — Cloud-Native Chaos Engineering for Kubernetes

CNCF chaos-engineering platform that injects pod, network, IO, DNS, and kernel faults into Kubernetes clusters via CRDs.

Introduction

Chaos Mesh is a CNCF incubating project that lets platform teams run controlled, reproducible failure experiments against a live Kubernetes cluster. Because faults are expressed as custom resources (CRDs), teams can version, schedule, and gate experiments in CI the same way they manage any other Kubernetes resource, so resilience claims can be tested rather than merely asserted.

What Chaos Mesh Does

  • Injects pod failures (kill, container-kill, pod-failure)
  • Simulates network chaos: latency, packet loss, bandwidth throttle, partition
  • Injects disk IO faults: read/write latency, errors, fill-up
  • Injects clock skew, DNS chaos, HTTP chaos, and kernel chaos (via BPF)
  • Orchestrates multi-step Workflows for complex game-day scenarios
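
As an illustration of what one of these faults looks like as a resource, the following minimal sketch builds a NetworkChaos manifest in Python and prints it as JSON. The field names follow the chaos-mesh.org/v1alpha1 schema as documented upstream; the staging namespace, the app=web label, and the latency and duration values are assumptions made up for this example.

```python
import json


def network_delay_manifest(name, namespace, app_label, latency="100ms", duration="60s"):
    """Build a NetworkChaos custom resource that delays traffic for one matching pod."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "NetworkChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "delay",   # one of the network fault types
            "mode": "one",       # affect a single pod out of all matches
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": {"app": app_label},
            },
            "delay": {"latency": latency},
            "duration": duration,  # the fault is reverted after this period
        },
    }


if __name__ == "__main__":
    # Hypothetical target: pods labelled app=web in a "staging" namespace.
    manifest = network_delay_manifest("web-latency", "staging", "web")
    print(json.dumps(manifest, indent=2))
```

Since kubectl accepts JSON as well as YAML, the output can be piped into kubectl apply -f - once the values have been checked against the target cluster.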

Architecture Overview

Chaos Mesh ships a controller manager, a per-node chaos-daemon DaemonSet (which uses nsenter/iptables/tc/BPF for kernel-level injection), and a React dashboard. CRDs declare the experiment; the controller resolves target pods, instructs the daemons, and records status transitions. Experiments are cleaned up automatically at duration end or when the CR is deleted.
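
The "resolves target pods" step can be pictured with a toy model. The sketch below is not Chaos Mesh's code: it is a simplified, hypothetical select_targets function that filters in-memory pod records by namespace and labels and then applies the all, one, or fixed selection mode. Real experiments support further modes and selector fields that are left out here.

```python
import random


def select_targets(pods, namespaces, label_selectors, mode="all", value=None, rng=None):
    """Toy model of target resolution: filter pods, then apply a selection mode."""
    rng = rng or random.Random()
    matching = [
        p for p in pods
        if p["namespace"] in namespaces
        and all(p["labels"].get(k) == v for k, v in label_selectors.items())
    ]
    if mode == "all":
        return matching
    if mode == "one":
        # Pick a single random victim, or nothing if no pod matched.
        return [rng.choice(matching)] if matching else []
    if mode == "fixed":
        if value is None:
            raise ValueError("mode 'fixed' needs a value")
        # Never ask for more pods than actually matched.
        return rng.sample(matching, min(int(value), len(matching)))
    raise ValueError(f"mode not covered by this sketch: {mode}")


if __name__ == "__main__":
    # Hypothetical pod inventory for illustration only.
    pods = [
        {"name": "web-0", "namespace": "staging", "labels": {"app": "web"}},
        {"name": "web-1", "namespace": "staging", "labels": {"app": "web"}},
        {"name": "db-0", "namespace": "staging", "labels": {"app": "db"}},
    ]
    chosen = select_targets(pods, ["staging"], {"app": "web"}, mode="one")
    print([p["name"] for p in chosen])
```

Passing a seeded random.Random as rng makes the choice repeatable, which is convenient when reasoning about blast radius in a test.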

Self-Hosting & Configuration

  • Helm chart or Operator — chaos-mesh and chaos-daemon run cluster-wide
  • Namespace scoping: restrict injection to namespaces annotated with chaos-mesh.org/inject=enabled (applies when the controller's namespace filter is turned on)
  • Dashboard with Google/GitHub/OIDC SSO; chaosctl CLI for scripting
  • Integrations: Argo Workflows, GitHub Actions, Litmus via CRD
  • Observability: Prometheus metrics, experiment events shipped to OpenTelemetry

Key Features

  • Pure CRD interface — GitOps and code review friendly
  • Rich fault taxonomy (pod, net, IO, DNS, HTTP, kernel, time)
  • Schedule + Workflow resources for recurring and multi-step drills
  • Safety switches: dry-run, blast-radius labels, auto-recovery on CR deletion
  • CNCF incubating project with active PingCAP + community maintainers
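
To make the Schedule resource concrete, here is a minimal sketch that wraps an existing experiment manifest in a recurring Schedule. The Schedule field names (schedule, historyLimit, concurrencyPolicy, type, and the embedded spec key) follow the upstream v1alpha1 documentation; the helper wrap_in_schedule, the cron expression, and the sample pod-kill experiment are hypothetical. Only the two kinds whose embedded key is spelled out in the mapping are handled.

```python
import copy
import json

# Embedded spec key per experiment kind; only the kinds used in this sketch.
_SPEC_KEYS = {"PodChaos": "podChaos", "NetworkChaos": "networkChaos"}


def wrap_in_schedule(chaos, name, cron, history_limit=5, concurrency_policy="Forbid"):
    """Wrap a chaos experiment manifest in a Schedule that runs it on a cron."""
    kind = chaos["kind"]
    if kind not in _SPEC_KEYS:
        raise ValueError(f"kind not covered by this sketch: {kind}")
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "Schedule",
        "metadata": {"name": name, "namespace": chaos["metadata"]["namespace"]},
        "spec": {
            "schedule": cron,
            "historyLimit": history_limit,
            "concurrencyPolicy": concurrency_policy,  # skip a run if one is active
            "type": kind,
            # Deep copy so later edits to the Schedule do not mutate the input.
            _SPEC_KEYS[kind]: copy.deepcopy(chaos["spec"]),
        },
    }


if __name__ == "__main__":
    # Hypothetical experiment: kill one app=web pod in "staging".
    pod_kill = {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": "web-pod-kill", "namespace": "staging"},
        "spec": {
            "action": "pod-kill",
            "mode": "one",
            "selector": {"namespaces": ["staging"], "labelSelectors": {"app": "web"}},
        },
    }
    # Every Monday at 09:00, as an example drill cadence.
    print(json.dumps(wrap_in_schedule(pod_kill, "weekly-web-pod-kill", "0 9 * * 1"), indent=2))
```

Keeping the generator in the repository alongside the manifests it produces lets the recurring drill go through the same code review as any other change.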

Comparison with Similar Tools

  • LitmusChaos: similar CNCF project; experiment-hub workflow, different CR model
  • Gremlin: commercial SaaS; richer UI, paid per-target
  • Chaos Monkey (Netflix): the original tool; EC2-only, limited to instance termination
  • AWS Fault Injection Service (formerly Fault Injection Simulator): AWS-native; tightly coupled to AWS APIs
  • PowerfulSeal: older, less active; mostly pod-kill scope

FAQ

Q: Is it safe for production? A: With proper namespace selectors, blast-radius labels, and approvals, many teams run Chaos Mesh in prod game-days. Start in staging.

Q: Does it need privileged pods? A: Yes. chaos-daemon runs with elevated privileges (privileged mode or extra Linux capabilities) so it can enter target containers' namespaces and apply iptables/tc rules.

Q: Can I run experiments in CI? A: Yes. chaosctl or raw kubectl in GitHub Actions; assert recovery via Prometheus queries.
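
One way to sketch the "assert recovery" step in a CI job is a small Python check against the Prometheus HTTP API. The /api/v1/query endpoint and its JSON response shape are standard Prometheus. The helper names, the Prometheus URL, the PromQL expression, and the 1% threshold are all assumptions for illustration.

```python
import json
import sys
import urllib.parse
import urllib.request


def fetch_instant_query(base_url, promql, timeout=10):
    """Run an instant query against the Prometheus HTTP API and return the JSON body."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def recovered(response, threshold):
    """True only if the query succeeded and every vector sample is at or below threshold."""
    if response.get("status") != "success":
        return False
    data = response.get("data", {})
    if data.get("resultType") != "vector":
        return False
    samples = data.get("result", [])
    if not samples:
        return False  # no data is treated as "not proven recovered"
    # Each sample's "value" is [timestamp, "<number as string>"].
    return all(float(s["value"][1]) <= threshold for s in samples)


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage (hypothetical): python check_recovery.py http://prometheus.monitoring:9090
    query = 'sum(rate(http_requests_total{code=~"5.."}[1m])) / sum(rate(http_requests_total[1m]))'
    ok = recovered(fetch_instant_query(sys.argv[1], query), threshold=0.01)
    sys.exit(0 if ok else 1)
```

Treating an empty result as a failure is a deliberate choice in this sketch: a missing metric should not let the pipeline pass silently.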

Q: How do I stop a rogue experiment? A: kubectl delete networkchaos web-latency triggers automatic cleanup within seconds.
