Production Incident Response Pack
Ten picks for the on-call engineer in the middle of a prod fire. Paging, log and trace search via MCP, alert routing, runbook automation, status page, postmortem agent. Oncall-Guide + Devops Incident Responder + PagerDuty Responder + SigNoz MCP + Monoscope + Graylog + Alertmanager + Rundeck + OpenStatus + Incident Responder. Install them in this order so the next alert meets a system, not a person.
What's in this pack
It's 2:47 AM. PagerDuty just woke you. The error budget is gone in fourteen minutes. This pack is the rig you wish you'd installed last quarter — not a 50-tool observability shopping list, but the ten things the engineer mid-fire actually reaches for, in the order an incident actually unfolds.
Every pick here is open-source or has an OSS core, runs in your own infra, and earns its keystroke during the worst ten minutes of your week. The order is not alphabetical — it tracks the lifecycle: page in → triage → search → execute → communicate → write it up.
Install in this order
- Oncall-Guide — Incident Response Subagent — start here. Drop-in Claude Code subagent that walks the on-call checklist autonomously (deploy correlation, error spike triage, rollback decision). Inspired by Boris Cherny's oncall playbook. This is the brain the rest of the tools plug into.
- Claude Code Agent: Devops Incident Responder — the triage agent that runs the first 90 seconds: pulls recent deploys, checks dashboards, flags suspect commits. Bind it to a slash command in your editor so triage starts the moment you acknowledge the page, which is where most of the MTTA savings come from.
- Claude Code Agent: PagerDuty Incident Responder — wires the agent into PagerDuty itself. Acknowledges, escalates, posts updates to the incident channel. Removes the "is anyone looking at this?" Slack noise that eats the first five minutes.
- SigNoz MCP Server — Query Traces, Logs & Alerts — gives your agent a single MCP tool to grep distributed traces and logs side-by-side. When the agent says "the p99 latency spike correlates with deploy abc123 on cart-service", this is the data source.
- Monoscope — LLM Query for Logs/Traces/Metrics — natural-language log search across stacks. "Show me 5xx for /checkout in the last 15 minutes from the new pod" becomes one query instead of three Kibana dashboards. The agent uses it; humans use it when the agent is wrong.
- Graylog — Centralized Log Management — the log substrate if you don't already have one. SigNoz and Monoscope read from it; runbooks dump to it; the postmortem agent quotes from it. Self-hosted, no per-GB pricing trap.
- Prometheus Alertmanager — Alert Routing and Notification Hub — the routing brain that decides who gets paged, when alerts silence, and how to group flapping signals. Tune this before adding more dashboards. Most pager fatigue is an Alertmanager config problem, not a dashboard problem.
- Rundeck — Open Source Runbook Automation — the place runbooks become buttons. "Restart the worker pool", "flush the cache", "rotate the read replica" are jobs the on-call clicks instead of remembering. The agent can trigger them with permission gates.
- OpenStatus — Open-Source Monitoring and Status Page — public-facing status page, auto-updated from the same alerts. Saves the on-call from also being the comms lead. Customers see a yellow banner before they tweet at you.
- Claude Code Agent: Incident Responder — the postmortem-writing agent. Once mitigation is in, it pulls the Slack channel, PagerDuty timeline, deploy history, and SigNoz queries into a five-whys draft that you edit instead of writing from scratch. Same agent type as the first pick (Oncall-Guide), different prompt.
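The "group flapping signals" job that Alertmanager does in the list above is easier to tune once you can picture it. Here is a minimal sketch of the idea in Python, not Alertmanager's actual implementation: alerts that share the same values for a chosen set of labels collapse into one notification. The `group_alerts` helper, the label names, and the sample alerts are all made up for illustration.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the labels named in group_by.

    Illustrative only: this mimics the idea of grouping by label set,
    it is not how Alertmanager is implemented.
    """
    groups = defaultdict(list)
    for alert in alerts:
        labels = alert["labels"]
        # A missing label is treated as an empty string so such alerts still batch together.
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical alerts: two pods of the same service fire the same alert.
alerts = [
    {"labels": {"alertname": "HighLatency", "service": "cart-service", "pod": "cart-1"}},
    {"labels": {"alertname": "HighLatency", "service": "cart-service", "pod": "cart-2"}},
    {"labels": {"alertname": "HighErrorRate", "service": "checkout", "pod": "checkout-1"}},
]

grouped = group_alerts(alerts, group_by=["alertname", "service"])
for key, members in grouped.items():
    print(key, len(members))
```

Grouping on `alertname` and `service` turns three firing alerts into two pages. Adding `pod` to the list would bring it back to three, which is the kind of choice that decides how loud a bad night gets.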
How they fit together
PagerDuty page
│
▼
PagerDuty Responder agent ──── ack + first triage post
│
▼
Devops Incident Responder ──── pulls deploys, dashboards, suspect commits
│
├──► SigNoz MCP ──► traces + log correlation
├──► Monoscope ──► natural-language log queries
└──► Graylog ──► raw log substrate
│
▼
Alertmanager ──── silence flapping signals, regroup
│
▼
Rundeck ──── execute runbook (restart / flush / failover)
│
▼
OpenStatus ──── public status page auto-updates
│
▼
Incident Responder agent ──── postmortem draft (five whys + timeline)
The loop closes when the postmortem agent finds the action item that, had it shipped last week, would have prevented the page. File the ticket. Sleep.
Tradeoffs you'll hit
- SigNoz vs Datadog — Datadog is the polished SaaS incumbent. SigNoz is the OSS bet you make when your bill goes from $4K to $40K/month and someone asks why. The MCP server is the bridge that makes either workable from an agent.
- Monoscope vs grep + jq — for a 3-engineer team, grep + jq is fine. Past 50 services, you want natural-language search because no one remembers every service's log schema at 3 AM.
- Rundeck vs raw shell scripts in a repo — raw scripts work until the on-call who wrote them is on PTO. Rundeck adds auth, audit log, and a "click to run" UI your future self will thank you for.
- One postmortem agent vs writing it yourself — the agent's first draft gets you roughly 70% of the way. The 30% the human adds (context, intent, blameless framing) is what makes the doc useful. Don't ship the agent's draft unedited.
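For a sense of what the "grep + jq" end of that tradeoff looks like, here is a rough Python equivalent of the "5xx for /checkout in the last 15 minutes" query. It assumes newline-delimited JSON logs with `ts` (ISO 8601 with a UTC offset), `path`, and `status` fields. Those field names and the `recent_5xx` helper are assumptions for the sketch, and remembering them per service is exactly the part that stops scaling.

```python
import json
from datetime import datetime, timedelta, timezone


def recent_5xx(lines, path, now, window=timedelta(minutes=15)):
    """Return parsed records for `path` with a 5xx status inside the window."""
    cutoff = now - window
    hits = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(record, dict) or record.get("path") != path:
            continue
        if not 500 <= record.get("status", 0) < 600:
            continue
        # Assumes every matching record carries an ISO 8601 timestamp in "ts".
        if datetime.fromisoformat(record["ts"]) >= cutoff:
            hits.append(record)
    return hits


# Hypothetical log lines; only the first one matches all three conditions.
now = datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc)
lines = [
    '{"ts": "2024-01-01T02:50:00+00:00", "path": "/checkout", "status": 502}',
    '{"ts": "2024-01-01T02:30:00+00:00", "path": "/checkout", "status": 500}',
    '{"ts": "2024-01-01T02:55:00+00:00", "path": "/checkout", "status": 200}',
    '{"ts": "2024-01-01T02:58:00+00:00", "path": "/cart", "status": 503}',
    "not json",
]

hits = recent_5xx(lines, "/checkout", now)
print(len(hits), [h["status"] for h in hits])
```

This works well while every service logs the same three fields. Once schemas drift, each service needs its own variant of this function.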
Common pitfalls
- Wiring the triage agent without rate limits — first outage, the agent fires 200 SigNoz queries in 30 seconds and adds load to the system on fire. Set query budgets per incident.
- Skipping Alertmanager grouping rules — without grouping, one upstream blip pages five teams. The Alertmanager group_by config is the difference between "useful page" and "on-call burns out in six weeks".
- Status page lying because OpenStatus uses the same monitoring that's down — host the status page on independent infra. Different cloud, different DNS provider, different paging.
- Postmortem-by-LLM with no human edit — the postmortem is the artifact that changes culture. An unedited LLM draft erodes trust in the practice. Always have a human in the loop on the final doc.
- Runbooks in a wiki nobody reads — Rundeck only earns its keep if the runbooks are linked from the alerts. The Alertmanager → Rundeck link is the load-bearing wire.
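One way to read "set query budgets per incident" from the first pitfall above is a small sliding-window limiter that the agent consults before each query. This is a sketch under stated assumptions, not something any tool in this pack ships: the `QueryBudget` class, the incident IDs, and the limits are hypothetical. The clock is injectable so the behaviour can be checked without waiting.

```python
import time


class QueryBudget:
    """Cap how many queries an agent may issue per incident in a sliding window."""

    def __init__(self, max_queries, window_seconds, clock=time.monotonic):
        self.max_queries = max_queries
        self.window_seconds = window_seconds
        self.clock = clock
        self._timestamps = {}  # incident_id -> times of recent allowed queries

    def allow(self, incident_id):
        """Return True and record the query if the incident is under budget."""
        now = self.clock()
        # Keep only the queries that still fall inside the window.
        recent = [
            t for t in self._timestamps.get(incident_id, [])
            if now - t < self.window_seconds
        ]
        allowed = len(recent) < self.max_queries
        if allowed:
            recent.append(now)
        self._timestamps[incident_id] = recent
        return allowed


# Fake clock so the example is deterministic.
fake_now = [0.0]
budget = QueryBudget(max_queries=3, window_seconds=30, clock=lambda: fake_now[0])

results = [budget.allow("INC-1") for _ in range(5)]
fake_now[0] = 31.0  # the window has passed, so the budget frees up
later = budget.allow("INC-1")
print(results, later)
```

When `allow` returns False, the agent should back off or hand the question to a human rather than retry in a loop. The point is to keep the agent from adding load to the system that is already on fire.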
Frequently asked questions
How long does it take to install this rig end-to-end?
Plan for a one-day spike to get the agents wired (Oncall-Guide + Devops Responder + PagerDuty Responder + Incident Responder), plus a week of background work to install the data substrate (Graylog + SigNoz + Alertmanager) if you don't already have it. Rundeck and OpenStatus each take an afternoon. The agents pay back in the first incident; the substrate pays back in the second.
Do I need all four Claude Code agents, or is one enough?
Three are load-bearing: Oncall-Guide (the playbook brain), Devops Incident Responder (first-90-seconds triage), and Incident Responder (postmortem writer). PagerDuty Responder is optional if you already have a tight PagerDuty workflow you don't want disrupted. The agents share context patterns but solve different lifecycle stages, so collapsing them into one mega-agent costs you specificity.
Why SigNoz MCP and Monoscope — aren't they overlapping?
SigNoz MCP gives the agent a structured query interface to traces and logs together (for example, correlating a slow trace with its log lines). Monoscope is for humans typing natural language when the agent missed something. Different audience, different ergonomics. If your team is small and your stack is simple, you can ship with just SigNoz MCP and add Monoscope later.
Can I self-host all of this, or do some pieces need SaaS?
Every tool in this pack except PagerDuty has a fully self-hostable mode. PagerDuty itself is SaaS (the responder agent wraps the PagerDuty API); if you want OSS paging too, swap in OneUptime or Grafana OnCall, both of which are in the broader incident-response catalog. The other nine tools run on a laptop or a single VM for testing.
What's the minimum viable subset if I can only install three things this sprint?
Oncall-Guide + Devops Incident Responder + Alertmanager. The first two cut MTTA on the next incident; Alertmanager cuts the pager-fatigue tax that erodes everything else. Add SigNoz MCP next sprint, then Rundeck, then OpenStatus. Postmortem agent goes last because it only matters after you've had a postmortem-worthy incident.