TOKREPO · AI Arsenal

Stable

Production Incident Response Pack

Ten picks for the on-call engineer in the middle of a prod fire. Paging, log and trace search via MCP, alert routing, runbook automation, status page, postmortem agent. Oncall-Guide + Devops Incident Responder + PagerDuty Responder + SigNoz MCP + Monoscope + Graylog + Alertmanager + Rundeck + OpenStatus + Incident Responder. Install them in this order so the next alert meets a system, not a person.

10 resources

About this pack

What's in this pack

It's 2:47 AM. PagerDuty just woke you. The error budget is gone in fourteen minutes. This pack is the rig you wish you'd installed last quarter — not a 50-tool observability shopping list, but the ten things the engineer mid-fire actually reaches for, in the order an incident actually unfolds.

Every pick here is open-source or has an OSS core, runs in your own infra, and earns its keystroke during the worst ten minutes of your week. The order is not alphabetical — it tracks the lifecycle: page in → triage → search → execute → communicate → write it up.

Install in this order

Oncall-Guide — Incident Response Subagent — start here. Drop-in Claude Code subagent that walks the on-call checklist autonomously (deploy correlation, error spike triage, rollback decision). Inspired by Boris Cherny's oncall playbook. This is the brain the rest of the tools plug into.
Claude Code Agent: Devops Incident Responder — the triage agent that runs the first 90 seconds: pulls recent deploys, checks dashboards, flags suspect commits. Bind it to a slash command in your editor and you've cut MTTA in half.
Claude Code Agent: PagerDuty Incident Responder — wires the agent into PagerDuty itself. Acknowledges, escalates, posts updates to the incident channel. Removes the "is anyone looking at this?" Slack noise that eats the first five minutes.
SigNoz MCP Server — Query Traces, Logs & Alerts — gives your agent a single MCP tool to grep distributed traces and logs side-by-side. When the agent says "the p99 latency spike correlates with deploy abc123 on cart-service", this is the data source.
Monoscope — LLM Query for Logs/Traces/Metrics — natural-language log search across stacks. "Show me 5xx for /checkout in the last 15 minutes from the new pod" becomes one query instead of three Kibana dashboards. The agent uses it; humans use it when the agent is wrong.
Graylog — Centralized Log Management — the log substrate if you don't already have one. SigNoz and Monoscope read from it; runbooks dump to it; the postmortem agent quotes from it. Self-hosted, no per-GB pricing trap.
Prometheus Alertmanager — Alert Routing and Notification Hub — the routing brain that decides who gets paged, when alerts silence, and how to group flapping signals. Tune this before adding more dashboards. Most pager fatigue is an Alertmanager config problem, not a dashboard problem.
Rundeck — Open Source Runbook Automation — the place runbooks become buttons. "Restart the worker pool", "flush the cache", "rotate the read replica" are jobs the on-call clicks instead of remembering. The agent can trigger them with permission gates.
OpenStatus — Open-Source Monitoring and Status Page — public-facing status page, auto-updated from the same alerts. Saves the on-call from also being the comms lead. Customers see a yellow banner before they tweet at you.
Claude Code Agent: Incident Responder — the postmortem-writing agent. Once mitigation is in, it scrapes the Slack channel, PagerDuty timeline, deploy history, and SigNoz queries into a five-whys draft you edit instead of write. Same agent type as #1, different prompt.
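
The grouping behavior mentioned for Alertmanager in item 7 can be illustrated with a small sketch. This is not Alertmanager code and not its configuration format; it only mimics the idea behind a `group_by` setting, where alerts that share the same values for the chosen labels collapse into one notification. The label names (`alertname`, `cluster`, `instance`) and the sample alerts are assumptions made up for the example.

```python
from collections import defaultdict

def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the chosen labels (a missing label counts as "")."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)

# Hypothetical alerts: three latency alerts across two clusters, one disk alert.
alerts = [
    {"alertname": "HighLatency", "cluster": "prod-eu", "instance": "cart-1"},
    {"alertname": "HighLatency", "cluster": "prod-eu", "instance": "cart-2"},
    {"alertname": "HighLatency", "cluster": "prod-us", "instance": "cart-7"},
    {"alertname": "DiskFull", "cluster": "prod-eu", "instance": "db-1"},
]

grouped = group_alerts(alerts, group_by=["alertname", "cluster"])
for key, members in grouped.items():
    print(key, "->", len(members), "alert(s), one notification")
```

Grouping on `alertname` and `cluster` turns four alerts into three notifications here. Which labels to group on depends on how your teams and services are laid out, so treat the choice above as a placeholder.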

How they fit together

PagerDuty page
   │
   ▼
PagerDuty Responder agent  ──── ack + first triage post
   │
   ▼
Devops Incident Responder  ──── pulls deploys, dashboards, suspect commits
   │
   ├──► SigNoz MCP   ──► traces + log correlation
   ├──► Monoscope    ──► natural-language log queries
   └──► Graylog      ──► raw log substrate
   │
   ▼
Alertmanager  ──── silence flapping signals, regroup
   │
   ▼
Rundeck  ──── execute runbook (restart / flush / failover)
   │
   ▼
OpenStatus  ──── public status page auto-updates
   │
   ▼
Incident Responder agent  ──── postmortem draft (five whys + timeline)

The loop closes when the postmortem agent finds the action item that, had it shipped last week, would have prevented the page. File the ticket. Sleep.
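
The "pulls deploys, suspect commits" step in the diagram above amounts to a time-window correlation. A minimal sketch, assuming deploy records are plain dictionaries with a timestamp; the record shape, the 30-minute lookback, and the dates are hypothetical and not taken from any of the agents in this pack.

```python
from datetime import datetime, timedelta

def suspect_deploys(deploys, spike_start, lookback_minutes=30):
    """Return deploys that landed inside the lookback window before the spike, newest first."""
    window_start = spike_start - timedelta(minutes=lookback_minutes)
    hits = [d for d in deploys if window_start <= d["at"] <= spike_start]
    return sorted(hits, key=lambda d: d["at"], reverse=True)

# Hypothetical data: an error spike at 02:47 and three earlier deploys.
spike = datetime(2024, 1, 1, 2, 47)
deploys = [
    {"service": "cart-service", "commit": "abc123", "at": datetime(2024, 1, 1, 2, 40)},
    {"service": "search", "commit": "def456", "at": datetime(2024, 1, 1, 1, 5)},
    {"service": "checkout", "commit": "789aaa", "at": datetime(2024, 1, 1, 2, 25)},
]

suspects = suspect_deploys(deploys, spike)
for d in suspects:
    print(d["service"], d["commit"], d["at"].isoformat())
```

Sorting newest first reflects a common heuristic that the most recent change is the first one to check; it is a starting point for triage, not proof of cause.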

Tradeoffs you'll hit

SigNoz vs Datadog — Datadog is the polished SaaS incumbent. SigNoz is the OSS bet you make when your bill goes from $4K to $40K/month and someone asks why. The MCP server is the bridge that makes either workable from an agent.
Monoscope vs grep + jq — for a 3-engineer team, grep + jq is fine. Past 50 services, you want natural-language search because no one remembers every service's log schema at 3 AM.
Rundeck vs raw shell scripts in a repo — raw scripts work until the on-call who wrote them is on PTO. Rundeck adds auth, audit log, and a "click to run" UI your future self will thank you for.
One postmortem agent vs writing it yourself — the agent's first draft is 70%. The 30% the human adds (context, intent, blameless framing) is what makes the doc useful. Don't ship the agent's draft unedited.
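
What "auth and audit log" buy you over raw scripts can be sketched in a few lines. The code below is not Rundeck's API or ACL format; it is a hypothetical gate showing the two properties the Rundeck tradeoff relies on: a job runs only for an allowed role, and every attempt is recorded, including refusals. The job names echo the examples earlier in this pack, while the roles and user names are invented.

```python
from datetime import datetime, timezone

# Hypothetical policy: which roles may run which runbook jobs.
ALLOWED = {
    "restart-worker-pool": {"oncall", "sre"},
    "flush-cache": {"oncall", "sre"},
    "rotate-read-replica": {"sre"},
}

audit_log = []

def run_job(job, user, role, action):
    """Run `action` only if `role` is allowed for `job`; record every attempt."""
    allowed = role in ALLOWED.get(job, set())
    audit_log.append({
        "at": datetime.now(timezone.utc).isoformat(),
        "job": job,
        "user": user,
        "role": role,
        "allowed": allowed,
    })
    if not allowed:
        raise PermissionError(f"{user} ({role}) may not run {job}")
    return action()

# An on-call engineer flushing the cache is permitted and logged.
result = run_job("flush-cache", "dana", "oncall", lambda: "cache flushed")
print(result, "| audit entries:", len(audit_log))
```

Writing the audit entry before the permission check raises means denied attempts leave a trace too, which is usually what you want when an agent is one of the callers.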

Common pitfalls

Wiring the triage agent without rate limits — first outage, the agent fires 200 SigNoz queries in 30 seconds and adds load to the system on fire. Set query budgets per incident.
Skipping Alertmanager grouping rules — without grouping, one upstream blip pages five teams. The Alertmanager group_by config is the difference between "useful page" and "on-call burns out in six weeks".
Status page lying because OpenStatus uses the same monitoring that's down — host the status page on independent infra. Different cloud, different DNS provider, different paging.
Postmortem-by-LLM with no human edit — the postmortem is the artifact that changes culture. An unedited LLM draft erodes trust in the practice. Always have a human in the loop on the final doc.
Runbooks in a wiki nobody reads — Rundeck only earns its keep if the runbooks are linked from the alerts. The Alertmanager → Rundeck link is the load-bearing wire.
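
The query budget from the first pitfall can be as simple as a per-incident counter. A minimal sketch, assuming the agent calls a check before every backend query; the `QueryBudget` class, the cap of 20, and the incident IDs are hypothetical and would need tuning for your setup.

```python
class QueryBudget:
    """Cap how many backend queries a single incident may issue."""

    def __init__(self, max_queries):
        self.max_queries = max_queries
        self.used = {}  # incident_id -> queries spent so far

    def try_spend(self, incident_id):
        """Count the query and return True if budget remains, else return False."""
        spent = self.used.get(incident_id, 0)
        if spent >= self.max_queries:
            return False
        self.used[incident_id] = spent + 1
        return True

budget = QueryBudget(max_queries=20)  # 20 is an arbitrary example cap

# Simulate an agent attempting 200 queries during one incident.
allowed = sum(budget.try_spend("INC-1") for _ in range(200))
print(f"{allowed} of 200 queries allowed")  # prints "20 of 200 queries allowed"
```

A hard cap is the bluntest option. A time-based limiter such as a token bucket would let the agent keep querying slowly over a long incident, at the cost of a little more code.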

INSTALL · ONE COMMAND

$ tokrepo install pack/production-incident-response

hand it to your agent, or paste it into your terminal

What's included

10 resources ready to install

Skill#01

oncall-guide — Incident Response Subagent

Open-source Claude Code subagent for incident response — walks the oncall checklist autonomously: deploys, errors, rollback. Inspired by Boris Cherny.

by Skill Factory·311 views

$ tokrepo install oncall-guide-incident-response-subagent-1a6b17c7

Skill#02

Claude Code Agent: Devops Incident Responder

Use when actively responding to production incidents, diagnosing critical service failures, or conducting incident postmortems to implement permanent fixes and preventative...

by TokRepo Picks·175 views

$ tokrepo install claude-code-agent-devops-incident-responder-e30c19c4

Skill#03

Claude Code Agent: PagerDuty Incident Responder

Responds to PagerDuty incidents by analyzing incident context, identifying recent code changes, and suggesting fixes via GitHub PRs.

by TokRepo Picks·110 views

$ tokrepo install claude-code-agent-pagerduty-incident-responder-d3f997e8

MCP#04

SigNoz MCP Server — Query Traces, Logs & Alerts

SigNoz MCP Server connects MCP clients to your SigNoz instance: query traces/logs, inspect alerts, and automate observability workflows using an API key.

by MCP Hub·262 views

$ tokrepo install signoz-mcp-server-query-traces-logs-alerts

Skill#05

Monoscope — LLM Query for Logs/Traces/Metrics

Monoscope stores logs/traces/metrics in S3-compatible buckets and lets you explore them with natural-language queries plus a CLI and self-hosted UI.

by Script Depot·177 views

$ tokrepo install monoscope-llm-query-for-logs-traces-metrics

Skill#06

Graylog — Centralized Log Management and Analysis Platform

Collect, index, and analyze log data from any source with a powerful search engine, real-time alerting, and customizable dashboards built for operations teams.

by AI Open Source·216 views

$ tokrepo install graylog-centralized-log-management-analysis-platform-68045e07

Skill#07

Prometheus Alertmanager — Alert Routing and Notification Hub

Alertmanager handles alerts sent by Prometheus, deduplicating, grouping, and routing them to the right notification channel such as email, Slack, PagerDuty, or webhooks.

by Script Depot·221 views

$ tokrepo install prometheus-alertmanager-alert-routing-notification-hub-51f92d7e

Skill#08

Rundeck — Open Source Runbook Automation and Job Scheduler

Automate operations tasks with Rundeck. Define runbooks as jobs with steps, schedule them, delegate execution to teams via self-service, and audit every action with built-in logging.

by AI Open Source·193 views

$ tokrepo install rundeck-open-source-runbook-automation-job-scheduler-d1bf0e61

Skill#09

OpenStatus — Open-Source Monitoring and Status Page Platform

OpenStatus is an open-source uptime monitoring and status page platform that checks endpoints from multiple regions, tracks latency and availability, and serves beautiful public status pages for your services.

by Script Depot·188 views

$ tokrepo install openstatus-open-source-monitoring-status-page-platform-ef13d2c6

Skill#10

Claude Code Agent: Incident Responder

Handles production incidents with urgency and precision. Use IMMEDIATELY when production issues occur. Coordinates debugging, implements fixes, and documents post-mortems.

by TokRepo Picks·108 views

$ tokrepo install claude-code-agent-incident-responder-ee743381

Frequently asked questions

How long does it take to install this rig end-to-end?

Plan for a one-day spike to get the agents wired (Oncall-Guide + Devops Responder + PagerDuty Responder + Incident Responder), plus a week of background work to install the data substrate (Graylog + SigNoz + Alertmanager) if you don't already have it. Rundeck and OpenStatus each take an afternoon. The agents pay back in the first incident; the substrate pays back in the second.

Do I need all four Claude Code agents, or is one enough?

Three are load-bearing: Oncall-Guide (the playbook brain), Devops Incident Responder (first-90-seconds triage), and Incident Responder (postmortem writer). PagerDuty Responder is optional if you already have a tight PagerDuty workflow you don't want disrupted. The agents share context patterns but solve different lifecycle stages, so collapsing them into one mega-agent costs you specificity.

Why SigNoz MCP and Monoscope — aren't they overlapping?

SigNoz MCP gives the agent a structured query interface to traces and logs together (correlate a slow trace to its log lines). Monoscope is for humans typing natural language when the agent missed it. Different audience, different ergonomics. If your team is small and stack is simple, you can ship with just SigNoz MCP and add Monoscope later.
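
"Correlate a slow trace to its log lines" boils down to a join on the trace ID. The sketch below works on in-memory sample data and does not call SigNoz or its MCP server; the field names (`trace_id`, `duration_ms`, `message`) and the 1000 ms threshold are assumptions for illustration.

```python
def slow_traces_with_logs(traces, logs, threshold_ms):
    """Return {trace_id: [log messages]} for traces slower than threshold_ms."""
    logs_by_trace = {}
    for entry in logs:
        logs_by_trace.setdefault(entry["trace_id"], []).append(entry["message"])
    return {
        t["trace_id"]: logs_by_trace.get(t["trace_id"], [])
        for t in traces
        if t["duration_ms"] > threshold_ms
    }

# Hypothetical sample data: one fast trace, one slow trace, three log lines.
traces = [
    {"trace_id": "t1", "service": "cart-service", "duration_ms": 120},
    {"trace_id": "t2", "service": "cart-service", "duration_ms": 4800},
]
logs = [
    {"trace_id": "t2", "message": "db connection pool exhausted"},
    {"trace_id": "t1", "message": "request ok"},
    {"trace_id": "t2", "message": "upstream timeout after 4500ms"},
]

correlated = slow_traces_with_logs(traces, logs, threshold_ms=1000)
for trace_id, messages in correlated.items():
    print(trace_id, messages)
```

The join only works if services propagate the trace ID into their log lines, so it is worth confirming that before relying on this kind of correlation during an incident.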

Can I self-host all of this, or do some pieces need SaaS?

Every tool in this pack has a fully self-hostable mode. The one SaaS dependency is PagerDuty itself, because the responder agent wraps the PagerDuty API. If you want OSS paging too, swap in OneUptime or Grafana OnCall; both are in the broader incident-response catalog. The other nine tools run on a laptop or a single VM for testing.

What's the minimum viable subset if I can only install three things this sprint?

Oncall-Guide + Devops Incident Responder + Alertmanager. The first two cut MTTA on the next incident; Alertmanager cuts the pager-fatigue tax that erodes everything else. Add SigNoz MCP next sprint, then Rundeck, then OpenStatus. Postmortem agent goes last because it only matters after you've had a postmortem-worthy incident.
