TOKREPO · AI Arsenal
New · this week

Production Incident Response Pack

Ten picks for the on-call engineer in the middle of a production fire. Paging, log and trace search via MCP, alert routing, runbook automation, status page, postmortem agent. Oncall-Guide + Devops Incident Responder + PagerDuty Responder + SigNoz MCP + Monoscope + Graylog + Alertmanager + Rundeck + OpenStatus + Incident Responder. Install in this order so the next alert meets a system, not a person.

10 resources

What's in this pack

It's 2:47 AM. PagerDuty just woke you. The error budget is gone in fourteen minutes. This pack is the rig you wish you'd installed last quarter — not a 50-tool observability shopping list, but the ten things the engineer mid-fire actually reaches for, in the order an incident actually unfolds.

Every pick here is open-source or has an OSS core, runs in your own infra, and earns its keystroke during the worst ten minutes of your week. The order is not alphabetical — it tracks the lifecycle: page in → triage → search → execute → communicate → write it up.

Install in this order

  1. Oncall-Guide — Incident Response Subagent — start here. Drop-in Claude Code subagent that walks the on-call checklist autonomously (deploy correlation, error spike triage, rollback decision). Inspired by Boris Cherny's oncall playbook. This is the brain the rest of the tools plug into.
  2. Claude Code Agent: Devops Incident Responder — the triage agent that runs the first 90 seconds: pulls recent deploys, checks dashboards, flags suspect commits. Bind it to a slash command in your editor and you've cut MTTA in half.
  3. Claude Code Agent: PagerDuty Incident Responder — wires the agent into PagerDuty itself. Acknowledges, escalates, posts updates to the incident channel. Removes the "is anyone looking at this?" Slack noise that eats the first five minutes.
  4. SigNoz MCP Server — Query Traces, Logs & Alerts — gives your agent a single MCP tool to grep distributed traces and logs side-by-side. When the agent says "the p99 latency spike correlates with deploy abc123 on cart-service", this is the data source.
  5. Monoscope — LLM Query for Logs/Traces/Metrics — natural-language log search across stacks. "Show me 5xx for /checkout in the last 15 minutes from the new pod" becomes one query instead of three Kibana dashboards. The agent uses it; humans use it when the agent is wrong.
  6. Graylog — Centralized Log Management — the log substrate if you don't already have one. SigNoz and Monoscope read from it; runbooks dump to it; the postmortem agent quotes from it. Self-hosted, no per-GB pricing trap.
  7. Prometheus Alertmanager — Alert Routing and Notification Hub — the routing brain that decides who gets paged, when alerts silence, and how to group flapping signals. Tune this before adding more dashboards. Most pager fatigue is an Alertmanager config problem, not a dashboard problem.
  8. Rundeck — Open Source Runbook Automation — the place runbooks become buttons. "Restart the worker pool", "flush the cache", "rotate the read replica" are jobs the on-call clicks instead of remembering. The agent can trigger them with permission gates.
  9. OpenStatus — Open-Source Monitoring and Status Page — public-facing status page, auto-updated from the same alerts. Saves the on-call from also being the comms lead. Customers see a yellow banner before they tweet at you.
  10. Claude Code Agent: Incident Responder — the postmortem-writing agent. Once mitigation is in, it scrapes the Slack channel, PagerDuty timeline, deploy history, and SigNoz queries into a five-whys draft you edit instead of write. Same agent type as #1, different prompt.
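
The grouping behavior mentioned in item 7 can be sketched in a few lines of Python. This is not Alertmanager's implementation, only the idea behind its `group_by` setting: alerts that share the same values for the chosen labels collapse into one notification. The label names and sample alerts are invented for illustration.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the labels named in group_by."""
    groups = defaultdict(list)
    for alert in alerts:
        # In this sketch a missing label is treated as an empty string.
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical alerts: three services hit by one upstream blip, plus one unrelated alert.
alerts = [
    {"labels": {"alertname": "HighLatency", "cluster": "prod-eu", "service": "cart"}},
    {"labels": {"alertname": "HighLatency", "cluster": "prod-eu", "service": "checkout"}},
    {"labels": {"alertname": "HighLatency", "cluster": "prod-eu", "service": "search"}},
    {"labels": {"alertname": "DiskFull", "cluster": "prod-us", "service": "logs"}},
]

# Grouping on every label is effectively no grouping: one notification per alert.
ungrouped = group_alerts(alerts, ["alertname", "cluster", "service"])
# Grouping on alertname + cluster folds the blip into a single notification.
grouped = group_alerts(alerts, ["alertname", "cluster"])

print(len(ungrouped), "notifications when every label is part of the key")
print(len(grouped), "notifications when grouped by alertname and cluster")
```

Which labels to group on is a judgment call: a key that is too fine pages once per service, a key that is too coarse hides unrelated failures inside one notification.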

How they fit together

PagerDuty page
   │
   ▼
PagerDuty Responder agent  ──── ack + first triage post
   │
   ▼
Devops Incident Responder  ──── pulls deploys, dashboards, suspect commits
   │
   ├──► SigNoz MCP   ──► traces + log correlation
   ├──► Monoscope    ──► natural-language log queries
   └──► Graylog      ──► raw log substrate
   │
   ▼
Alertmanager  ──── silence flapping signals, regroup
   │
   ▼
Rundeck  ──── execute runbook (restart / flush / failover)
   │
   ▼
OpenStatus  ──── public status page auto-updates
   │
   ▼
Incident Responder agent  ──── postmortem draft (five whys + timeline)

The loop closes when the postmortem agent finds the action item that, had it shipped last week, would have prevented the page. File the ticket. Sleep.
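
The Alertmanager to Rundeck hop in the diagram only works if each alert carries a pointer to its runbook. A minimal Python sketch of that check follows, assuming alerts arrive as an Alertmanager webhook payload and that your alert rules set a `runbook_url` annotation (a common convention, not something Alertmanager requires). The Rundeck URL and alert names are hypothetical.

```python
import json


def runbook_links(webhook_body):
    """Return {alertname: runbook_url or None} for firing alerts in a webhook payload."""
    payload = json.loads(webhook_body)
    links = {}
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved alerts need no runbook
        name = alert.get("labels", {}).get("alertname", "unknown")
        # None marks an alert that has no runbook wired up yet.
        links[name] = alert.get("annotations", {}).get("runbook_url")
    return links


# Hypothetical payload with one linked alert and one unlinked alert.
sample = json.dumps({
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "WorkerPoolStalled"},
            "annotations": {
                "runbook_url": "https://rundeck.example.internal/project/ops/job/show/restart-workers"
            },
        },
        {
            "status": "firing",
            "labels": {"alertname": "CacheHitRateLow"},
            "annotations": {},
        },
    ],
})

links = runbook_links(sample)
missing = [name for name, url in links.items() if url is None]
print(links)
print("alerts without a runbook:", missing)
```

Running a check like this over your alert rules before the next incident is a cheap way to find the alerts that would page someone with no button to press.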

Tradeoffs you'll hit

  • SigNoz vs Datadog — Datadog is the polished SaaS incumbent. SigNoz is the OSS bet you make when your bill goes from $4K to $40K/month and someone asks why. The MCP server is the bridge that makes either workable from an agent.
  • Monoscope vs grep + jq — for a 3-engineer team, grep + jq is fine. Past 50 services, you want natural-language search because no one remembers every service's log schema at 3 AM.
  • Rundeck vs raw shell scripts in a repo — raw scripts work until the on-call who wrote them is on PTO. Rundeck adds auth, audit log, and a "click to run" UI your future self will thank you for.
  • One postmortem agent vs writing it yourself — the agent's first draft is 70%. The 30% the human adds (context, intent, blameless framing) is what makes the doc useful. Don't ship the agent's draft unedited.
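
The auth and audit-log point in the Rundeck bullet can be made concrete with a small Python sketch. It does not call Rundeck, which has its own access control and execution history. It only shows the kind of gate an agent wrapper might apply before triggering a job. The class, the job names, and the approver are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class RunbookGate:
    # Jobs the agent may run unattended; anything else needs a named human approver.
    auto_approved: set
    audit_log: list = field(default_factory=list)

    def request(self, job, requested_by, approved_by=None):
        """Decide whether a job may run, and record the decision either way."""
        allowed = job in self.auto_approved or approved_by is not None
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "job": job,
            "requested_by": requested_by,
            "approved_by": approved_by,
            "allowed": allowed,
        })
        return allowed


gate = RunbookGate(auto_approved={"flush-cache"})

# Low-risk job: the agent can run it alone.
r1 = gate.request("flush-cache", requested_by="triage-agent")
# Risky job without approval: refused, but still logged.
r2 = gate.request("rotate-read-replica", requested_by="triage-agent")
# Same job with a human approver: allowed.
r3 = gate.request("rotate-read-replica", requested_by="triage-agent", approved_by="alice")

print(r1, r2, r3)
print(len(gate.audit_log), "audit entries")
```

Logging refused requests as well as allowed ones matters for the postmortem: it shows what the agent wanted to do, not only what it did.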

Common pitfalls

  • Wiring the triage agent without rate limits — first outage, the agent fires 200 SigNoz queries in 30 seconds and adds load to the system on fire. Set query budgets per incident.
  • Skipping Alertmanager grouping rules — without grouping, one upstream blip pages five teams. The Alertmanager group_by config is the difference between "useful page" and "on-call burns out in six weeks".
  • Status page lying because OpenStatus uses the same monitoring that's down — host the status page on independent infra. Different cloud, different DNS provider, different paging.
  • Postmortem-by-LLM with no human edit — the postmortem is the artifact that changes culture. An unedited LLM draft erodes trust in the practice. Always have a human in the loop on the final doc.
  • Runbooks in a wiki nobody reads — Rundeck only earns its keep if the runbooks are linked from the alerts. The Alertmanager → Rundeck link is the load-bearing wire.
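
The per-incident query budget from the first pitfall can be sketched as a rolling-window limiter. This is a minimal Python sketch under stated assumptions: the class name, the limits (20 queries per 30 seconds), and the incident IDs are illustrative, and the sketch is not tied to any particular agent or to the SigNoz MCP server.

```python
import time


class QueryBudget:
    """Cap how many queries an agent may issue per incident in a rolling window."""

    def __init__(self, max_queries, window_seconds, clock=time.monotonic):
        self.max_queries = max_queries
        self.window_seconds = window_seconds
        self.clock = clock
        self._stamps = {}  # incident id -> timestamps of recent allowed queries

    def allow(self, incident_id):
        now = self.clock()
        # Keep only the queries that still fall inside the window.
        recent = [t for t in self._stamps.get(incident_id, [])
                  if now - t < self.window_seconds]
        allowed = len(recent) < self.max_queries
        if allowed:
            recent.append(now)
        self._stamps[incident_id] = recent
        return allowed


# A fake clock keeps the demo deterministic.
fake_now = [0.0]
budget = QueryBudget(max_queries=20, window_seconds=30, clock=lambda: fake_now[0])

# The agent tries to fire 200 queries at once; only the budgeted ones get through.
allowed = sum(budget.allow("INC-1") for _ in range(200))

# Once the window has passed, the same incident may query again.
fake_now[0] = 31.0
allowed_later = budget.allow("INC-1")

print(allowed, allowed_later)
```

Budgets are tracked per incident, so a second incident is not starved by a noisy first one. Whether a refused query should be dropped, queued, or escalated to a human is a design choice this sketch leaves open.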

INSTALL · ONE COMMAND
$ tokrepo install pack/production-incident-response
hand it to your agent, or paste it into your terminal
What's inside

10 ready-to-install resources

Skill#01
oncall-guide — Incident Response Subagent

Open-source Claude Code subagent for incident response — walks the oncall checklist autonomously: deploys, errors, rollback. Inspired by Boris Cherny.

by Skill Factory·161 views
$ tokrepo install oncall-guide-incident-response-subagent-1a6b17c7
Skill#02
Claude Code Agent: Devops Incident Responder

Use when actively responding to production incidents, diagnosing critical service failures, or conducting incident postmortems to implement permanent fixes and preventative...

by TokRepo精选·27 views
$ tokrepo install claude-code-agent-devops-incident-responder-e30c19c4
Skill#03
Claude Code Agent: PagerDuty Incident Responder

Responds to PagerDuty incidents by analyzing incident context, identifying recent code changes, and suggesting fixes via GitHub PRs.

by TokRepo精选·26 views
$ tokrepo install claude-code-agent-pagerduty-incident-responder-d3f997e8
MCP#04
SigNoz MCP Server — Query Traces, Logs & Alerts

SigNoz MCP Server connects MCP clients to your SigNoz instance: query traces/logs, inspect alerts, and automate observability workflows using an API key.

by MCP Hub·86 views
$ tokrepo install signoz-mcp-server-query-traces-logs-alerts
Skill#05
Monoscope — LLM Query for Logs/Traces/Metrics

Monoscope stores logs/traces/metrics in S3-compatible buckets and lets you explore them with natural-language queries plus a CLI and self-hosted UI.

by Script Depot·65 views
$ tokrepo install monoscope-llm-query-for-logs-traces-metrics
Skill#06
Graylog — Centralized Log Management and Analysis Platform

Collect, index, and analyze log data from any source with a powerful search engine, real-time alerting, and customizable dashboards built for operations teams.

by AI Open Source·110 views
$ tokrepo install graylog-centralized-log-management-analysis-platform-68045e07
Skill#07
Prometheus Alertmanager — Alert Routing and Notification Hub

Alertmanager handles alerts sent by Prometheus, deduplicating, grouping, and routing them to the right notification channel such as email, Slack, PagerDuty, or webhooks.

by Script Depot·133 views
$ tokrepo install prometheus-alertmanager-alert-routing-notification-hub-51f92d7e
Skill#08
Rundeck — Open Source Runbook Automation and Job Scheduler

Automate operations tasks with Rundeck. Define runbooks as jobs with steps, schedule them, delegate execution to teams via self-service, and audit every action with built-in logging.

by AI Open Source·116 views
$ tokrepo install rundeck-open-source-runbook-automation-job-scheduler-d1bf0e61
Skill#09
OpenStatus — Open-Source Monitoring and Status Page Platform

OpenStatus is an open-source uptime monitoring and status page platform that checks endpoints from multiple regions, tracks latency and availability, and serves beautiful public status pages for your services.

by Script Depot·110 views
$ tokrepo install openstatus-open-source-monitoring-status-page-platform-ef13d2c6
Skill#10
Claude Code Agent: Incident Responder

Handles production incidents with urgency and precision. Use IMMEDIATELY when production issues occur. Coordinates debugging, implements fixes, and documents post-mortems.

by TokRepo精选·32 views
$ tokrepo install claude-code-agent-incident-responder-ee743381
Frequently asked questions

How long does it take to install this rig end-to-end?

Plan for a one-day spike to get the agents wired (Oncall-Guide + Devops Responder + PagerDuty Responder + Incident Responder), plus a week of background work to install the data substrate (Graylog + SigNoz + Alertmanager) if you don't already have it. Rundeck and OpenStatus each take an afternoon. The agents pay back in the first incident; the substrate pays back in the second.

Do I need all four Claude Code agents, or is one enough?

Three are load-bearing: Oncall-Guide (the playbook brain), Devops Incident Responder (first-90-seconds triage), and Incident Responder (postmortem writer). PagerDuty Responder is optional if you already have a tight PagerDuty workflow you don't want disrupted. The agents share context patterns but solve different lifecycle stages, so collapsing them into one mega-agent costs you specificity.

Why SigNoz MCP and Monoscope — aren't they overlapping?

SigNoz MCP gives the agent a structured query interface to traces and logs together (for example, correlating a slow trace with its log lines). Monoscope is mainly for humans typing natural-language queries when the agent has missed something. Different audience, different ergonomics. If your team is small and your stack is simple, you can ship with just SigNoz MCP and add Monoscope later.

Can I self-host all of this, or do some pieces need SaaS?

Every tool in this pack has a fully self-hostable mode. PagerDuty itself is SaaS (the responder agent wraps the PagerDuty API); if you want OSS paging too, swap in OneUptime or Grafana OnCall — both are in the broader incident-response catalog. The other nine tools run on a laptop or a single VM for testing.

What's the minimum viable subset if I can only install three things this sprint?

Oncall-Guide + Devops Incident Responder + Alertmanager. The first two cut mean time to acknowledge (MTTA) on the next incident; Alertmanager cuts the pager-fatigue tax that erodes everything else. Add SigNoz MCP next sprint, then Rundeck, then OpenStatus. The postmortem agent goes last because it only matters after you've had a postmortem-worthy incident.
