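TOKREPO · Theme Packs

Stable

Production Incident Response Toolkit

Ten tools for the on-call engineer in the middle of a fire, ordered by the incident response flow: oncall-guide skill + Devops Incident Responder + PagerDuty Responder + SigNoz MCP (logs + traces) + Monoscope (natural-language log queries) + Graylog + Alertmanager + Rundeck (runbook automation) + OpenStatus (status page) + Incident Responder agent (postmortem drafting). Once it is installed, the next alert that comes in meets a system, not a person.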

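10 assets

About this pack

What this pack contains

2:47 a.m. PagerDuty wakes you up. The SLO error budget burns out in 14 minutes. This pack is the scaffolding you wished you had installed last quarter: not a 50-tool observability shopping list, but the ten tools a firefighting engineer actually reaches for, arranged in the order an incident really unfolds.

Each one is open source or has an open-source core, can run on your own infrastructure, and earns a keyboard shortcut during the hardest ten minutes of your week. The order is not alphabetical; it follows the incident lifecycle: alert arrives → triage → query logs/traces → run the runbook → communicate externally → write the postmortem.

How they work together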

PagerDuty alert
   │
   ▼
PagerDuty Responder agent  ──── ack + first triage post
   │
   ▼
Devops Incident Responder  ──── pulls deploys / dashboards / suspect commits
   │
   ├──► SigNoz MCP   ──► trace + log correlation
   ├──► Monoscope    ──► natural-language log queries
   └──► Graylog      ──► raw log backbone
   │
   ▼
Alertmanager  ──── silences flapping signals, regroups alerts
   │
   ▼
Rundeck  ──── runs the runbook (restart / cache flush / failover)
   │
   ▼
OpenStatus  ──── public status page updates automatically
   │
   ▼
Incident Responder agent  ──── postmortem draft (five-whys + timeline)

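The loop is closed when the postmortem agent finds the action item that would have prevented this page had it shipped last week. Open a ticket. Go to sleep.

Trade-offs you will run into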

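SigNoz vs Datadog: Datadog is the closed-source SaaS heavyweight. SigNoz is the open-source fallback for when your bill goes from 4K to 40K a month and someone starts asking questions. The MCP server is the bridge that makes either one usable from an agent.
Monoscope vs grep + jq: for a three-person team, grep + jq is enough. Past 50 services you need natural-language search, because at 3 a.m. nobody remembers every service's log schema.
Rundeck vs shell scripts in the repo: bare scripts work fine until the on-call engineer who wrote them goes on annual leave. Rundeck adds authentication, an audit trail, and a click-to-run UI; your future self will thank you.
A postmortem agent vs writing it yourself: the agent's first draft gets you to about 70%. The remaining 30% (context, intent, blameless wording) is what makes the document actually useful. Do not publish the agent's draft unedited.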

Common pitfalls

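No rate limit on the triage agent: during the first incident the agent fires 200 SigNoz queries in 30 seconds, adding load to a system that is already on fire. Set a query budget per incident.
Skipping Alertmanager grouping rules: without grouping, one small upstream blip pages five teams. The group_by setting is the dividing line between useful pages and an on-call rotation that burns out in six weeks.
The status page lies because OpenStatus relies on the same monitoring that is already down: run the status page on independent infrastructure. Different cloud, different DNS, different paging path.
Publishing an LLM-written postmortem unedited: a postmortem is an artifact that changes culture. An unedited LLM draft erodes trust in the practice. The final version needs a human in the loop.
Runbooks living in a wiki nobody reads: Rundeck only pays for itself when alerts link to the runbooks. The Alertmanager → Rundeck link is load-bearing.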
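
The query-budget advice in the first pitfall can be sketched as a small per-incident limiter. This is a minimal illustration and not part of any tool in the pack: the class name IncidentQueryBudget, the helper simulate_burst, the cap of 20 queries, and the one-second spacing are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class IncidentQueryBudget:
    """Caps how many backend queries an agent may issue for one incident."""

    max_queries: int = 20        # hypothetical cap; tune per backend
    min_interval_s: float = 1.0  # hypothetical minimum spacing between queries
    used: int = 0
    last_query_at: float = float("-inf")

    def try_spend(self, now: float) -> bool:
        """Record the query and return True if it fits the budget, else False."""
        if self.used >= self.max_queries:
            return False  # budget exhausted: hand off to a human
        if now - self.last_query_at < self.min_interval_s:
            return False  # too soon after the previous query
        self.used += 1
        self.last_query_at = now
        return True


def simulate_burst(budget: IncidentQueryBudget, n_attempts: int, window_s: float) -> int:
    """Count how many of n_attempts, evenly spaced over window_s, are allowed."""
    step = window_s / n_attempts
    return sum(budget.try_spend(i * step) for i in range(n_attempts))


# The scenario from the pitfall: 200 attempted queries within 30 seconds.
budget = IncidentQueryBudget(max_queries=20, min_interval_s=1.0)
print(simulate_burst(budget, n_attempts=200, window_s=30.0))  # prints 20
```

With these assumed limits, only 20 of the 200 attempts reach the backend. A refused query is the agent's cue to escalate rather than retry in a loop.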

Install · one command

$ tokrepo install pack/production-incident-response

Hand it to an agent, or paste it into a terminal

What's in the pack

10 assets, packaged and ready

Skill#01

oncall-guide — Incident Response Subagent

Open-source Claude Code subagent for incident response — walks the oncall checklist autonomously: deploys, errors, rollback. Inspired by Boris Cherny.

by Skill Factory·311 views

$ tokrepo install oncall-guide-incident-response-subagent-1a6b17c7

Skill#02

Claude Code Agent: Devops Incident Responder

Use when actively responding to production incidents, diagnosing critical service failures, or conducting incident postmortems to implement permanent fixes and preventative...

by TokRepo Featured·175 views

$ tokrepo install claude-code-agent-devops-incident-responder-e30c19c4

Skill#03

Claude Code Agent: Pagerduty Incident Responder

Responds to PagerDuty incidents by analyzing incident context, identifying recent code changes, and suggesting fixes via GitHub PRs.

by TokRepo Featured·110 views

$ tokrepo install claude-code-agent-pagerduty-incident-responder-d3f997e8

MCP#04

SigNoz MCP Server — Query Traces, Logs & Alerts

SigNoz MCP Server connects MCP clients to your SigNoz instance: query traces/logs, inspect alerts, and automate observability workflows using an API key.

by MCP Hub·262 views

$ tokrepo install signoz-mcp-server-query-traces-logs-alerts

Skill#05

Monoscope — LLM Query for Logs/Traces/Metrics

Monoscope stores logs/traces/metrics in S3-compatible buckets and lets you explore them with natural-language queries plus a CLI and self-hosted UI.

by Script Depot·177 views

$ tokrepo install monoscope-llm-query-for-logs-traces-metrics

Skill#06

Graylog — Centralized Log Management and Analysis Platform

Collect, index, and analyze log data from any source with a powerful search engine, real-time alerting, and customizable dashboards built for operations teams.

by AI Open Source·216 views

$ tokrepo install graylog-centralized-log-management-analysis-platform-68045e07

Skill#07

Prometheus Alertmanager — Alert Routing and Notification Hub

Alertmanager handles alerts sent by Prometheus, deduplicating, grouping, and routing them to the right notification channel such as email, Slack, PagerDuty, or webhooks.

by Script Depot·221 views

$ tokrepo install prometheus-alertmanager-alert-routing-notification-hub-51f92d7e

Skill#08

Rundeck — Open Source Runbook Automation and Job Scheduler

Automate operations tasks with Rundeck. Define runbooks as jobs with steps, schedule them, delegate execution to teams via self-service, and audit every action with built-in logging.

by AI Open Source·193 views

$ tokrepo install rundeck-open-source-runbook-automation-job-scheduler-d1bf0e61

Skill#09

OpenStatus — Open-Source Monitoring and Status Page Platform

OpenStatus is an open-source uptime monitoring and status page platform that checks endpoints from multiple regions, tracks latency and availability, and serves beautiful public status pages for your services.

by Script Depot·188 views

$ tokrepo install openstatus-open-source-monitoring-status-page-platform-ef13d2c6

Skill#10

Claude Code Agent: Incident Responder

Handles production incidents with urgency and precision. Use IMMEDIATELY when production issues occur. Coordinates debugging, implements fixes, and documents post-mortems.

by TokRepo Featured·108 views

$ tokrepo install claude-code-agent-incident-responder-ee743381

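FAQ

How long does the full pack take to install?

Plan a one-day spike for wiring up the agents (Oncall-Guide + Devops Responder + PagerDuty Responder + Incident Responder). If you do not yet have a data layer (Graylog + SigNoz + Alertmanager), add another week in the background. Rundeck and OpenStatus take an afternoon each. The agents pay for themselves on the first incident; the data layer pays for itself on the second.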

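Do I need all four Claude Code agents, or is one enough?

Three are load-bearing: Oncall-Guide (the playbook brain), Devops Incident Responder (triage in the first 90 seconds), and Incident Responder (postmortem writing). PagerDuty Responder is optional: skip it if you already have a PagerDuty workflow you do not want disrupted. The agents share context patterns but cover different stages of the lifecycle; merging them into one large agent would cost that focus.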

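Do SigNoz MCP and Monoscope overlap?

SigNoz MCP gives the agent a structured query interface that searches traces and logs together (tying a slow trace to its matching log lines). Monoscope is for humans to type natural-language queries themselves when the agent comes up empty. Different audience, different ergonomics. If the team is small and the stack is simple, start with SigNoz MCP alone and add Monoscope later.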
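
The idea of tying a slow trace to its log lines can be pictured as a join on trace ID. The sketch below works on in-memory records only and does not call SigNoz; the function name logs_for_slow_spans and the field names trace_id, duration_ms, and message are assumptions for illustration, and the real SigNoz schema may differ.

```python
from collections import defaultdict


def logs_for_slow_spans(spans, logs, threshold_ms):
    """Map each slow span's trace_id to the log messages sharing that trace_id."""
    logs_by_trace = defaultdict(list)
    for line in logs:
        logs_by_trace[line["trace_id"]].append(line["message"])
    return {
        span["trace_id"]: logs_by_trace.get(span["trace_id"], [])
        for span in spans
        if span["duration_ms"] >= threshold_ms
    }


# Hypothetical records, made up for this example.
spans = [
    {"trace_id": "t1", "name": "GET /cart", "duration_ms": 2400},
    {"trace_id": "t2", "name": "GET /health", "duration_ms": 12},
]
logs = [
    {"trace_id": "t1", "message": "db pool exhausted, waiting for connection"},
    {"trace_id": "t2", "message": "ok"},
    {"trace_id": "t1", "message": "query timed out after 2000 ms"},
]

slow = logs_for_slow_spans(spans, logs, threshold_ms=1000)
print(slow)
```

Only trace t1 crosses the threshold, so the result holds its two log messages and nothing from the healthy trace.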

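Can everything be self-hosted, or is any SaaS required?

Every tool in the pack can be fully self-hosted, with one exception: PagerDuty itself is SaaS (the responder agent wraps the PagerDuty API). If you want paging to be open source as well, swap in OneUptime or Grafana OnCall, both of which are in the larger incident-response catalog. The other nine can all be run on a laptop or a single VM for testing.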

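If this sprint only has room for three, what is the minimum viable subset?

Oncall-Guide + Devops Incident Responder + Alertmanager. The first two cut MTTA on the next incident; Alertmanager removes the pager-fatigue tax that erodes everything else. Add SigNoz MCP next sprint, Rundeck the sprint after, and OpenStatus the one after that. Leave the postmortem agent for last: it only delivers value once you have an incident worth reviewing.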
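
To make the pager-fatigue point concrete, here is a rough Python analogue of what Alertmanager's group_by does to notification counts. It is a simplification under stated assumptions: the alert labels and service names are invented, the helper group_alerts is hypothetical, and Alertmanager's timing settings (group_wait, group_interval) and routing tree are left out.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the given labels.

    Each bucket stands for one notification, loosely mirroring how
    Alertmanager batches alerts that share their group_by label values.
    """
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Hypothetical alerts: one upstream blip seen by five downstream services.
SERVICES = ["checkout", "search", "cart", "auth", "billing"]
alerts = [
    {"alertname": "UpstreamLatencyHigh", "cluster": "prod-eu", "service": s}
    for s in SERVICES
]

ungrouped = group_alerts(alerts, ["alertname", "cluster", "service"])
grouped = group_alerts(alerts, ["alertname", "cluster"])

print(len(ungrouped))  # 5 notifications, one per service
print(len(grouped))    # 1 notification covering all five alerts
```

Grouping on alertname and cluster folds the five alerts into a single notification, which is the difference between one page and five for the same underlying blip.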
