Skills · Apr 28, 2026 · 2 min read

oncall-guide — Incident Response Subagent

Open-source Claude Code subagent for incident response — walks the oncall checklist autonomously: deploys, errors, rollback. Inspired by Boris Cherny.

Skill Factory · Community

Agent-ready

Install with prior review

This asset requires review. The copied prompt asks for a dry run, shows the writes it would make, and continues only after confirmation.

Needs Confirmation · 66/100 · Policy: confirm

Agent surface: Any MCP/CLI agent
Type: Skill
Installation: Single
Trust: Established
Entry: Install oncall-guide and run the incident playbook

Command with prior review

npx -y tokrepo@latest install 1a6b17c7-03dd-4d7d-a511-def683b9c5e8 --target codex

Do a dry run first, confirm the writes, and then run this command.

TL;DR

oncall-guide is a community-built Claude Code subagent that automates the first 5 minutes of incident response. It parses the alert, pulls the last 3 production deploys with git log, correlates Sentry errors since the most recent deploy, searches the #incidents Slack channel, looks up the matching runbook, and recommends ROLLBACK, INVESTIGATE, or WAIT with the exact next command. It is strictly read-only and never auto-rolls back, restarts, or pages anyone.

§01

What oncall-guide Does at 3 AM

When a page lands at 3 AM, the worst part is rarely the bug itself. The worst part is reconstructing context: which deploy went out, which dashboard to check, whether to roll back, who to call. oncall-guide is a Claude Code subagent that automates that opening checklist so the human can focus on the actual fix.

The subagent is a single Markdown file you save to .claude/agents/oncall-guide.md. After a /agents reload, you invoke it the moment a page lands by saying: "Use oncall-guide. The alert is: <paste alert text>." It then runs a deterministic 7-step workflow, emits a structured report, and proposes one of three recommendations: ROLLBACK, INVESTIGATE, or WAIT.

It is read-only by design. It will not roll back, will not silence alerts, and will not page anyone. It hands you the next exact command and lets you press enter. This boundary is deliberate; oncall systems where a subagent can mutate production are how 2 AM incidents become 4 AM postmortems.

Inspired by Boris Cherny’s public description of his Claude Code workflow on howborisusesclaudecode.com, this is a community-written equivalent—not Anthropic’s private subagent. The Pragmatic Engineer interview with Cherny describes how he hands oncall-style automation to subagents during paging events.

§02

Why a Subagent Beats a Slack Bot for the First 5 Minutes

A common target among incident responders is an MTTA (mean time to acknowledge) below 5 minutes for production-impacting alerts. Google’s SRE book identifies the first 5 minutes as the highest-leverage window in any incident, where context-gathering dominates time-to-mitigate. Yet a responder woken at night can spend 3-4 of those minutes on rote tasks: opening 6 tabs, running git log, scrolling Sentry, searching #incidents.

oncall-guide collapses those 3-4 minutes into roughly 30 seconds of subagent execution. The reason it works as a Claude Code subagent (not a separate bot) is that the subagent shares your terminal’s git context, your local runbook checkout, and any MCP servers you already have configured. There is no new auth, no separate Slack app, no PagerDuty plugin. Setup time is consistently under 2 minutes.

Four design choices make oncall-guide trustworthy at 3 AM:

Explicit decision logic — the rollback/investigate/wait branches are written in plain English in the subagent file, so you can audit and tune them per service.
Graceful degradation — if Sentry MCP or Slack MCP is missing, those steps are skipped with an explicit “not available” note rather than silent failure.
No destructive actions — the boundary section forbids rollback, restart, alert silencing, and paging.
Structured output — the report is a fixed-format text block you can paste verbatim into the incident channel.

§03

How oncall-guide Works: The 7-Step Workflow

The prompt_template in the subagent file defines a fixed 7-step workflow. The subagent does not improvise these steps; it walks them in order every time, which is what makes the output predictable enough to trust during a P0.

§04

Workflow

1. Parse the alert: service, severity, metric, threshold, time window.
2. Get the last 3 production deploys: git log origin/main --oneline -3 --since='6 hours ago'.
3. If Sentry MCP is available, fetch the top 3 new issues since the most recent deploy.
4. If Slack MCP is available, search the #incidents channel for related chatter in the last hour.
5. Check RUNBOOKS/<service>.md or docs/runbooks/<service>.md for a matching playbook.
6. Decide a recommendation:
   - Rollback if a deploy < 1h ago correlates with the metric spike
   - Investigate if no deploy correlation but new errors visible
   - Wait if metric is recovering on its own (last 2 datapoints trending down)
7. Emit the report below.


The steps lean on cheap, local operations where they can: parsing alert text is free, `git log` takes ~50ms, and the runbook lookup in step 5 is a local filesystem read. Only the two optional sources reach over the network: Sentry MCP costs an API call and Slack search costs another. If any optional source is unavailable, the subagent reports it explicitly rather than silently dropping the step.
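Step 6 needs the age of the most recent deploy, which `--oneline` output does not carry. The sketch below assumes a variant of the step 2 command that adds commit timestamps, `git log origin/main -3 --format=%h%x09%ct%x09%s` (abbreviated hash, committer Unix timestamp, and subject, tab-separated). The `Deploy` type, the helper names, and the second sample commit are hypothetical; the subagent file itself only specifies the `--oneline` form.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Deploy:
    sha: str
    committed_at: int  # committer date as a Unix timestamp (seconds)
    subject: str


def parse_deploys(git_output: str) -> List[Deploy]:
    """Parse tab-separated `%h %ct %s` lines into Deploy records."""
    deploys = []
    for line in git_output.splitlines():
        if not line.strip():
            continue  # ignore blank lines in the captured output
        # Split on the first two tabs only, so a subject may contain tabs.
        sha, timestamp, subject = line.split("\t", 2)
        deploys.append(Deploy(sha=sha, committed_at=int(timestamp), subject=subject))
    return deploys


def minutes_since(deploy: Deploy, now: int) -> int:
    """Whole minutes between the commit and `now` (both Unix seconds)."""
    return max(0, (now - deploy.committed_at) // 60)


# Sample output; the second commit is invented for illustration.
sample = "a1b2c3d\t1700000000\tMigrate Stripe SDK\n9f8e7d6\t1699990000\tBump logging"
deploys = parse_deploys(sample)
print(deploys[0].sha, minutes_since(deploys[0], now=1700000720))  # a1b2c3d 12
```

Treating every commit on origin/main as a production deploy follows the workflow's own shortcut; a team that deploys from tags or a release branch would need a different ref.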

The decision logic in step 6 maps to three real incident archetypes recognized in Google’s SRE workbook chapter on incident management: deploy-induced regressions (rollback), latent bugs surfaced by traffic (investigate), and self-healing transient spikes (wait). Putting these branches in the subagent file rather than burying them in tool code means an oncall lead can adjust the 1-hour deploy correlation window for, say, a slower-rolling service simply by editing one line.
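The three branches can be read as a small decision function. This is only an illustrative sketch, not the subagent's code: the real logic is plain English in the prompt and is applied by the model. Several things here are assumptions: the function and parameter names, the order in which branches are checked, treating any deploy inside the window as "correlated", reading "last 2 datapoints trending down" as two consecutive decreases, and falling back to INVESTIGATE when no branch matches.

```python
from typing import Optional, Sequence


def recommend(
    minutes_since_deploy: Optional[float],
    new_error_count: int,
    metric_points: Sequence[float],
    correlation_window_min: float = 60.0,  # the 1-hour window, tunable per service
) -> str:
    """Return ROLLBACK, INVESTIGATE, or WAIT following the step 6 branches."""
    # Branch 1: a deploy landed inside the correlation window.
    # Assumption: recency alone stands in for "correlates with the spike".
    if minutes_since_deploy is not None and minutes_since_deploy < correlation_window_min:
        return "ROLLBACK"
    # Branch 2: no deploy correlation, but new errors are visible.
    if new_error_count > 0:
        return "INVESTIGATE"
    # Branch 3: the metric is recovering (two consecutive decreases).
    if len(metric_points) >= 3 and metric_points[-1] < metric_points[-2] < metric_points[-3]:
        return "WAIT"
    # Assumption: with no signal either way, keep a human looking.
    return "INVESTIGATE"


print(recommend(12, 412, [1.0, 3.0, 6.0]))   # ROLLBACK
print(recommend(None, 0, [9.0, 6.0, 4.0]))   # WAIT
print(recommend(180, 0, [9.0, 6.0, 4.0], correlation_window_min=240))  # ROLLBACK
```

The last call shows the one-line tuning described above: widening the window to 4 hours changes the outcome for a slower-rolling service without touching the other branches.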

§05

The `oncall-guide.md` File: Frontmatter and Tools

The subagent’s frontmatter declares two things: which tools it can call, and a one-line description that Claude Code uses to decide when to suggest the subagent.

---
name: oncall-guide
description: Walk the oncall opening checklist — recent deploys, error correlation, runbook lookup, rollback decision. Use when paged.
tools: Bash, Read, Grep, Glob, mcp__sentry__*, mcp__slack__*
---

The tools field is a whitelist. Bash is required for git log. Read, Grep, and Glob cover runbook lookup. The two MCP wildcards (mcp__sentry__*, mcp__slack__*) are optional; if those MCP servers are not connected, Claude Code will not match them and the subagent’s steps 3 and 4 fall through. There is no Write, no Edit, and no mcp__github__create_*, so the subagent has no tool for editing files or creating GitHub objects. Bash can still run arbitrary commands, which is why the boundaries section of the prompt, and not the tool list alone, carries the read-only guarantee.
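A team that wants to audit the tool list before installing could check it mechanically. The sketch below is a hypothetical helper, not part of oncall-guide: it reads the `tools` line from the frontmatter with plain string handling (no YAML library) and flags any name from a deny-list chosen here for illustration. The description line is shortened for the example.

```python
from typing import List

FRONTMATTER = """---
name: oncall-guide
description: Walk the oncall opening checklist. Use when paged.
tools: Bash, Read, Grep, Glob, mcp__sentry__*, mcp__slack__*
---
"""

# Hypothetical deny-list for this check; adjust to your own policy.
WRITE_TOOLS = {"Write", "Edit"}
WRITE_PREFIXES = ("mcp__github__create_",)


def declared_tools(frontmatter: str) -> List[str]:
    """Return the comma-separated names on the `tools:` line, if any."""
    for line in frontmatter.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "tools":
            return [name.strip() for name in value.split(",") if name.strip()]
    return []


def write_capable(tools: List[str]) -> List[str]:
    """Return declared tools that appear on the deny-list."""
    return [t for t in tools if t in WRITE_TOOLS or t.startswith(WRITE_PREFIXES)]


tools = declared_tools(FRONTMATTER)
print(tools)
print(write_capable(tools))  # []
```

This check only covers declared tool names; it says nothing about what an individual Bash command may do.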

Anthropic’s subagent documentation specifies that subagents run in their own context window, isolated from the parent conversation. This is exactly what you want during an incident: a tight 7-step workflow without the distraction of whatever you were doing before the page landed.

§06

Example Session: A Real-Looking Page

The README in the prompt template ships with a worked example that mirrors a typical Stripe SDK incident:

You:    "Use oncall-guide. Alert: payments-api error rate > 5% for 3min."
Claude: -> last deploys: a1b2c3d "Migrate Stripe SDK" (12m ago)
        -> Sentry: NEW issue PaymentError x 412 events since 11m ago
        -> runbook: RUNBOOKS/payments-api.md (matched section: SDK errors)
        -> Recommendation: ROLLBACK
        -> Why: deploy 12m ago directly correlates with error spike
        -> Next step: gh workflow run rollback.yml -f sha=<previous>

The first four lines of Claude's reply map to steps 2, 3, 5, and 6 of the workflow. Step 4 (Slack chatter) is omitted in this example because nothing matched. The final line surfaces the runbook’s rollback command verbatim; the responder reads it, understands what it does, and chooses to press enter.
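Step 1 does not appear in the session output, but it is what turns the pasted alert into the fields the later steps use. The subagent leaves this parsing to the model; the regex below is only a hypothetical illustration of the fields involved, written against the one alert format shown in the example. Severity is left out because the sample alert does not state one.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern for alerts shaped like
# "<service> <metric> <op> <threshold> for <window>".
ALERT_RE = re.compile(
    r"^(?P<service>[\w.-]+)\s+(?P<metric>.+?)\s*(?P<op>>=|<=|>|<)\s*"
    r"(?P<threshold>\d+(?:\.\d+)?%?)\s+for\s+(?P<window>\d+\s*(?:s|sec|min|m|h))\.?$"
)


def parse_alert(text: str) -> Optional[Dict[str, str]]:
    """Extract service, metric, operator, threshold, and window, or None."""
    match = ALERT_RE.match(text.strip())
    return match.groupdict() if match else None


print(parse_alert("payments-api error rate > 5% for 3min."))
# {'service': 'payments-api', 'metric': 'error rate', 'op': '>',
#  'threshold': '5%', 'window': '3min'}
```

Real alert text varies far more than this pattern allows, which is one reason to let the model do the parsing and keep a fixed report format on the output side instead.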

Compare this to the unstructured alternative — ChatGPT in a browser tab, an SRE typing “what happened?” — which produces a different shape of output every time and cannot run git log against the actual repo. The structured 7-step workflow is what makes oncall-guide useful enough to keep around.

§07

When to Use (and When Not To)

The README is explicit about scope:

Use oncall-guide when:

A page lands and you need the first 5 minutes of context fast
You take an incident handoff and want to reconstruct what was visible
You are writing a postmortem and need to recover the timeline

Do not use oncall-guide for:

Chronic capacity or cost issues — those need analysis, not triage
Non-production environments unless they affect users
Ongoing incidents where context is already established (let humans drive)

The subagent is a triage tool, not a resolution tool. Once you have the recommendation and the next-step command, you exit the subagent and run the actual remediation in your normal Claude Code session (or by hand). Trying to push oncall-guide into resolution work breaks the read-only boundary that makes it safe in the first place.

§08

Five Common Questions Before Adoption

The FAQ section of the prompt template addresses the questions teams ask before adoption. Two are worth highlighting here.

“Will it actually roll back?” No. The recommendation line says ROLLBACK; the next-step line gives you the exact gh workflow run rollback.yml command; you press enter. Auto-rollback at 3 AM from a subagent is exactly the failure mode the boundaries section forbids.

“Do I need all the MCPs?” No. Graceful degradation is built into steps 3 and 4. Without Sentry MCP, error correlation is skipped with a Sentry: not configured line. Without Slack MCP, the chatter check is skipped. The deploy check (step 2), runbook lookup (step 5), and decision logic (step 6) work with stock Claude Code.
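The degradation behavior can be sketched as a wrapper around each optional source. Everything below is an assumption made for illustration: the function names, the line labels other than the "Sentry: not configured" wording quoted above, and the use of plain callables in place of MCP tool calls.

```python
from typing import Callable, Optional


def optional_step(label: str, fetch: Optional[Callable[[], str]]) -> str:
    """Run an optional source, or say explicitly why it produced nothing."""
    if fetch is None:
        return f"{label}: not configured"
    try:
        return f"{label}: {fetch()}"
    except Exception as exc:
        # Surface the failure in the report instead of dropping the step.
        return f"{label}: not available ({exc})"


def build_report(
    deploys: str,
    runbook: str,
    sentry: Optional[Callable[[], str]] = None,
    slack: Optional[Callable[[], str]] = None,
) -> str:
    """Assemble report lines; required steps always run, optional ones degrade."""
    lines = [
        f"Deploys: {deploys}",
        optional_step("Sentry", sentry),
        optional_step("Slack", slack),
        f"Runbook: {runbook}",
    ]
    return "\n".join(lines)


print(build_report(
    deploys='a1b2c3d "Migrate Stripe SDK" (12m ago)',
    runbook="RUNBOOKS/payments-api.md",
    sentry=lambda: "NEW issue PaymentError x 412 events",
))
```

With no Slack source passed in, the third line of the report reads "Slack: not configured", so a reader of the incident channel can tell a skipped check from a check that found nothing.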

The Pragmatic Engineer’s interview with Boris Cherny mentions that subagents handle 30-40% of his daily Claude Code workload, which suggests the pattern is durable enough for production teams. oncall-guide is the open-source version of that pattern applied to incident response.

§09

Composing oncall-guide With Other TokRepo Subagents

oncall-guide is one piece of a broader subagent library on TokRepo. After triage, the next-step command might trigger another workflow:

If ROLLBACK → hand off to a one-shot commit/PR slash command for the revert.
If INVESTIGATE → hand off to a Sentry triage subagent that classifies the new errors.
If the page comes from CI failure rather than production → use a build-validator subagent instead.

The handoff happens in your terminal: oncall-guide ends, you read the recommendation, you call the next subagent. There is no orchestration framework in the middle, which is intentional — oncall surface area at 3 AM should not depend on a coordinator process that itself can fail.

Frequently asked questions

Will oncall-guide actually roll back the deploy for me?

No. oncall-guide is read-only by design. It outputs a Recommendation line (ROLLBACK, INVESTIGATE, or WAIT) and a Next step line containing the exact command, but the human responder presses enter. Auto-rollback from a subagent is the failure mode the boundaries section explicitly forbids.

Do I need Sentry MCP and Slack MCP for oncall-guide to work?

No. The subagent uses graceful degradation. Without Sentry MCP, the error-correlation step is skipped with an explicit not-configured note. Without Slack MCP, the chatter check is skipped. The deploy check, runbook lookup, and decision logic all run with stock Claude Code Bash and Read tools.

How does oncall-guide define a metric spike or correlation?

It uses your alert's own threshold and duration plus a 1-hour deploy-correlation window. The decision logic is written in plain English in the subagent file, so you can tune the window per service: a slow-rolling service might use 4 hours, a hotfix-heavy service might use 30 minutes.

Can oncall-guide page another engineer for a severe issue?

No. Paging is a disruptive action that affects the whole team, and the subagent is read-only. If the recommendation is ROLLBACK and the runbook says escalate, the Next step line will surface the escalation command (PagerDuty, Opsgenie) for the human to run. The subagent never auto-pages.

Is this the actual subagent Boris Cherny uses internally?

No. oncall-guide is a community-written equivalent based on Boris Cherny's public description on howborisusesclaudecode.com and his Pragmatic Engineer interview. It is not Anthropic's private subagent. The pattern (read-only triage, structured 7-step workflow) is the value, not the exact prompt.

How fast can a team install and start using oncall-guide?

Setup is under 2 minutes. Save the Markdown file to .claude/agents/oncall-guide.md, run /agents reload in Claude Code, and the next page is the first invocation. Optional MCP servers (Sentry, Slack) take longer to wire up but are not required for the deploy and runbook steps to work.

References (5)

— Anthropic: Claude Code subagents documentation
— Google SRE Workbook: Incident Response
— The Pragmatic Engineer: Building Claude Code with Boris Cherny
— Anthropic: Model Context Protocol
— GitHub: getsentry/sentry-mcp

Related on TokRepo

Sentry errors auto-triage subagent for follow-up investigation
One-shot commit, push, and PR slash command for fast rollbacks
Build validator CI validation subagent for failed pipelines
Loop local recurring task scheduler in the Boris Cherny style
Code architect architecture review subagent for postmortems
Verify app E2E test subagent for Claude Code regressions

Source and acknowledgments

Inspired by Boris Cherny's oncall workflow on howborisusesclaudecode.com.

Citations:

howborisusesclaudecode.com
Pragmatic Engineer interview: https://newsletter.pragmaticengineer.com/p/building-claude-code-with-boris-cherny
Get Push To Prod: https://getpushtoprod.substack.com/p/how-the-creator-of-claude-code-actually

Related assets

Grafana OnCall — Open Source Incident Response and On-Call Management

Manage on-call schedules and incident routing with Grafana OnCall. Integrates natively with Grafana alerting for automated escalations, multi-channel notifications, and team rotation management.

Skills

Grafana Labs

TheHive — Open Source Security Incident Response Platform

TheHive is a scalable, open-source security incident response platform that helps SOC teams investigate alerts, collaborate on cases, and automate response workflows.

Skills

AI Open Source

Claude Code Agent: Incident Responder — Debug Production Issues

Claude Code agent for incident response. Analyze logs, trace errors, identify root causes, and generate postmortem reports.

Skills

Skill Factory

GRR Rapid Response — Google's Open Source Incident Response Framework

A remote live forensics framework by Google for large-scale incident response, enabling security teams to collect artifacts and investigate endpoints at enterprise scale.

Scripts

Script Depot
