Anthropic just dropped a new entry in its “Cooking with Claude” series, this time walking through how to build an SRE incident response agent. According to Anthropic, the idea is to hand Claude the messy, high-pressure work of triaging outages (reading alerts, pulling context, and proposing fixes) while a human stays in the loop. That matters because incident response is a hard proving ground for AI agents: the stakes are real, the data is live, and the payoff is faster recovery when systems break.
Here’s a practical guide built around that approach.
Quick Start
What you’ll learn: how to stand up an AI agent that helps your on-call engineers respond to incidents faster.
What you need: access to the Claude API, your observability and logging tools, a runbook or two, and a safe place to test before anything touches production.
The Steps
- Define the agent’s job. Start narrow. Pick one painful slice of incident response, like triaging alerts or summarizing what changed before an outage. A tight scope keeps the agent useful and easy to evaluate. Trying to automate the whole pager at once is how these projects stall.
- Give the agent tools, not just text. An incident agent is only as good as what it can reach. Wire up tools for querying logs, checking metrics, pulling recent deploys, and reading runbooks. With Claude’s tool use, the model decides when to request one of these calls; your code runs the call and returns the result, so nobody has to paste data in by hand.
- Feed it the right context. Connect your monitoring, dashboards, and past incident records. The agent needs to see what’s actually happening to reason about it. Garbage context produces confident, wrong answers, so curate what you expose.
- Write a clear system prompt. Tell the agent who it is, what it can touch, and how cautious to be. Spell out the escalation rules. A good system prompt is the difference between a helpful assistant and one that guesses past its authority.
- Keep a human in the loop. Let the agent investigate and recommend, but gate any action that changes production behind human approval. This is the single most important guardrail. An agent that can restart services on its own is a new way to cause outages.
- Test against real past incidents. Replay old incidents and check whether the agent reaches the right conclusions. This gives you an honest read on accuracy before you trust it live. Treat it like any other system: measure before you ship.
- Roll out in shadow mode first. Run the agent alongside your team without acting on its output. Compare its diagnoses and recommendations to what humans actually did. Once it earns trust, widen its responsibilities step by step.
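To make the tool and approval steps concrete, here is a minimal sketch in Python. It assumes two hypothetical tools, `query_logs` and `restart_service`, with stubbed handlers. The `name` / `description` / `input_schema` shape mirrors the tool definition format used by Claude's tool use API; the `dispatch` function, the `READ_ONLY` set, and the `approve` callback are illustrative names, not part of Anthropic's guide.

```python
from typing import Any, Callable, Dict, List

# Hypothetical tool definitions. Only the name/description/input_schema
# layout follows Claude's tool use format; the tools are made up.
TOOLS: List[Dict[str, Any]] = [
    {
        "name": "query_logs",
        "description": "Search recent logs for a service.",
        "input_schema": {
            "type": "object",
            "properties": {
                "service": {"type": "string"},
                "query": {"type": "string"},
            },
            "required": ["service", "query"],
        },
    },
    {
        "name": "restart_service",
        "description": "Restart a service. Changes production.",
        "input_schema": {
            "type": "object",
            "properties": {"service": {"type": "string"}},
            "required": ["service"],
        },
    },
]

# Tools that never change production and can run without sign-off.
READ_ONLY = {"query_logs"}


def query_logs(service: str, query: str) -> List[str]:
    # Stub: a real handler would call your logging backend.
    return [f"{service}: sample line matching '{query}'"]


def restart_service(service: str) -> str:
    # Stub: a real handler would call your orchestrator.
    return f"restarted {service}"


HANDLERS: Dict[str, Callable[..., Any]] = {
    "query_logs": query_logs,
    "restart_service": restart_service,
}


def dispatch(name: str, args: Dict[str, Any],
             approve: Callable[[str, Dict[str, Any]], bool]) -> Dict[str, Any]:
    """Run a tool the model asked for, gating anything that is not read-only."""
    if name not in HANDLERS:
        return {"status": "error", "detail": f"unknown tool: {name}"}
    if name not in READ_ONLY and not approve(name, args):
        return {"status": "denied", "detail": "human approval required"}
    return {"status": "ok", "result": HANDLERS[name](**args)}
```

In practice `approve` would page or prompt an on-call engineer. Defaulting every unlisted tool to "needs approval" means a newly added write tool is gated until someone deliberately marks it read-only.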
Tips and Warnings
- Scope creep is the enemy. Add capabilities only after the current ones prove reliable.
- Log everything the agent does: every tool call, its inputs, and what came back. You’ll want an audit trail when reviewing incidents later.
- Don’t let it act without approval on anything destructive. Read access is cheap; write access is dangerous.
- Watch for confident hallucinations. The agent should cite the logs and metrics behind its conclusions so engineers can verify fast.
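The logging and citation tips can be sketched in a few lines of Python. This is an illustration under assumptions: the `AuditLog` class, the `has_evidence` check, and the `conclusion` / `evidence` field names are all hypothetical, and a real deployment would write to durable storage rather than an in-memory list.

```python
import json
import time
from typing import Any, Dict, List


class AuditLog:
    """In-memory audit trail of tool calls (hypothetical helper)."""

    def __init__(self) -> None:
        self.entries: List[Dict[str, Any]] = []

    def record(self, tool: str, args: Dict[str, Any], result: Any) -> Dict[str, Any]:
        # Keep the timestamp, the tool name, its inputs, and its output.
        entry = {"ts": time.time(), "tool": tool, "args": args, "result": result}
        self.entries.append(entry)
        return entry

    def to_jsonl(self) -> str:
        # One JSON object per line, easy to ship to a log store.
        return "\n".join(json.dumps(e, sort_keys=True) for e in self.entries)


def has_evidence(finding: Dict[str, Any]) -> bool:
    """Accept a finding only if it has a conclusion and cites some evidence."""
    evidence = finding.get("evidence") or []
    return bool(finding.get("conclusion")) and len(evidence) > 0
```

A finding that fails `has_evidence` could be sent back to the agent or flagged for the engineer, so that unsupported conclusions never appear next to verified ones.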
Why This Matters
What stands out here is the shift in what agents are being trusted with. SRE work is technical, time-sensitive, and consequential, exactly the kind of task that used to feel off-limits for AI. Anthropic putting out a hands-on build guide signals that agentic workflows are moving from demos into operational tooling. The pattern generalizes too. Swap the runbooks and tools, and the same skeleton works for security triage, customer support escalations, or data pipeline failures.
Next Steps
Start by picking your single highest-volume alert type and building an agent that only triages it. Measure how often it’s right against your incident history. From there, add one tool at a time, expand the context it can see, and slowly move from shadow mode to trusted recommendations. Anthropic has the full walkthrough on its labs site if you want the detailed build.
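Measuring the agent against incident history can start as a small replay harness. The sketch below is a toy under stated assumptions: `HISTORY` is invented sample data, `keyword_triage` is a stand-in for the real agent call, and exact match on a `root_cause` label is a simplification of how you would actually grade a diagnosis.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Invented sample incidents; replace with your own postmortem records.
HISTORY: List[Dict[str, str]] = [
    {"alert": "Disk usage above 95% on db-1", "root_cause": "disk_full"},
    {"alert": "p99 latency spike on checkout", "root_cause": "bad_deploy"},
    {"alert": "p99 latency spike on search", "root_cause": "cache_eviction"},
    {"alert": "5xx rate elevated on api", "root_cause": "bad_deploy"},
]


def keyword_triage(alert: str) -> str:
    # Stand-in for the agent: a real version would call the model.
    text = alert.lower()
    if "disk" in text:
        return "disk_full"
    if "latency" in text:
        return "bad_deploy"
    return "unknown"


def replay(incidents: Iterable[Dict[str, str]],
           triage: Callable[[str], str]) -> Tuple[float, List[Dict[str, str]]]:
    """Return (accuracy, misses) for a triage function over past incidents."""
    incidents = list(incidents)
    if not incidents:
        return 0.0, []
    misses = []
    for inc in incidents:
        guess = triage(inc["alert"])
        if guess != inc["root_cause"]:
            # Keep the wrong guess so a human can review each miss.
            misses.append({**inc, "guess": guess})
    return 1 - len(misses) / len(incidents), misses


accuracy, misses = replay(HISTORY, keyword_triage)
```

The list of misses is usually more useful than the accuracy number, since it shows which alert types the agent is not ready to handle.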