Anthropic Lifts the Lid on How It Cages Claude

Anthropic just published a detailed breakdown of how it sandboxes its AI agents across Claude.ai, Claude Code, and Cowork, and it’s the kind of documentation the industry rarely sees. According to Simon Willison, who flagged the writeup on his link blog, it’s “a fantastic overview” of techniques that usually stay hidden behind vague security marketing. Willison has a long-standing gripe with sandboxing products: they’re seldom documented well, and without detail, you can’t really trust them. This release tackles that head-on.

What Anthropic Actually Disclosed

The core idea is containment. Anthropic constrains where and how an agent can act using a stack of techniques: process sandboxes, virtual machines, filesystem boundaries, and egress controls. The goal, in their words, is to “set a hard boundary on what an agent can reach.”

The specifics vary by product:

  • Claude.ai runs on gVisor, Google’s container sandbox that intercepts system calls.
  • Claude Code, running locally, uses Seatbelt on macOS and Bubblewrap on Linux.
  • Claude Cowork spins up a full VM, using Apple’s Virtualization framework on macOS and HCS (the Host Compute Service) on Windows.
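
To make the process-sandbox idea concrete, here is a minimal Python sketch that assembles a Bubblewrap (bwrap) command line. It is not Anthropic's configuration: the function name build_bwrap_command, the choice of mounts, and the /work/project path are assumptions for illustration, and the /usr layout with symlinks assumes a merged-usr Linux distribution. The bwrap options themselves (--ro-bind, --bind, --symlink, --proc, --dev, --tmpfs, --chdir, --unshare-net, --die-with-parent) are standard Bubblewrap flags.

```python
from typing import List


def build_bwrap_command(
    workdir: str, command: List[str], allow_network: bool = False
) -> List[str]:
    """Build a bwrap argv exposing /usr read-only plus one writable directory.

    Hypothetical helper: the mount layout is an illustrative assumption,
    not the policy any particular product ships with.
    """
    argv = [
        "bwrap",
        "--ro-bind", "/usr", "/usr",      # system binaries and libraries, read-only
        "--symlink", "usr/bin", "/bin",   # merged-usr layout (distro-dependent)
        "--symlink", "usr/lib", "/lib",
        "--symlink", "usr/lib64", "/lib64",
        "--proc", "/proc",                # fresh procfs for the sandbox
        "--dev", "/dev",                  # minimal device nodes
        "--tmpfs", "/tmp",                # private, empty /tmp
        "--bind", workdir, workdir,       # the only writable host path
        "--chdir", workdir,
        "--die-with-parent",              # kill the sandbox if the launcher exits
    ]
    if not allow_network:
        argv.append("--unshare-net")      # new network namespace: no egress
    # bwrap treats the first non-option argument as the command to run.
    return argv + list(command)


if __name__ == "__main__":
    print(" ".join(build_bwrap_command("/work/project", ["python3", "script.py"])))
```

On a Linux machine with Bubblewrap installed, the returned list could be passed to subprocess.run. Because the home directory is never mounted, files such as SSH keys are simply absent from the sandbox's view of the filesystem.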

The principle underneath all of it is simple and smart. If credentials never enter the sandbox, they can’t be stolen. It doesn’t matter whether the threat comes from a careless user, a model finding a “creative” path around the rules, or an outright attacker. Cut off the reach, and you cut off the risk.
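
One small way to apply that principle when launching a child process is to build its environment from an allowlist instead of inheriting everything. The sketch below is an illustration under stated assumptions, not a description of how Anthropic does it: the names ALLOWED_VARS, minimal_env, and run_without_secrets are hypothetical, and ANTHROPIC_API_KEY stands in for any secret held by the parent process.

```python
import os
import subprocess
import sys
from typing import Dict, List, Mapping, Optional

# Hypothetical allowlist: only these variables are copied into the child.
ALLOWED_VARS = ("PATH", "LANG", "LC_ALL", "TERM", "SYSTEMROOT")


def minimal_env(env: Mapping[str, str]) -> Dict[str, str]:
    """Return a new environment containing only allowlisted variables."""
    return {name: env[name] for name in ALLOWED_VARS if name in env}


def run_without_secrets(
    argv: List[str], env: Optional[Mapping[str, str]] = None
) -> subprocess.CompletedProcess:
    """Run argv with an allowlisted environment instead of the parent's."""
    base = os.environ if env is None else env
    return subprocess.run(
        argv, env=minimal_env(base), capture_output=True, text=True, check=False
    )


if __name__ == "__main__":
    parent_env = dict(os.environ, ANTHROPIC_API_KEY="sk-example")
    check = "import os; print('ANTHROPIC_API_KEY' in os.environ)"
    result = run_without_secrets([sys.executable, "-c", check], parent_env)
    print(result.stdout.strip())
```

An allowlist is used here rather than a denylist of secret-looking names, since a denylist fails open whenever a secret has an unexpected name. Environment scrubbing covers only one channel; credential files on disk need the filesystem boundary described above.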

Why This Matters Now

Agentic AI is moving fast from demo to daily tool. Claude Code writes and runs code on real machines. Cowork operates inside a desktop environment. Once you give a model the ability to execute commands, touch files, and make network calls, the security question stops being theoretical. A prompt injection or a misbehaving agent isn’t just an embarrassing output anymore. It’s a path to leaked secrets or a compromised system.

What stands out here is the honesty. Anthropic didn’t just describe what works. They documented risks they missed, including an exfiltration vector through api.anthropic.com/v1/files that Willison had covered previously. Admitting a real hole you patched builds more trust than any glossy security page. It signals these defenses were pressure-tested, not theorized.

This also lands during a broader industry shift. As agents gain autonomy, the competitive question is no longer just “how capable is the model.” It’s “can I deploy this safely inside my company.” Transparent containment is becoming a feature, not a footnote. Expect enterprise buyers to start asking competitors for the same level of detail.

The Open Source Angle

There’s a practical hook buried in the news. Anthropic maintains an open source tool called srt, the Anthropic Sandbox Runtime. Willison says the project is now “mature enough” that he’s ready to give it a proper try. That’s worth noting. The same containment thinking Anthropic uses internally is available for developers to inspect and adopt, rather than locked inside a proprietary product.

Practical Takeaways

For anyone building or deploying AI agents, a few clear lessons come out of this:

  1. Keep secrets out of the blast radius. If credentials never reach the agent’s environment, the agent has nothing to leak: stealing them is ruled out by design, not by policy.
  2. Match the boundary to the risk. A hosted chat needs a different cage than a tool that runs shell commands on your laptop. Anthropic uses gVisor for Claude.ai, OS-level process sandboxes (Seatbelt and Bubblewrap) for Claude Code, and full VMs for Cowork.
  3. Control egress, not just access. Limiting what an agent can send out is as important as limiting what it can touch.
  4. Demand documentation from your vendors. If a tool claims to be sandboxed but won’t explain how, treat that silence as a red flag.
  5. Look at srt. If you’re rolling your own agent infrastructure, an open source runtime from a frontier lab is a strong starting point.
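
The egress point in takeaway 3 can be sketched as a simple host allowlist check. This is a minimal illustration, not the mechanism any Anthropic product uses: the ALLOWED_HOSTS entries and the egress_allowed function are hypothetical, and a real deployment would enforce the policy in a proxy or network layer outside the agent's control instead of in code the agent could bypass.

```python
from typing import FrozenSet
from urllib.parse import urlsplit

# Hypothetical allowlist of hosts the agent may contact.
ALLOWED_HOSTS: FrozenSet[str] = frozenset({"pypi.org", "files.pythonhosted.org"})


def egress_allowed(url: str, allowed_hosts: FrozenSet[str] = ALLOWED_HOSTS) -> bool:
    """Allow only HTTPS requests whose host exactly matches an allowlisted name."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False
    # Normalize case and a trailing dot; exact match avoids suffix tricks
    # such as "pypi.org.evil.example".
    host = (parts.hostname or "").lower().rstrip(".")
    return host in allowed_hosts


if __name__ == "__main__":
    for candidate in (
        "https://pypi.org/simple/requests/",
        "https://pypi.org.evil.example/upload",
        "http://pypi.org/simple/",
    ):
        print(candidate, egress_allowed(candidate))
```

Host allowlisting is necessary but not sufficient. The api.anthropic.com/v1/files vector mentioned earlier suggests that a trusted host can still serve as an exfiltration channel, so what an agent is allowed to send matters as much as where it sends it.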

The larger signal is that security around AI agents is maturing from afterthought to design requirement. Labs that document their containment clearly will earn enterprise trust faster than those that hide behind buzzwords. Full details are available at the original source.
