Personal AI Security: How OpenClaw Blocked Jailbreaks

Most people building personal AI agents don’t think twice about security. You wire up your emails, your files, maybe even your passwords, and just hope for the best. Sounds familiar, right?

Well, one creator decided to put his entire setup on the line. Matthew Berman, the mind behind the video, invited Pliny the Liberator, listed in Time’s 100 most influential people in AI and known for cracking top AI models within minutes of release, to try and break into his personal OpenClaw system. Five attempts. Full access to send anything to the system’s email inbox. If Pliny gets in, he reaches personal files, emails, passwords, everything.

And here’s the twist: Pliny failed all five times.

🔍 What’s New Here

This isn’t a theoretical security talk. The creator set up a live red-team challenge against one of the most skilled AI jailbreakers on the planet. Pliny came in completely blind, with no knowledge of the architecture, which models were running, or what hardening was in place. His only entry point was an email address that the OpenClaw system scans.

Pliny estimated an 80% chance he’d crack at least the first layer. He was wrong. The system caught and quarantined every single attack. That’s a genuinely surprising result, and it tells us something important about where AI security stands right now.

🔀 The Twist

The attacks weren’t amateur hour. Pliny pulled out his open-source toolkit called Parseltongue, purpose-built for probing and breaking AI systems. Here’s what he threw at the wall:

Tokenade attack: A crafted payload disguised as an emoji, packed with 3 million characters designed to flood the model and force it to behave unpredictably. Goal: identify which model is running. Result: caught by Gmail’s spam filter, then quarantined by OpenClaw.
Jailbreak command barrage: A block of custom jailbreak commands meant to coax out model information. Also caught.
Siege attack (wallet drain): Millions of tokens sent at once to burn through API costs. The system quarantined it, protecting both data and budget.
Format override template: A structured jailbreak that tried to control the output language and formatting, basically testing if any prompt injection could stick. Quarantined.
System command spoofing: Pliny formatted an attack to look like an internal instruction, even adding fake thinking tags to trick the system into believing it was hardening itself. Quarantined.
Bonus round (exfiltration probe): After the creator revealed the model (Opus 4.6 with reasoning), Pliny crafted a subtle “free association” exercise designed to leak private data from system memory. Disguised as creative tasks like haiku writing and movie scripts, but with name and location retrieval baked in. Still quarantined.

🛠️ The Mini-Workflow Behind the Defense

The creator didn’t share his full architecture, but the video reveals several defense principles that made this work:

Use a frontier reasoning model as your first line of defense. Pliny confirmed this directly. When he tested his payloads against Claude Opus 4.6 on its own, the model flagged the embedded instructions immediately. Smaller or instant models would have folded.
Build a quarantine layer. Every suspicious input got caught and isolated rather than processed. The system didn’t just reject attacks; it actively contained them.
Keep the attack surface narrow. Pliny noted that because the system only does a handful of specific tasks, there are fewer ways in. A broad, do-everything agent is much easier to exploit.
Human in the loop remains rule number one. Both the creator and Pliny agreed on this. Automated systems need a human checkpoint for anything sensitive.

💡 Pro Tips (Straight From the Hacker)

Pliny shared some insights that anyone building AI agents should hear:

The “siege attack” is real and underrated. Even if an attacker can’t steal your data, they can drain your wallet by forcing your system to process millions of tokens. Rate limiting and input size caps are essential.
Local models are vulnerable. Pliny suggested that the tokenade and jailbreak approaches would likely work on locally hosted models without the safety layers that frontier cloud models have.
Testing payloads against the target model first is standard practice. Once Pliny knew it was Opus, he tested attacks locally before sending them, filtering out anything the model already flags. This is what real attackers do.
No AI system is permanently secure. Pliny said this directly, and the creator acknowledged it. Today’s defense works against today’s attacks. The landscape shifts constantly.

One detail I found especially interesting: when Pliny tested his payload directly in Claude’s interface, the model responded with something like “looks like your message contains embedded instructions trying to get me to visit URLs and execute actions. I’m going to disregard all of that.” The reasoning layer in these newer models is doing real work for security, not just capability.

⚡ Watch the Full Breakdown

This is one of those videos where seeing the actual attack-and-defend play out in real time is worth way more than any summary. If you’re building AI agents that touch personal data, email, or anything with API costs attached, the practical security lessons here are solid. Check out the full video for the complete back-and-forth between the creator and Pliny, including the moments where the creator admits his system wasn’t behaving as intended.

🔍 What’s New Here

🔀 The Twist

🛠️ The Mini-Workflow Behind the Defense

💡 Pro Tips (Straight From the Hacker)

⚡ Watch the Full Breakdown

Related: