AI Agent Security: Defend Against Adversarial Attacks

Somewhere out there, an AI is inventing new attack techniques to break into your AI agents. Not pulling from a known exploit catalog. Inventing brand new adversarial methods, on its own, at scale. This is what the Claudini paper (arxiv 2603.24511) documents, and u/willynikes from r/PromptEngineering dropped it this week alongside a defense approach they built and actually measured.

The paper describes an autoresearch pipeline that automatically discovers novel adversarial attack algorithms. It doesn’t rely on human researchers finding edge cases or cataloging known jailbreaks. It generates, tests, and refines new attack strategies in a continuous loop, targeting whatever model it’s pointed at. On hardened models where every known method sat below 10% success, this pipeline hit 40%. On Meta’s SecAlign 70B, a model specifically trained and designed for security alignment, transfer attacks achieved 100% success rate. A model built to resist attacks got completely dismantled by attacks invented by another AI.

Let that land for a second before we talk defense.

🔒 The attack surface hiding in your current setup

Most people building with Claude Code or any agentic framework right now have exactly zero active defense layer. There’s a skills file or a system prompt that tells the model what to do. Nothing tells it what NOT to do when it encounters adversarial content in a tool output, a fetched webpage, or an API response.

Think about what a typical agent actually does in a single run. It browses the web. Reads files. Calls external APIs. Processes retrieved documents. Every single one of those channels is a potential injection surface. A malicious instruction embedded in a webpage your agent fetches looks identical to a legitimate tool result unless something is actively filtering for it. The injection doesn’t need to be obvious either. It can live in a comment inside a webpage’s HTML, a hidden field buried in a JSON response, or a single line at the bottom of a retrieved document that your agent reads and acts on before anything else in your stack has a chance to intervene.

One commenter in the thread said it best: “so we built AI agents that can browse the web, read files, and call APIs and only NOW we are asking should we tell it what NOT to do.” That’s exactly the gap the Claudini results expose.

🛡️ What the poster built and what the numbers showed

🎯 An evaluated defense skill, not just a prompt. The original poster built a prompt injection defense skill and tested it using the same methodology Claudini uses for attacks: automated pipeline, binary pass/fail scoring, no subjective judgment. Three independent models judged results blind (Claude, Codex, and Gemini). Baseline resistance moved from 70% to 88%. That’s 18 percentage points of improvement, measured across 10 adversarial test cases with held-out data the defense skill had never seen before.
📉 18 points is bigger than it sounds when you do the math. At 70% resistance, your agent gets compromised roughly 3 out of every 10 attempts. At 88%, it’s barely over 1 out of 10. In production with real attack volume, that gap is the difference between your agent leaking system prompt details and API keys versus not. The stakes scale with the number of runs. If your agent processes thousands of external documents a week, even a small improvement in resistance rate translates directly to fewer successful compromises.
🔄 The evaluation method matters as much as the skill itself. The Claudini paper explicitly states that defense evaluation should incorporate autoresearch-driven attacks. A skill file that has never been tested against automated adversarial inputs is just untested guesswork. The poster’s framework evaluates defense against the same class of attacks that are actually being weaponized, using blind judging to remove scoring bias.

The framing the poster uses is smart: think of evaluated behavioral skills as antivirus for your AI stack. You don’t run production servers without a firewall. The same logic applies to any agent that is actively consuming external data from the web, files, or APIs. And just like antivirus, the defense is only as good as the threats it has been validated against. A signature database from two years ago won’t catch what’s being deployed today, and the same is true for a system prompt you wrote once and never stress-tested.

💡 Where to take this from here

The Claudini paper and the poster’s full eval report, including all 10 test cases and the blind scoring methodology, are both linked directly in the original Reddit thread. Even if you’re not ready to implement a full defense skill today, the evaluation framework is worth studying on its own. It’s a rigorous way to actually know whether your defenses work instead of just assuming they do.

The attacks are automated now. The poster’s argument is that defense needs to be too. Given what the Claudini results showed on a model literally built for security alignment, that’s not a theoretical concern anymore.

Head to the original post in r/PromptEngineering to grab both links and dig into the discussion. There’s more context in the comments on how the blind judging was run and what the test cases actually looked like.

Frequently Asked Questions

Q: What is prompt injection, and why should I care if I’m building AI agents?

Prompt injection happens when adversarial content in tool outputs, retrieved documents, or API responses tricks your model into ignoring its instructions. Since agents browse the web, read files, and call APIs, every one of those channels is a potential injection surface. If exploited, an attacker could leak your system prompt, steal API keys, or make your agent behave unexpectedly.

Q: Aren’t models like Claude already protected from prompt injection?

While Claude is more resistant than many models, research shows that even hardened models can be exploited by sophisticated, automatically-generated attacks. Base Claude without additional defense layers leaves you vulnerable, you need an explicit defense skill in your workflow.

Q: How much does a defense skill actually improve my resistance?

A properly-implemented defense skill can improve your resistance from 70% to 88%, reducing successful attacks from 3 in 10 attempts to barely over 1 in 10. In production, that’s often the difference between staying secure and getting compromised.

Q: What’s the difference between a defense skill and just telling my model “don’t get injected”?

Informally telling a model what not to do fails reliably. A defense skill uses evaluated behavioral instructions that are tested systematically, the same way attacks are tested. It’s the difference between hoping for security and having measured, repeatable protection.

Prompt Injection Defense 101: The Claudini Paper and Defense Hardening with skills.
by u/willynikes in PromptEngineering

🔒 The attack surface hiding in your current setup

🛡️ What the poster built and what the numbers showed

💡 Where to take this from here

Frequently Asked Questions

Related: