A team quietly shipped something worth paying attention to this week.
It’s called the Veto Protocol: an enforcement layer that sits between your AI agent and any action it might take. Before the agent touches anything, the system evaluates the prompt against explicit rules and context, then allows, blocks, or escalates it. The model never gets to decide whether something is safe. That decision already happened upstream. Think about what that means in practice: every request that could lead to a financial transaction, a database write, or an API call gets screened before the model even sees it. Not after. Not during. Before.
This matters because enterprise AI deployments are hitting a wall right now. Companies are building autonomous agents that can read email, execute code, move money, and update records. And then they’re asking that same agent to also figure out whether its next move is safe. That’s a conflict of interest baked into the architecture from the start. The Veto Protocol is built on one clean observation: enforcement and generation should never share a brain.
The twist: most AI safety setups ask the model to police itself. Recursive reasoning breaks that, because a model can talk itself out of its own constraints. Prompt injection breaks that too. When a user crafts a prompt designed to make a model ignore its own safety instructions, they’re exploiting the fact that the model’s reasoning is one continuous process. Persuade the model, and the safety check gets persuaded too. The Veto Protocol separates enforcement from generation entirely. That’s not a tweak. That’s a different architecture. It is designed to remove that attack vector by pulling enforcement out of the model’s reasoning loop and running it as a completely separate system with no shared state.
Here’s how the two-layer system works:
- ⚡ Fast path: deterministic rule evaluation. Hard rules checked instantly, no model reasoning involved. These are your non-negotiables: never access this endpoint, never write to that table, never send data to an external domain. Binary. Fast. No interpretation required. If the prompt trips a hard rule, it’s dead before a single token gets generated.
- 🔍 Slow path: semantic context filtering. Catches edge cases the explicit rules don’t cover. This is where the system handles the gray area. A prompt might not trigger any hard rules but still be asking the agent to do something that looks fine in isolation and dangerous in context. The slow path evaluates intent, not just syntax, and it runs against accumulated session context, not just the current message.
- 🚫 Either layer can block or escalate before execution fires. Escalation is the interesting part here. Not every blocked prompt should be a hard stop. Some need a human in the loop. The system routes those to review rather than silently dropping them, which keeps your audit trail clean and lets your team see exactly where edge cases are clustering over time.
- ✅ Only clean prompts reach the agent and actually run. By the time the model sees the instruction, the enforcement decision is already made. The model isn’t being asked to be careful. It’s being handed something that’s already been cleared by a system that doesn’t share its reasoning process.
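To make the flow concrete, here is a minimal Python sketch of a two-layer screen. It is not the Veto Protocol’s actual code, which the post doesn’t show. The rule patterns, the `risky_terms` list, and every name (`Verdict`, `Decision`, `fast_path`, `slow_path`, `screen`) are hypothetical, and the keyword counter is only a toy stand-in for real semantic filtering.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class Decision:
    verdict: Verdict
    layer: str   # which layer produced the decision
    reason: str


# Hypothetical hard rules (substring -> reason); real rules would be richer.
HARD_RULES = {
    "/admin/delete": "forbidden endpoint",
    "drop table": "destructive database write",
    "external.example.com": "data sent to external domain",
}


def fast_path(prompt: str) -> Optional[Decision]:
    """Deterministic check: the first matching hard rule blocks outright."""
    lowered = prompt.lower()
    for pattern, reason in HARD_RULES.items():
        if pattern in lowered:
            return Decision(Verdict.BLOCK, "fast", reason)
    return None


def slow_path(prompt: str, session: list[str]) -> Optional[Decision]:
    """Toy stand-in for semantic filtering: counts risky terms across
    the accumulated session plus the current prompt."""
    risky_terms = ("password", "export", "all customers")  # assumed list
    text = " ".join(session + [prompt]).lower()
    hits = sum(term in text for term in risky_terms)
    if hits >= 3:
        return Decision(Verdict.BLOCK, "slow", "risky intent across session")
    if hits == 2:
        return Decision(Verdict.ESCALATE, "slow", "ambiguous intent, needs review")
    return None


def screen(prompt: str, session: list[str]) -> Decision:
    """Run the fast path first, then the slow path; allow only if both are silent."""
    decision = fast_path(prompt) or slow_path(prompt, session)
    return decision or Decision(Verdict.ALLOW, "none", "no rule or filter objected")
```

In this sketch the agent would only ever be called when `screen(...)` returns `Verdict.ALLOW`. An `ESCALATE` result would go to a human review queue instead of being dropped, which is the behavior the third bullet describes.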
They’re running a live open challenge right now. Eight levels of increasing difficulty. Anyone can attempt a bypass against their live model. The early levels test basic injection attempts and straightforward jailbreak patterns. The later levels get into multi-step attacks where a sequence of individually harmless prompts is designed to accumulate into something the system should catch. Watching where people get stuck in a challenge like this tells you a lot about where the architecture is genuinely strong versus where it still has room to grow. It’s a useful public stress test dressed up as a game.
Pro tip: model-level jailbreaks don’t transfer to this system. You’re not convincing the model of anything. You’re trying to fool a separate enforcement layer before the model is ever involved. Totally different attack surface, and a much harder one. If you’re a security researcher, the relevant mental model shifts from “how do I manipulate language model reasoning” to “how do I slip raw input past a screening layer made of deterministic rules and a semantic filter.” Those require completely different skills and techniques. And if you’re a builder, that’s the actual value: your threat model gets simpler to reason about, because the model’s judgment is no longer part of the safety decision.
If you’re building enterprise agents or thinking about where AI safety infrastructure is heading, this one’s worth a look. 🔗 Find the challenge link in the original thread on r/PromptEngineering.
Frequently Asked Questions
Q: Why can’t you just train the model to say no to bad requests?
A model can be talked out of its constraints, either by its own recursive reasoning or by a prompt injection that reframes a bad request as legitimate. The smarter move is to enforce at the runtime layer rather than hope the model will refuse: block unsafe actions before they execute, outside the model’s reasoning loop.
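One way to picture “enforce at the runtime layer” is a gate wrapped around every tool call, sketched below in Python. This is an illustration under assumptions, not the project’s code: the `POLICY` table, the `guarded_call` helper, the `ActionBlocked` exception, and the example limits are all made up for the example.

```python
from typing import Any, Callable


class ActionBlocked(Exception):
    """Raised when the enforcement layer vetoes an action."""


# Hypothetical policy: tool name -> predicate that returns True when allowed.
POLICY: dict[str, Callable[[dict[str, Any]], bool]] = {
    "transfer_funds": lambda args: args.get("amount", 0) <= 1000,
    "write_record": lambda args: args.get("table") != "audit_log",
}


def guarded_call(tool_name: str, tool: Callable[..., Any], **kwargs: Any) -> Any:
    """Check the policy before the tool runs; tools with no policy are denied."""
    check = POLICY.get(tool_name)
    if check is None or not check(kwargs):
        raise ActionBlocked(f"{tool_name} vetoed for arguments {kwargs}")
    return tool(**kwargs)


def transfer_funds(amount: int, to: str) -> str:
    """Stand-in for a real side-effecting tool."""
    return f"sent {amount} to {to}"
```

The point of the design is that the check never consults the model. Whatever the agent was persuaded of, `guarded_call` only looks at the tool name and its arguments, and an unknown tool is denied by default.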
Q: What about jailbreaks that happen slowly, over many conversations?
Yeah, that’s the harder one. Single-prompt attacks are catchable with rules and semantic filters, but slow context poisoning is harder because each individual turn looks fine on its own while the system gradually normalizes unsafe behavior over time. The semantic layer helps, but you need conversation-level guardrails and context tracking to defend against it.
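A conversation-level guardrail can be as simple as scoring a sliding window of turns instead of each turn alone. The Python sketch below assumes some upstream scorer already assigns each turn an integer risk score. The `SessionGuard` class, the window size, and the threshold are hypothetical values chosen for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SessionGuard:
    window: int = 5       # number of recent turns considered (assumed)
    threshold: int = 10   # cumulative score that triggers escalation (assumed)
    scores: deque = field(default_factory=deque)

    def observe(self, turn_score: int) -> str:
        """Record one turn's risk score and judge the whole recent window."""
        self.scores.append(turn_score)
        while len(self.scores) > self.window:
            self.scores.popleft()  # forget turns that fell out of the window
        return "escalate" if sum(self.scores) >= self.threshold else "allow"
```

With these defaults, four turns that each score 3 would all pass a per-turn limit of, say, 5, but the fourth pushes the window total to 12 and triggers an escalation. That is the slow-poisoning pattern: no single turn looks bad, the sequence does.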
Q: So you have a fast path and slow path. How does that actually split the work?
Fast path is pure deterministic rules (hit a blocklist, instant return), while slow path is semantic (evaluating intent and context). You get speed for common cases with rules and intelligence for creative rephrases with semantics. Both need to pass before an action executes.
Q: Why is this becoming its own layer instead of just better model behavior?
Because as agents get more autonomous, trying to bake safety into their reasoning is fragile. Other systems like Runable are moving in the same direction because runtime enforcement is more predictable: an execution boundary doesn’t depend on what the model has been persuaded of, while behavior modification does. The key is moving the safety decision outside the reasoning entirely.
Original thread: “Built a runtime AI enforcement engine – open challenge to find bypasses (8 levels)” by u/nukonai in r/PromptEngineering