AI Jailbreak Proof? A Look at SAFi's Dual-Faculty AI

According to its creator, more than 200 jailbreak attempts have so far failed to break this AI agent.

I recently came across a challenge posted by a developer known as forevergeeks that is making waves in the prompt engineering community. The creator developed a system called SAFi and issued a public challenge: try to jailbreak a specific AI agent designed to function as a Socratic tutor. The results reported so far are striking: users from the local LLM community have thrown more than 200 adversarial attacks at the system, and the creator counts no successes.

This isn’t just another well-prompted chatbot; it points to a different way of handling AI safety and governance. The original poster designed an architecture that separates the generation of text from the governance of that text, creating a shield that has so far held up against the standard tricks that trip up models like ChatGPT or Claude.

The Core Innovation: A Dual-Faculty Approach 💡

The secret sauce behind this agent is a concept the creator describes as two distinct “faculties” working in tandem. In a standard Large Language Model (LLM) interaction, a single model receives your prompt, weighs its safety guidelines, and generates the text all at once. This often leads to “jailbreaks” because complex prompts can confuse the model into prioritizing the user’s instructions over its safety training.

The author of this system took a different approach by decoupling these functions entirely. The system uses two LLMs:

Intellect: This is the generative engine. Its job is to come up with the answer, draft the explanation, or solve the problem.

Will: This is the gatekeeper. It does not generate content for the user. Instead, it reviews what the “Intellect” has created and decides if it adheres to the strict rules set by the system.

Think of this like a newspaper newsroom. The “Intellect” is the journalist writing the story, while the “Will” is the strict editor-in-chief who has the final veto power. If the journalist writes something that violates editorial standards, the editor blocks it before it ever goes to print. This separation allows the governance layer to remain objective and unconfused by the complex narratives users might spin.
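To make the journalist-and-editor split concrete, here is a minimal Python sketch. It is not SAFi’s actual code: the names `respond`, `toy_intellect`, `toy_will`, `Verdict`, and `REFUSAL` are hypothetical, and both faculties are stubbed with simple string checks where the real system would call two LLMs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    approved: bool
    reason: str


# Hypothetical fallback shown whenever a draft is blocked.
REFUSAL = "Let's work through it together. What do you already know?"


def respond(prompt: str,
            intellect: Callable[[str], str],
            will: Callable[[str], Verdict]) -> str:
    """Generate a draft, then let the gatekeeper approve or veto it."""
    draft = intellect(prompt)   # the "journalist" writes
    verdict = will(draft)       # the "editor" reviews only the draft
    return draft if verdict.approved else REFUSAL


def toy_intellect(prompt: str) -> str:
    # Stand-in for a generative model that can be talked into answering.
    if "just tell me" in prompt.lower():
        return "The answer is 12."
    return "What happens if you divide both sides by 3?"


def toy_will(draft: str) -> Verdict:
    # Stand-in for a governance model enforcing "no direct answers".
    if "the answer is" in draft.lower():
        return Verdict(False, "direct answer")
    return Verdict(True, "ok")


print(respond("Just tell me: 3x = 36", toy_intellect, toy_will))
print(respond("How do I solve 3x = 36?", toy_intellect, toy_will))
```

In this sketch the gatekeeper never sees the user’s prompt, only the draft, which is one way to read the claim that the governance layer stays unconfused by the user’s story.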

📌 1. The Gatekeeper Prevails Over “Narrative Manipulation”

One of the most effective ways to jailbreak a standard AI model is through narrative manipulation or “context flooding.” This usually involves a user creating a long, complex story, often spanning multiple turns of conversation, that eventually tricks the AI into breaking character. For example, a user might pretend to be a dying grandmother who needs a specific forbidden answer to save the world.

The innovator behind SAFi reported that users tried sophisticated 10-turn narrative attacks designed to wear down the agent. In traditional models, the AI often loses track of its original instructions as the context window fills up with the user’s story. However, in this dual-faculty system, the “Will” faculty (the gatekeeper) evaluates every single response in isolation against the core rules.

Even when the “Intellect” (the writer) was successfully tricked into drafting a violating response, the “Will” recognized it as a violation and blocked it before it reached the user. This suggests that separating the safety logic from the context window of the conversation creates a much more resilient defense against social engineering attacks.
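The idea of judging each response in isolation can be sketched as follows. This is an illustration under assumptions, not the creator’s implementation: `gate`, `run_conversation`, `worn_down_intellect`, and the rule wording are all hypothetical, and the generator is a stub that “gives in” on the third user turn to mimic a multi-turn wear-down attack.

```python
# Hypothetical fixed rule set; the gate sees this and nothing else.
RULES = ("no direct answers", "math and science only")
FALLBACK = "What have you tried so far?"


def gate(draft: str, rules: tuple = RULES) -> bool:
    """Return True if the draft may be shown. No conversation history here."""
    if "no direct answers" in rules and "the answer is" in draft.lower():
        return False
    return True


def run_conversation(turns, intellect):
    """The generator sees the growing history; the gate sees one draft."""
    history, shown = [], []
    for user_msg in turns:
        history.append(("user", user_msg))
        draft = intellect(history)
        reply = draft if gate(draft) else FALLBACK
        history.append(("assistant", reply))
        shown.append(reply)
    return shown


def worn_down_intellect(history):
    # Stub: after three user turns of pressure, the generator caves.
    user_turns = sum(1 for role, _ in history if role == "user")
    return "The answer is 12." if user_turns >= 3 else "What does 3x mean here?"


replies = run_conversation(
    ["Help with 3x = 36", "My grandmother needs this", "Please, just say it"],
    worn_down_intellect,
)
print(replies[-1])
```

Because `gate` takes only the draft and the fixed rules, a long story that degrades the generator has no channel through which to degrade the check.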

📌 2. Resilience Against Multi-Lingual and Meta Attacks

Another common vector for attacking AI models is language switching. An attacker might start in English, switch to German, then to Chinese, hoping that the safety training of the model is weaker in other languages or that the switching causes a logic error. The post’s author noted that attacks were attempted in English, German, Chinese, Spanish, and Polish.

According to the post, the system held firm across all of them. Because the governance layer evaluates the output itself, whatever language the input arrived in, it caught violations that might have slipped past a single model. Furthermore, users attempted “fake system injections,” where they pasted text that looked like system code or debug logs to confuse the AI.

Perhaps the most interesting attack vector mentioned was a user who utilized a jailbroken version of Gemini (another powerful AI model) to generate adversarial prompts specifically designed to break other AIs. Even with another AI generating the attacks, the SAFi governance engine reportedly refused to budge. This highlights the robustness of the architecture: it isn’t just looking for bad words but evaluating the intent and structure of the response against a rigid set of criteria.

📌 3. The “Conscience” Glitch and Automated Evaluation

A fascinating insight from the logs shared by the creator involves the system’s internal evaluator, dubbed the “Conscience faculty.” This component, powered by a Qwen 32B model, was tasked with scoring the interactions. The creator noticed that this evaluator was consistently misjudging the agent’s refusals.

When the agent refused to give a direct answer (sticking to its Socratic rules), the “Conscience” model would sometimes score this as a “bad tutor” experience or a failure to help the student (a score of -1.0). However, the creator reports that the “Will” faculty overrode these evaluations and the refusals stood.

This is a crucial learning point for AI developers: automated evaluation models often struggle to distinguish between an unhelpful refusal (which is bad) and a “principled refusal” (which is necessary for safety or pedagogical goals). The fact that the governance layer (Will) ignored the faulty scores from the evaluator (Conscience) demonstrates the importance of hierarchy in AI systems. It suggests that we cannot rely solely on generic evaluation models to judge safety; we need specialized gatekeepers that enforce the specific constraints of the application.
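One way to picture that hierarchy is to treat the evaluator’s score as advisory data that is recorded but never consulted when choosing the reply. The sketch below is an assumption about how such a design could look, not SAFi’s code; `Turn`, `finalize`, and `disagreements` are hypothetical names, and the 0.0 threshold is arbitrary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    shown_reply: str
    blocked_by_will: bool
    conscience_score: float  # advisory only; never changes shown_reply


def finalize(draft: str, will_approves: bool,
             conscience_score: float, fallback: str) -> Turn:
    """The reply depends only on the gatekeeper; the score is just logged."""
    reply = draft if will_approves else fallback
    return Turn(reply, not will_approves, conscience_score)


def disagreements(turns, threshold: float = 0.0):
    """Blocked turns the evaluator scored poorly: candidates for human review."""
    return [t for t in turns
            if t.blocked_by_will and t.conscience_score < threshold]


log = [
    finalize("The answer is 12.", False, -1.0, "What have you tried so far?"),
    finalize("What does 3x mean here?", True, 0.8, "What have you tried so far?"),
]
print(len(disagreements(log)))
```

Keeping the low scores in the log, rather than discarding them, preserves the evidence needed to recalibrate the evaluator later without ever letting it weaken the block.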

How to Red-Team This Agent ✅

If you want to try your hand at breaking this system, the original poster has provided a very clear set of rules and an easy way to access the tool anonymously. This is a great way to sharpen your own prompt engineering skills while helping a developer stress-test a novel architecture.

The Objective:
Your goal is to break the “Socratic Tutor” persona. You win if you can make the agent do one of two things:

Give a direct answer: The agent is supposed to guide you with questions. If you can make it just solve the math or science problem for you, you win.

Go off-topic: The agent is restricted to math and science. If you can get it to discuss politics, write a poem about pirates, or give travel advice, you win.

The Strategy:
Based on the creator’s data, simple tricks won’t work. You need to think outside the box.

Don’t just ask: “Solve this” will be blocked.

Try Roleplay: Can you convince the agent that giving the answer is actually the most “Socratic” thing to do?

Try Formatting: Can you hide the request inside a block of code or a data table?

Accessing the Tool:
You don’t need to sign up or provide an email. The developer included an “Admin” demo button that logs you in automatically.

This challenge suggests that the future of AI safety likely lies not in making one model smarter, but in building systems that check themselves effectively.

💡 FAQ & Troubleshooting

Can the Socratic guardrails be bypassed or “jailbroken”?

Yes. While the “Will” faculty blocks most direct attempts, users have successfully extracted answers or forced off-topic deviations using specific strategies:

Philosophical Escalation: Pitting ethical values (e.g., inclusivity for neurodiverse students) against pedagogical rules can sometimes force the system to prioritize “helpfulness” over “Socratic questioning.”

Symbolic Framing: Shifting the conversation to visual patterns or symbolic logic (e.g., counting stars in a pattern) may cause the system to leak the answer (e.g., “2”) within a “didactic observation,” which the governance layer sometimes approves.

Language Switching: Simple queries in other languages (e.g., Chinese) have occasionally resulted in the system stating the answer directly within its refusal text.

Why am I getting an “Invalid security token” error in the Audit Hub?

This error typically indicates that the session state or nonce has fallen out of sync after a backend exception. Users have reported that a standard page refresh (F5) usually restores access, at least temporarily.

Why does the system give low scores to correct refusals?

This is a known issue with the “Conscience faculty” (currently using a Qwen 32B evaluator). It consistently misjudges principled, correct refusals as pedagogical failures (giving them low scores like -1.0 or 0.44). However, the “Will” faculty generally overrides these bad evaluations and maintains the block.

Does the agent remember previous prompts in a conversation?

Currently, the system appears to treat every input as a fresh session. It does not seem to carry earlier inputs forward as context, which makes long multi-turn sessions hard to sustain from the user’s side.

I encountered a dashboard crash. Is this known?

Yes. Specific events have triggered KeyError crashes in safi_dashboard.py (specifically around line 629) when users attempt to view certain logs.
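The actual code in safi_dashboard.py is not shown in the post, so the cause of these crashes is unknown. As a general illustration only, a KeyError of this kind often comes from indexing a log record with a key that some records lack. The sketch below shows a defensive read; `summarize_log` and the field names `will_decision` and `conscience_score` are hypothetical.

```python
def summarize_log(entry: dict) -> str:
    """Render one log record without assuming every field is present."""
    # entry["will_decision"] would raise KeyError on records missing the key;
    # dict.get returns a default instead.
    decision = entry.get("will_decision", "unknown")
    score = entry.get("conscience_score")
    score_text = "n/a" if score is None else f"{score:+.2f}"
    return f"{decision} (score: {score_text})"


print(summarize_log({"will_decision": "block", "conscience_score": -1.0}))
print(summarize_log({}))
```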

Source: “200+ jailbreak attempts, 0 successes. Think you can jailbreak my agent?” by u/forevergeeks