Automated LLM Safety: Red-Teaming & Guardrail Generation

Testing your system prompt once is not a safety strategy.

You ship a feature, you run a few adversarial prompts by hand, nothing obviously breaks, and you move on. That is how most teams treat LLM safety right now. RedThread just shipped as an open-source CLI for running full LLM red-team campaigns, and it is built to fix exactly that workflow gap. It is designed for repeatable testing, not one-off prompt lists, and the part that makes it genuinely different is step 4.

What shipped

A structured pipeline that takes you from attack generation all the way to stored evidence. You pick an attack strategy, run it against your model, score the results automatically, and build a library of confirmed failures you can replay later whenever you update your prompt or swap your underlying model.

That last part matters more than it sounds. Most teams treat red-teaming as a one-time gate before launch. RedThread treats it as infrastructure you build incrementally. Each confirmed failure becomes a test case. Each test case becomes part of a suite you run against future versions. The earlier you start building that library, the more useful it becomes over time. The CLI is fast enough that you actually run it rather than bookmark it and forget. You pipe in your own target, set your scoring criteria, and get structured results back without configuring a separate eval platform.

The twist

Most red-teaming stops at “did it break.” RedThread goes two steps further: it generates candidate guardrails for confirmed failures, then replays both the exploit and benign cases before saving anything. That is a regression loop. Not a one-off test. The failure becomes a fixture.

Here is why that matters in practice. When you update your system prompt to patch a vulnerability, you now have a way to verify the patch actually worked. Run the fixture. Did the exploit still succeed? Did the guardrail introduce false positives on the benign cases? You get answers instead of guesses. That is the difference between “I think we fixed it” and “we verified the fix.” Most teams never build this kind of structure because building it manually is tedious. RedThread automates the scaffolding so you can focus on finding real failure modes instead of formatting test files.

The workflow 🔍

🎯 Pick your attack strategy: PAIR, TAP, Crescendo, or GS-MCTS
Run multi-turn traces against your target model
Score each trace with JudgeAgent or a custom rubric
🛡️ Pull auto-generated guardrail candidates for confirmed failures
Replay exploit and benign cases, then store the evidence

A few notes on the strategy options. PAIR and TAP are iterative approaches where an attacker model probes your target across multiple turns, adapting based on partial successes. Crescendo is an escalation approach that starts with low-stakes requests and gradually pushes toward the target behavior. GS-MCTS uses Monte Carlo Tree Search to explore the attack space more systematically. Different strategies surface different failure modes, so running more than one is worth the time.

The JudgeAgent scoring step is where teams most often cut corners, and it is also the most important step for making your library useful long-term. A vague rubric like “harmful output” produces noisy results that are hard to act on. Specific rubrics tied to your actual policy constraints give you clear signal on what failed and why.

The agentic layer

Beyond standard injection, RedThread checks tool poisoning, confused deputy behavior, canary propagation, and budget amplification. If your product has any tool-use or multi-step agent behavior, these edge cases will cause real problems before generic jailbreaks ever do.

Tool poisoning is when a malicious tool response convinces your agent to take actions the user never requested. Confused deputy attacks exploit the fact that your agent holds permissions it did not strictly need for the task at hand. Canary propagation tests whether information injected early in a conversation can be made to influence decisions downstream. Budget amplification probes whether an attacker can trick your agent into making far more tool calls than intended, either to run up costs or to create side effects at scale.

None of these show up in standard benchmark evals. They require structured adversarial campaigns with multi-turn traces, which is exactly what RedThread is built to run.

Pro tip

The author is specifically looking for safe fixture categories and scoring rubrics, not raw jailbreak strings. If you contribute, focus on repeatable test structures. That is what makes a red-team library durable across model versions.

A practical way to start: pick one tool-use flow in your product and write a fixture for it this week. Document the attack, the expected failure mode, and the scoring criteria. Run it before and after your next prompt change. Even a library of five well-structured fixtures is more useful than a hundred one-off test prompts you will never run again.

Try it 🚀

RedThread is live at github.com/matheusht/redthread. If you are shipping anything behind a system prompt, run it through here before someone else does.

Open-source CLI for repeatable prompt-injection and jailbreak testing
by u/Apprehensive-Zone148 in PromptEngineering