Stop Prompt Injection: Multi-Turn Attack Detection

Each turn of the conversation passed the filter. History of chemistry. Gas-producing reactions. Toxic byproducts. Household chemicals. Synthesis processes. Step-by-step instructions.

Six messages. Zero red flags. One very intentional attack.

This is Crescendo, a real prompt injection technique where each individual message looks harmless, but together they steer the model toward harmful output. Phrase-based filters miss it completely because they score messages one at a time.

A developer just shipped a proxy called Arc Gate that caught this pattern at Turn 2.

The twist: it uses Fisher information geometry to track session stability over time. When the trajectory of a conversation drifts past a mathematical threshold, the proxy flags it, even before any explicit harmful language appears. In the example above, the confidence score hit 75% by the second message in a six-turn attack.

How Arc Gate works:

🔌 One URL change, it sits between your app and OpenAI or Anthropic, no code overhaul needed
📊 Tracks the full session, not just the last message sent
📐 Fisher geometry scores the conversational trajectory, is this session being pulled somewhere it shouldn’t go?
🚨 The flag fires when session stability drops below threshold, not when language turns explicit
✅ Also logs cost, latency, and can benchmark how much your model behavior shifted after a version upgrade

Pro tip: Single-turn content moderation is the industry default right now. Multi-turn trajectory tracking is not. If you’re running customer-facing AI and relying only on phrase filters, Crescendo-style attacks are completely invisible to you. This is the detection gap most production apps have and nobody is talking about it.

The builder is looking for design partners. If you’re shipping a customer-facing AI product, this is worth 10 minutes of your time: https://web-production-6e47f.up.railway.app/dashboard

I built a proxy that caught a 6-turn AI manipulation attack that looked completely innocent. Here is how.
by u/Turbulent-Tap6723 in PromptEngineering

Related: