Four Trust States Built to Stop LLM Agents From Getting It Wrong

This shipped quietly last week. A trust layer system for LLM outputs, four states, MIT license on GitHub. The person who built it runs DeFi risk infrastructure where a wrong answer doesn’t just look bad. It can liquidate a position worth six figures before anyone notices something went sideways. That’s the context. Not a toy project. Not a hackathon proof of concept. A system built under real financial consequences, then open-sourced so the rest of us can use it.

The problem it solves is one you’ve probably felt. LLM systems are built to always produce output. Uncertainty gets smoothed over. Silence feels like failure. So the agent guesses, sounds confident, and does the wrong thing. The downstream system receives that confident wrong answer and acts on it. No warning. No flag. No indication that the model was operating outside its reliable range. In financial systems that means liquidation. In legal systems that means liability. In medical contexts that means harm. The confidence is the bug, and nothing in a standard LLM pipeline is designed to catch it.

The system maps every output to one of four states:

  • 🔒 VERIFIED: confirmed directly from source, zero transformation
  • 🔗 CONSISTENT: derived deterministically from verified data, logic is auditable
  • 📊 ESTIMATED: approximate, always labeled, never treated as fact
  • 🚫 REFUSED: intentionally withheld when sources are unavailable, inconsistent, or unsafe to interpret
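One way to picture the four states in code is a small enum plus a wrapper type. This is a minimal sketch, not the project's actual implementation: the names `TrustState`, `LabeledOutput`, `reason`, and `is_fact` are made up for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TrustState(Enum):
    VERIFIED = "VERIFIED"      # confirmed directly from source, zero transformation
    CONSISTENT = "CONSISTENT"  # derived deterministically from verified data
    ESTIMATED = "ESTIMATED"    # approximate, always labeled
    REFUSED = "REFUSED"        # intentionally withheld


@dataclass(frozen=True)
class LabeledOutput:
    state: TrustState
    value: Optional[str]  # None when the output is REFUSED
    reason: str = ""      # hypothetical field: why this state was assigned

    def __post_init__(self) -> None:
        # A REFUSED output must not carry a value; every other state must.
        if self.state is TrustState.REFUSED and self.value is not None:
            raise ValueError("REFUSED outputs must not carry a value")
        if self.state is not TrustState.REFUSED and self.value is None:
            raise ValueError(f"{self.state.value} outputs need a value")

    def is_fact(self) -> bool:
        # Only VERIFIED and CONSISTENT may be treated as fact downstream.
        return self.state in (TrustState.VERIFIED, TrustState.CONSISTENT)
```

Making the state a required field means no output can reach a caller unlabeled, and the constructor check keeps a REFUSED result from smuggling a guess along with it.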

That last state is the whole design. REFUSED is not a failure. It is the system working correctly. Think about what it actually means to have an explicit REFUSED state versus what most pipelines do today, which is return a low-confidence answer with no label attached. The REFUSED state turns “I don’t know” from an embarrassing edge case into a first-class output. It gets logged. It gets monitored. It gets treated with the same operational respect as a successful response. That shift alone changes how teams think about what reliable AI output actually looks like. You stop optimizing for “always answer” and start optimizing for “answer correctly or say nothing.”

When the same pattern got applied to LLM agent pipelines, agent overreach dropped significantly. Agents that previously filled gaps with plausible-sounding guesses now had a defined path to say nothing instead. The principle transfers across domains: customer support agents, legal research tools, medical triage systems, financial forecasting pipelines. Any context where confident hallucination causes downstream damage benefits from having an explicit contract around output trust. The four states give your whole stack a shared vocabulary it never had before.

How to apply it in your own pipelines

  1. Label every output with one of the four states before it reaches downstream systems. Do this at the boundary layer, not inside the model prompt. The label lives in your code, not in the LLM’s response text.
  2. Define your REFUSED conditions upfront. What questions should your agent never attempt to answer? Write them down before you build. It is much harder to retrofit a REFUSED policy onto an agent that was designed to always respond than to build the contract in from day one.
  3. Surface the trust label in the actual output so callers know what they’re working with. A CONSISTENT answer and an ESTIMATED answer might look identical in plain text. The label makes the difference visible and actionable for whatever system consumes it next.
  4. Treat REFUSED responses as successful runs, not errors, in your monitoring. If your alerting fires every time an agent refuses to answer, your team will pressure you to make the refusals stop. That pressure is the enemy of reliability. Track refusals as feature behavior, not failure behavior.
  5. Audit your ESTIMATED outputs regularly. That’s where silent confidence creeps in. ESTIMATED answers are legitimate, but they need a review cadence and should never receive the same downstream trust as VERIFIED data. Schedule the audit the same week you ship.
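Steps 1 through 4 can be sketched as a thin boundary function that runs after the model responds. Everything named here is an assumption for illustration: the topic names in `REFUSED_TOPICS`, the `derived` flag (the caller asserting the answer was computed deterministically from verified inputs), and the bracketed label format.

```python
from collections import Counter
from enum import Enum
from typing import Optional, Tuple


class TrustState(Enum):
    VERIFIED = "VERIFIED"
    CONSISTENT = "CONSISTENT"
    ESTIMATED = "ESTIMATED"
    REFUSED = "REFUSED"


# Step 2: REFUSED conditions written down up front (hypothetical topics).
REFUSED_TOPICS = frozenset({"liquidation_price_without_oracle", "legal_advice"})

# Step 4: every state, REFUSED included, is counted as a normal outcome.
state_counts: Counter = Counter()


def label_at_boundary(
    topic: str,
    answer: Optional[str],
    source_value: Optional[str],
    derived: bool,
) -> Tuple[TrustState, Optional[str]]:
    """Step 1: assign the state in code, not in the model's response text."""
    if topic in REFUSED_TOPICS or answer is None:
        state, value = TrustState.REFUSED, None
    elif source_value is not None and answer == source_value:
        state, value = TrustState.VERIFIED, answer  # matches source exactly
    elif derived:
        state, value = TrustState.CONSISTENT, answer
    else:
        state, value = TrustState.ESTIMATED, answer
    state_counts[state] += 1
    return state, value


def render(state: TrustState, value: Optional[str]) -> str:
    """Step 3: surface the label so callers see what they are working with."""
    if state is TrustState.REFUSED:
        return "[REFUSED] no answer provided"
    return f"[{state.value}] {value}"
```

Anything that cannot be shown to match the source or to be a deterministic derivation falls through to ESTIMATED, so the default is the least trusted answer-bearing state instead of the most trusted one.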

Pro tip: The hardest part is cultural. Teams read silence from an AI as a bug. Reframe it. An agent that knows its limits is more reliable than one that always has an answer. Tie your REFUSED rate into your quality metrics. A system with zero REFUSED outputs is probably not calibrated correctly. It is either dealing with unusually clean inputs, which is rare, or it is papering over uncertainty with confident guesses, which is dangerous. A healthy REFUSED rate is a signal that your trust contract is actually working.
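A rough sketch of tying the REFUSED rate into quality metrics follows. The function names and the `min_sample` default of 200 are assumptions chosen for illustration, not values from the spec.

```python
from collections import Counter
from typing import Dict, List, Optional


def state_rates(states: List[str]) -> Dict[str, float]:
    """Share of each trust state among the logged outputs."""
    counts = Counter(states)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {state: n / total for state, n in counts.items()}


def calibration_warning(states: List[str], min_sample: int = 200) -> Optional[str]:
    """Flag a window with zero REFUSED outputs once the sample is large enough."""
    if len(states) < min_sample:
        return None  # too few runs to say anything
    if "REFUSED" not in states:
        return (
            f"zero REFUSED outputs in {len(states)} runs: "
            "check whether uncertainty is being papered over"
        )
    return None
```

The warning fires on the absence of refusals, not their presence, which is the inversion of ordinary error alerting that the tip argues for.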

Full spec is open source. Worth reading even if you’re nowhere near DeFi. The pattern applies anywhere a confident wrong answer causes real damage. The four-state model is clean enough to implement in an afternoon and specific enough to give your whole team a shared language for talking about LLM output reliability. That shared language, more than any specific code, is what actually prevents the wrong answer from reaching production.

👉 Full spec on GitHub (MIT): github.com/etb-protocol/boundary-contract

Frequently Asked Questions

Q: How is this different from standard confidence scoring?

Confidence scores tell you how sure a model is, but trust layers go deeper: they explicitly declare why you should (or shouldn’t) trust an output, whether it’s directly verified, derived from verified data, approximate with caveats, or intentionally withheld. The key difference is REFUSED. It’s not low confidence; it’s a principled decision to stop rather than guess when sources are missing, contradictory, or too risky to interpret.

Q: When do you choose REFUSED over ESTIMATED?

ESTIMATED is for partial data you can approximate with explicit warnings. REFUSED is for data that is fundamentally unavailable or unsafe to guess on. In DeFi (where wrong answers liquidate positions), refusing beats hedged guesses. Your choice depends on domain risk tolerance: the higher the cost of being wrong, the lower your bar for refusing.
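As a hedged illustration of that trade-off, the sketch below assumes a hypothetical `coverage` number (the fraction of required inputs actually available) and made-up thresholds per risk tier. A real system would define both for its own domain.

```python
# Hypothetical thresholds: minimum data coverage needed before estimating.
# The higher the cost of being wrong, the more coverage is required,
# so the system refuses more readily.
MIN_COVERAGE = {"low": 0.5, "medium": 0.8, "high": 0.95}


def estimated_or_refused(coverage: float, risk: str) -> str:
    """Pick ESTIMATED or REFUSED from data coverage and domain risk tier."""
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be between 0 and 1")
    if risk not in MIN_COVERAGE:
        raise ValueError(f"unknown risk tier: {risk}")
    return "ESTIMATED" if coverage >= MIN_COVERAGE[risk] else "REFUSED"
```

With these example numbers, 85% coverage is enough to estimate in a medium-risk domain but leads to a refusal in a high-risk one.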

Q: How does this work in multi-agent systems?

Use cross-model agreement: have multiple LLMs critique each other’s outputs. Claims that survive peer review get tagged VERIFIED or CONSISTENT; conflicting claims become ESTIMATED or REFUSED. You can also detect when refinement actually hurts quality (score regression) and return the best previous iteration instead of forcing more rounds.
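Both ideas can be sketched in a few lines. The agreement thresholds, the `confirmed_from_source` flag, and the `refine` and `score` callables are assumptions for illustration; the answer above does not specify how votes or scores are produced.

```python
from typing import Callable, List, Tuple


def tag_by_agreement(votes: List[bool], confirmed_from_source: bool = False) -> str:
    """Map reviewer votes (True = reviewer accepts the claim) to a trust state."""
    if not votes:
        return "REFUSED"  # nobody reviewed the claim
    agreement = sum(votes) / len(votes)
    if agreement == 1.0:
        # Unanimous; VERIFIED only if the claim was also checked against source.
        return "VERIFIED" if confirmed_from_source else "CONSISTENT"
    if agreement > 0.5:
        return "ESTIMATED"  # majority agrees, but there is conflict
    return "REFUSED"


def refine_until_regression(
    draft: str,
    refine: Callable[[str], str],
    score: Callable[[str], float],
    max_rounds: int = 5,
) -> Tuple[str, float]:
    """Refine a draft, stopping and keeping the best version on score regression."""
    best, best_score = draft, score(draft)
    current = draft
    for _ in range(max_rounds):
        current = refine(current)
        current_score = score(current)
        if current_score < best_score:
            break  # refinement made things worse: return the best so far
        best, best_score = current, current_score
    return best, best_score
```

Keeping VERIFIED behind the `confirmed_from_source` flag is a deliberate choice in this sketch: agreement between models alone is not the same as confirmation from source, so unanimous review defaults to CONSISTENT.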

Q: Won’t refusing output frustrate users?

Short-term, maybe. People want answers. Long-term, the opposite happens: honest refusal with reasoning builds far more trust than confident wrong answers. Users spot overreach and stop relying on the system. A system that knows its limits is more trustworthy than one that confidently does the wrong thing.

Title: I designed a trust layer system for LLM outputs: VERIFIED / CONSISTENT / ESTIMATED / REFUSED

by u/Fun_Mountain_5328 in PromptEngineering
