LLM Confidence Scores: SutniPrompt Reveals AI's Deeper Flaw

Yesterday a clever build shipped. An open-source system prompt framework called SutniPrompt hit v0.6.0-beta, and the headline feature is a mandatory confidence score before every final citation. The twist? The author caught it misfiring almost immediately. This is one of those situations where the attempt to solve a real problem reveals a deeper one, and that makes it worth paying attention to.

Here is what the project actually does first, because the confidence piece makes more sense in context.

What SutniPrompt Does

SutniPrompt is a system prompt framework built by u/sutnip that enforces a strict output contract on any LLM you drop it into. It strips pleasantries, locks the model into clean Markdown, blocks those hallucination walls you get when a prompt is too vague, and requires timestamps plus Wikipedia citations on factual claims. Think of it as a style guide your model cannot ignore.

In practice, this means your responses stop opening with “Certainly! Here’s what I found…” and start with the actual content. The citation requirement is particularly useful for factual queries because it forces the model to anchor claims to a retrievable source, even if that source is a Wikipedia article. It does not guarantee accuracy, but it creates an audit trail you can follow after the fact.

Version 0.6.0-beta adds one big thing: before the final citation block, the model must output a structured confidence score in the format confidence: X% ± Y%. The idea is to force self-assessment. Instead of the model just stating a fact and moving on, it has to publicly commit to how sure it actually is. That commitment is supposed to make overconfident responses visible before they ship.

The Twist

The author noticed something uncomfortable pretty fast. The confidence scores look convincing but have nothing real backing them. LLMs do not have calibrated internal probability distributions the way a Bayesian model does. When you ask the model to output confidence: 87% ± 5%, it generates that number the same way it generates everything else: by predicting what a plausible-looking response looks like. The community put it plainly: the metric mimics epistemic honesty without the substance.

This is worth sitting with for a second. A well-calibrated model would say “87%” when roughly 87 out of 100 similar claims it makes are actually correct. That requires tracking outcomes over time, comparing predictions to reality, and updating accordingly. None of that happens inside a single inference pass. The model has no memory of how its past confidence claims performed. It is generating a number that sounds reasonable, not one that reflects measured accuracy.

So the framework designed to fight overconfidence introduced a new flavor of it. Precise-looking numbers that signal rigor while measuring nothing. That is a genuinely interesting failure mode, and one that shows up more broadly whenever you ask a generative model to audit its own reasoning.

How to Try It Right Now

⚙️ Clone the repo: git clone https://github.com/sutnip/sutniprompt
Drop the system prompt into your preferred LLM interface or API call. It works with any model that follows system instructions, including GPT-4, Claude, and local models via Ollama
Run a factual query and read the confidence block that appears before the citation
Cross-reference the stated confidence against the actual source quality. A confidence of 91% backed by a stub Wikipedia article is a useful data point
📋 Note where the percentage feels disconnected from what the citation actually supports

Pro Tips

Use this version as a diagnostic, not a trust signal. The confidence block is useful for spotting where the model is hedging versus where it is performing certainty. When the score is high and the citation is thin, that gap is the signal worth chasing.
Pair it with a secondary fact-check pass on anything above 85%. High stated confidence with a weak citation is a red flag the framework surfaces but cannot prevent. Treat that threshold as your review trigger, not a pass/fail gate.
Watch the v0.7.0 release. The author is replacing numeric percentages with a qualitative discrete scale plus named uncertainty drivers. That approach has better grounding in how humans actually communicate uncertainty. Something like “low confidence, source is secondary” tells you more than “34% ± 12%” ever could.

The core problem SutniPrompt is poking at is real. LLMs present information with uniform confidence regardless of whether they are summarizing a peer-reviewed paper or confabulating a detail. A forcing function for self-assessment is a reasonable engineering response to that. The challenge is that the model generating the confidence score is the same model with the calibration problem. You are asking the thing that hallucinates to report on how often it hallucinates, in real time, during the hallucination.

v0.7.0 looks more promising precisely because it sidesteps the fake-precision trap. Named uncertainty drivers give you something to audit. A percentage gives you something that looks like math.

🔗 Full framework and docs at github.com/sutnip/sutniprompt. Worth a read even if you build your own prompting conventions, because the failure modes it is trying to solve are ones you are probably already hitting.

Frequently Asked Questions

Q: Are the confidence percentages actually real, or just convincing-looking fabricated numbers?

That’s a fair technical critique, LLMs don’t have true calibrated internal probability scores, so the percentages are estimates, not ground truth. But the real goal isn’t perfect mathematical calibration; it’s forcing a pause-and-reflect moment. By requiring the model to output a confidence metric, you’re pushing it to acknowledge its limitations rather than defaulting to overconfidence. It’s a design pattern for epistemic humility, not a probability guarantee.

Q: How does forcing confidence metrics compare to making the model list counterarguments or missing data first?

Both work, but they operate differently. Some users find numerical confidence more actionable for automated systems; others prefer explicit counterargument steps because they force deeper logical reasoning. The good news: you don’t have to choose. Many find combining them strongest, reflection *then* confidence rating creates a two-stage check.

Q: Won’t qualitative scales (High/Medium/Low) just shift overconfidence from numbers to words?

Solid point. Instead of asking the model to rate itself semantically, you could require it to explicitly list missing data points or assumptions before assigning any confidence level. That “check before you rate” approach is more robust than standalone judgment calls.

Q: Can I integrate SutniPrompt into my existing API or application?

Since it’s open-source, it depends on your setup. For cloud APIs (OpenAI, Claude, Gemini), you’d replace the system prompt with SutniPrompt’s framework. For local models, it’s more straightforward. The GitHub repo includes integration guides for common platforms and workflows.

LLMs are notoriously overconfident, so I updated my system prompt to force a statistical “Confidence Metric” (SutniPrompt v0.6.0-beta)
by u/sutnip in PromptEngineering

What SutniPrompt Does

The Twist

How to Try It Right Now

Pro Tips

Frequently Asked Questions

Related: