Forcing AI to show its work, not just its answer

A developer dropped reClaim on GitHub this week. It’s a system prompt framework that forces any frontier model to show confidence scores, source rankings, and unresolved contradictions alongside every answer.

The twist: before the model responds, it runs an adversarial check against its own conclusion. It actively tries to poke holes in what it just figured out. That’s the part most prompt builders skip entirely.

Most people building with AI have hit the same wall. You ask a model a complex research question, get back a confident, well-structured answer, and only discover three days later that one of the core “facts” was either outdated, contested, or quietly made up. The model didn’t hedge. It didn’t flag uncertainty. It sounded exactly like it does when it’s right. reClaim is a direct fix for that problem, built at the prompt layer so it works with whatever model you’re already using.

What’s Actually New Here

Standard AI responses sound confident even when the model is guessing. reClaim breaks that at the prompt level:

  • Every claim gets a 3-axis confidence score: Source Strength, Contradiction Resistance, Completeness, displayed as [A:xx B:xx C:xx → Overall]
  • Sources are tiered from Tier A (peer-reviewed research, government docs) down to Tier D (blogs, social media)
  • Contradictions between sources don’t get smoothed over. They get documented and explained separately
  • A mandatory internal scratchpad forces the model to reason before the answer surfaces

The 3-axis scoring is where this gets interesting. Source Strength measures how credible the underlying material is. A claim backed by a 2024 Nature study scores differently than one pulled from a Medium post. Contradiction Resistance measures how well the claim holds up when the model tries to break it. If three Tier A sources all agree, resistance is high. If one peer-reviewed paper says X and two others say the opposite, that score tanks and you see it immediately. Completeness captures how much of the relevant evidence space the model actually covered.

The source tier system is worth pausing on. Tier A is peer-reviewed research, government data, official documentation. Tier B is established institutions and major news outlets with editorial standards. Tier C is expert commentary and reputable secondary sources. Tier D is blogs, social posts, anonymous forum threads. When the model tells you its answer leans heavily on Tier C and D material, you know to treat it accordingly. When it’s all Tier A, you can lean on it with more confidence.

The internal scratchpad is the piece that actually changes output quality, not just the display. By forcing the model to reason step by step before surfacing an answer, you get cleaner logic chains and catch more errors before they become confident-sounding conclusions in your final output.

The Mini-Workflow

Four modes, pick your depth:

  1. 🔍 /short, quick answer + confidence score
  2. 📊 /standard, result + fact table + full evidence base
  3. 🔬 /deep, complete methodology + conflict resolution
  4. 🗺️ /deep+, everything above plus a Mermaid diagram of the evidence structure

Drop the system prompt into any model that supports system prompts. Works with ChatGPT, Claude, or whatever frontier model you’re using. English and German versions are both available.

The mode you pick should match what you’re actually doing with the output. /short makes sense for quick sanity checks, background context you’ll dig into separately, or situations where you just need a directional read. /standard is the default for factual research, content sourcing, or any time you’re building something others will read or rely on. /deep is for contested topics, anything with policy implications, or claims you’re about to repeat publicly where being wrong has real consequences. /deep+ is for analysis you need to share with a team, because the Mermaid diagram turns the evidence structure into something visual you can review, annotate, and discuss together.

Worth noting: the framework adds real length to responses. /deep+ on a complex question can run long. That’s a feature, not a bug, but match the mode to whether you actually need the full picture or just a directional read.

Pro Tip

Use /standard as your default for factual research. If the fact table shows a conflict, that’s your signal to run /deep on that specific claim only. You get the precision without wading through full methodology every time.

A second tip worth keeping: run /deep on the specific claim that matters most in your argument, not the whole research question. If you’re writing something that hinges on one key statistic or causal relationship, isolate that claim and run the deep methodology on it alone. Everything else stays at /standard. This keeps response length manageable while directing maximum scrutiny at the part where being wrong costs you most.

The full framework is open source on GitHub. Test it first on a question you already know the answer to. Watching reClaim score its own confidence on something you can verify is the fastest way to understand how useful it actually is. 🎯

I built a verification framework that forces AI to show confidence scores, source tiers, and unresolved conflicts — not just answers
by u/PlentyDiscount2073 in ChatGPTPromptGenius

Scroll to Top