Yesterday a small tool quietly dropped on Reddit. The problem it tackles is a familiar one; the twist in how it tackles it is what actually makes it interesting.
A developer kept hitting the same wall: a structured prompt works great, JSON extractions land clean, classification looks solid. Then a week later, the outputs start drifting. No error. No warning. No stack trace pointing at the problem. Just subtly different results from the same prompt on the same input, on a Tuesday afternoon when nothing changed and nobody touched anything. The automation still runs. The pipeline still completes. But somewhere downstream, a field comes back labeled wrong, a category shifts, a JSON key that was always present quietly disappears from one in ten responses.
This is the kind of bug that lives in production for weeks before anyone notices. By the time they do, it’s already corrupted data, misfired automations, or confused users who can’t explain what changed because, from their side, nothing did.
So the developer spent a weekend building a tiny v1 that runs the same prompt multiple times and highlights exactly where the outputs don’t match. No cloud dependency. No complicated setup. A focused single-purpose tool that does one thing and does it clearly.
Here’s the twist: it doesn’t check if the answer is correct. It checks if the answer is consistent.
Not an AI truth detector. Not a benchmark suite. Just a drift scope. And for anyone building automations on top of LLM outputs, that distinction matters more than it sounds. Correctness is a separate problem that requires ground truth, labeled datasets, and real evaluation infrastructure. Consistency is something you can test right now, with the prompts you already have, in the next fifteen minutes. If you get different JSON shapes from the same input across ten runs, you don’t need a benchmark to know you have a problem.
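To make "different JSON shapes" concrete, here is a minimal Python sketch of a shape check. It is not the Reddit tool's code, which this post doesn't show; `json_shape` and `shape_counts` are hypothetical names invented for illustration. The idea is to reduce each response to its structure (keys and value types, ignoring the actual values) and count how many distinct structures come back.

```python
import json
from collections import Counter


def json_shape(value):
    """Describe the structure of a parsed JSON value, ignoring leaf values.

    Hypothetical helper: the result is hashable so shapes can be counted.
    """
    if isinstance(value, dict):
        # Keys are unique, so sorting the pairs only ever compares key names.
        return ("object", tuple(sorted((k, json_shape(v)) for k, v in value.items())))
    if isinstance(value, list):
        # Array length is ignored; only the kinds of elements matter here.
        return ("array", tuple(sorted({json_shape(v) for v in value}, key=repr)))
    if isinstance(value, bool):  # must come before the number check
        return "boolean"
    if value is None:
        return "null"
    if isinstance(value, (int, float)):
        return "number"
    return "string"


def shape_counts(raw_outputs):
    """Count how many raw responses share each structure."""
    counts = Counter()
    for raw in raw_outputs:
        try:
            counts[json_shape(json.loads(raw))] += 1
        except json.JSONDecodeError:
            counts["<invalid JSON>"] += 1
    return counts


# Ten responses to the same input: eight complete, two missing "score".
outputs = ['{"label": "spam", "score": 0.9}'] * 8 + ['{"label": "spam"}'] * 2
print(len(shape_counts(outputs)))  # 2 distinct shapes, so there is drift
```

More than one entry in the counter means the structure is unstable, whatever the values say. No ground truth is needed for that conclusion.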
How to put it to work:
- 🔁 Pick a prompt you rely on for structured output: classification, JSON parsing, formatting. Start with the ones your automations depend on most, not the ones that feel safest.
- Run it through the tool 5-10 times against the same input. Ten runs give you a much clearer picture than five: a failure that hits one response in ten has roughly a 59% chance of staying hidden across five runs, and still about a 35% chance across ten.
- 🔍 Let it surface where outputs diverge across runs. Pay attention to structure changes, not just value changes. A field flipping from “true” to “false” is obvious. A field that appears in eight of ten responses and disappears in two is harder to catch manually.
- Rewrite the prompt until the variance collapses. This usually means tightening your output format instructions, adding a concrete example of the exact structure you expect, or breaking an overloaded prompt into two smaller focused ones.
- Ship knowing it’ll behave tomorrow the way it did today. That’s the actual goal here: confidence, not just hope.
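The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool from the Reddit post: `call_model` stands for whatever function sends a prompt to your model and returns raw text, and `collect_runs`, `field_report`, `unstable_fields`, and `make_fake_model` are hypothetical names. The report counts, per top-level JSON key, how often the key is present and how many distinct values it takes, which covers both the obvious value flips and the quieter disappearing fields.

```python
import itertools
import json
from collections import Counter, defaultdict
from typing import Callable


def collect_runs(call_model: Callable[[str], str], prompt: str, runs: int = 10) -> list[str]:
    """Call the model `runs` times with the identical prompt; keep the raw text."""
    return [call_model(prompt) for _ in range(runs)]


def field_report(raw_outputs: list[str]) -> dict[str, dict[str, int]]:
    """Per top-level key: how often it appears and how many distinct values it takes."""
    values_by_key: dict[str, Counter] = defaultdict(Counter)
    for raw in raw_outputs:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # an unparseable run counts as "missing" for every key
        if not isinstance(data, dict):
            continue
        for key, value in data.items():
            values_by_key[key][json.dumps(value, sort_keys=True)] += 1

    total = len(raw_outputs)
    report = {}
    for key, counter in values_by_key.items():
        present = sum(counter.values())
        report[key] = {
            "present_in": present,
            "missing_in": total - present,
            "distinct_values": len(counter),
        }
    return report


def unstable_fields(report: dict[str, dict[str, int]]) -> list[str]:
    """Keys that sometimes vanish or take more than one value."""
    return sorted(
        key for key, stats in report.items()
        if stats["missing_in"] > 0 or stats["distinct_values"] > 1
    )


def make_fake_model(responses: list[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a real API call: replays canned responses in order."""
    cycle = itertools.cycle(responses)
    return lambda prompt: next(cycle)


canned = (
    ['{"category": "billing", "urgent": true}'] * 7
    + ['{"category": "billing"}'] * 2
    + ['{"category": "account", "urgent": true}']
)
outputs = collect_runs(make_fake_model(canned), "Classify this ticket", runs=10)
print(unstable_fields(field_report(outputs)))  # ['category', 'urgent']
```

Swap `make_fake_model` for a function that calls your actual provider, then rewrite the prompt and rerun until `unstable_fields` comes back empty.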
Pro tip 1: This matters most for pipelines, not one-off queries. If a human reviews every output, drift is annoying. If your automation depends on consistent structure, drift is a silent production bug. The higher the downstream stakes and the less human review in the loop, the more critical consistent prompt behavior becomes. Think newsletter generation, lead classification, content moderation, data extraction pipelines. Anywhere a model’s output feeds directly into another system without a checkpoint.
Pro tip 2: Setting temperature to 0 reduces drift but doesn’t eliminate it, especially after model updates. A tool like this catches what temperature settings miss. Model providers push updates quietly. The model you were calling in February is not always the exact model you’re calling today, even if the name is the same. Prompts that were rock solid can develop variance after an update nobody announced. Regular consistency testing is one of the few practical ways to catch that before your users do.
Pro tip 3: Run this test before you build, not after. Most teams wire up a prompt, see it work a few times, and ship it. Test on the way in and you’ll write tighter prompts from the start. You’ll also build the habit of treating LLM outputs like any other external dependency: something you verify, not something you assume.
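One way to treat a prompt like any other external dependency is a small regression check you can rerun on a schedule. The sketch below is an assumption-laden illustration, not part of the original tool: `normalize`, `agreement_rate`, and `drifted` are hypothetical names, and the `min_agreement` threshold is a knob you would pick yourself. It compares a saved batch of outputs against a fresh batch for the same prompt and input.

```python
import json
from collections import Counter


def normalize(raw: str) -> str:
    """Canonicalize JSON so key order and whitespace do not count as drift."""
    try:
        return json.dumps(json.loads(raw), sort_keys=True, separators=(",", ":"))
    except json.JSONDecodeError:
        return raw.strip()


def agreement_rate(raw_outputs: list[str]) -> float:
    """Fraction of runs that match the most common normalized output."""
    if not raw_outputs:
        raise ValueError("need at least one output")
    counts = Counter(normalize(raw) for raw in raw_outputs)
    return counts.most_common(1)[0][1] / len(raw_outputs)


def drifted(baseline: list[str], current: list[str], min_agreement: float = 1.0) -> bool:
    """True if the majority answer changed or the new batch is less consistent than allowed."""
    baseline_top = Counter(normalize(raw) for raw in baseline).most_common(1)[0][0]
    current_top = Counter(normalize(raw) for raw in current).most_common(1)[0][0]
    return current_top != baseline_top or agreement_rate(current) < min_agreement


# Baseline saved when the prompt shipped; "today" is a fresh batch on the same input.
baseline = ['{"category": "billing", "urgent": true}'] * 5
today = ['{"urgent": true, "category": "billing"}'] * 4 + ['{"category": "billing"}']
print(drifted(baseline, today))  # True: one of five runs dropped a field
```

Reordered keys alone do not trigger the check, since `normalize` sorts them. A dropped field or a changed majority answer does.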
⚡ The builder is asking a blunt question: are you actually testing prompt stability, or just testing once and shipping? Most teams test once. They see it work, they move on, and then they wonder why the automation breaks three weeks later with no obvious cause and no error log to follow.
🛠️ Bookmark this space. Prompt consistency testing is one of the most underbuilt corners of the LLM tooling ecosystem, and weekend projects like this are where the real infrastructure starts. The eval tooling that serious teams rely on in two years is being sketched out right now by developers who hit the wall, got frustrated, and built the thing that didn’t exist yet.
Source: “Built a tiny tool this weekend after hitting an annoying LLM workflow problem.” by u/Organic_Release1028 in PromptEngineering