He Thought His Prompt Was Fine. It Scored a 2.32.

He tweaked a prompt. Looked at two outputs. Decided it “looked better.” Moved on.

Sound familiar? Most of us tune prompts the way we cook by vibes. Add something, taste it, shrug and say “close enough.” The problem is you never actually know what changed or why it worked. And when the dish turns out terrible for the third time, you’re back to square one with no idea what went wrong.

One engineer in r/PromptEngineering decided to fix that. He started scoring his prompts. His first baseline on a prompt he thought was solid? A humbling 2.32 out of 10.

Not a draft. Not a first attempt. A prompt he had already iterated on and felt good about.

🎯 Why This Matters More Than You Think

“It looks better” is not a metric. It’s a feeling. And feelings don’t scale.

When you’re building AI agents that handle dozens of edge cases, vibes-based evaluation will fail you. Not sometimes. Every time a weird input hits a blind spot you didn’t know existed.

Think about what that means in practice. You deploy a content generation workflow. It works great on your test articles. Two weeks later a user feeds in a product page with bullet points and pricing tables, and the output is a mess. You didn’t see it coming because you only tested with the inputs that were convenient for you, not the inputs that would actually show up.

The real discovery wasn’t just the score improving (it jumped to 7.86 in two iterations). It was seeing that the prompt failed the same three types of input consistently. Not randomly. Predictably. Once you can see the pattern, you can fix it. That shift from “something’s off” to “this specific input type breaks the format every time” is the difference between guessing and engineering.

🔧 How the Scoring Loop Works

  1. Write your prompt as a template with variables instead of hardcoded examples. If your prompt says “here is a product review about headphones,” replace that with a placeholder. Hardcoded examples teach the model to expect one kind of input, which is exactly how blind spots form.
  2. Build 5 to 10 test cases. Each needs an input and a description of what a good output looks like. Don’t just think “this output should be good.” Write down the specific criteria: length, tone, structure, what it must include, what it must avoid. Vague criteria produce vague scores.
  3. Run the prompt on all of them. Score each output 0 to 10 against your definition of “good.” Be consistent. If you’d give a 6 to a response that’s technically correct but too formal, give every similar response a 6. Inconsistent scoring turns your baseline into noise.
  4. Average the scores. That number is your baseline. Write it down. The engineer’s was 2.32. Whatever yours is, it’s honest information, which makes it more valuable than any gut feeling.
  5. Improve the prompt. Re-run. Compare. Change one thing at a time if you want clean signal. Change three things at once and your score might jump from 4.1 to 6.8, but you won’t know which change actually did the work. Now you’re back to vibes.

The loop sounds tedious until you realize you’ve stopped guessing. Now you’re engineering.
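
The five steps above can be sketched in a few lines of Python. This is a minimal illustration, not the engineer’s actual tooling: `PromptCase`, `run_baseline`, `generate`, and `score` are hypothetical names, and the two callables stand in for whatever model call and scoring method you use.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class PromptCase:
    """One test case (hypothetical structure, not from the original post)."""
    name: str
    variables: dict[str, str]  # values for the template placeholders
    criteria: str              # written description of a good output


def run_baseline(
    template: str,
    cases: list[PromptCase],
    generate: Callable[[str], str],
    score: Callable[[str, PromptCase], float],
) -> tuple[float, dict[str, float]]:
    """Fill the template per case, generate, score 0-10, and average."""
    per_case: dict[str, float] = {}
    for case in cases:
        prompt = template.format(**case.variables)  # step 1: template + variables
        output = generate(prompt)                   # step 3: run the prompt
        value = score(output, case)                 # step 3: score against criteria
        if not 0 <= value <= 10:
            raise ValueError(f"score for {case.name!r} is outside 0-10: {value}")
        per_case[case.name] = value
    # step 4: the average is the baseline; keep per-case scores for later
    return round(mean(per_case.values()), 2), per_case


if __name__ == "__main__":
    # Stand-ins so the sketch runs without any API; swap in real calls.
    def fake_generate(prompt: str) -> str:
        return prompt.upper()

    def fake_score(output: str, case: PromptCase) -> float:
        return 8.0 if "PRICE" in output else 3.0

    template = "Summarize this {kind} in two sentences:\n{text}"
    cases = [
        PromptCase("short review", {"kind": "review", "text": "Great."},
                   "two sentences, neutral tone"),
        PromptCase("product page", {"kind": "product page", "text": "Price: $20"},
                   "mentions the price"),
    ]
    baseline, per_case = run_baseline(template, cases, fake_generate, fake_score)
    print(baseline, per_case)
```

Step 5 is then just calling `run_baseline` again with the edited template and comparing the two numbers, along with the per-case scores.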

💡 Tips That Make This Way Faster

Use an AI as your scorer. The community spotted the obvious shortcut right away: ask one AI to evaluate the outputs of another. Feed it your test cases, your criteria, and the raw outputs. It handles the scoring. You handle the prompt changes. Cuts the manual work significantly. You can even ask the scorer to explain why it gave each score, which gives you a ready-made list of what to fix next.
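
One way to set this up, sketched under assumptions: the scorer is asked to reply in JSON so the score and its explanation can be parsed reliably. The prompt wording and the function names `build_judge_prompt` and `parse_judge_reply` are illustrative, and the actual call to a scoring model is left out because it depends on your provider.

```python
import json
import re


def build_judge_prompt(criteria: str, case_input: str, output: str) -> str:
    """Assemble the text sent to the scoring model (wording is an assumption)."""
    return (
        "You are grading the output of another model.\n"
        f"Criteria for a good output:\n{criteria}\n\n"
        f"Input given to the model:\n{case_input}\n\n"
        f"Output to grade:\n{output}\n\n"
        'Reply with JSON only: {"score": <integer 0-10>, "reason": "<one sentence>"}'
    )


def parse_judge_reply(reply: str) -> tuple[int, str]:
    """Extract (score, reason) from the scorer's reply, tolerating extra text."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge reply")
    data = json.loads(match.group(0))
    score = int(data["score"])
    if not 0 <= score <= 10:
        raise ValueError(f"score outside 0-10: {score}")
    return score, str(data.get("reason", ""))
```

The `reason` field is what turns the scorer into a to-do list: collect the reasons for your lowest-scoring cases and you have your next round of prompt changes.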

Write your scoring rubric before you run anything. It sounds backwards, but defining “what does a 9 look like” before seeing any outputs forces you to think clearly about what you actually want. Most prompt failures happen because the engineer never defined success precisely enough in the first place.

Don’t do this for everything. The engineer said it himself: not every use case needs this level of rigor. Save it for prompts that power real workflows or agents where bad outputs cost actual time or money. A one-off summarization you run twice a month doesn’t need a test suite. A customer-facing email generator that fires fifty times a day absolutely does.

Per-case failures are the real gold. Your average score matters less than knowing which inputs keep breaking. A 7.5 average with three recurring failure types is more useful than a clean 8.0 you can’t explain. The failure patterns tell you exactly where to put your next hour of work.
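
A rough sketch of how you might surface those patterns, assuming you tag each test case with an input type when you write it. The tags, the failure threshold of 5.0, and the function name `recurring_failures` are all illustrative choices, not something from the original post.

```python
from collections import defaultdict


def recurring_failures(
    results: list[tuple[str, float]],
    threshold: float = 5.0,
    min_count: int = 2,
) -> dict[str, list[float]]:
    """Group low scores by input-type tag and keep tags that fail repeatedly.

    results: (input_type_tag, score) pairs, one per test case.
    """
    failing: dict[str, list[float]] = defaultdict(list)
    for tag, score in results:
        if score < threshold:
            failing[tag].append(score)
    return {tag: scores for tag, scores in failing.items() if len(scores) >= min_count}


if __name__ == "__main__":
    results = [
        ("pricing table", 2.0),
        ("pricing table", 3.0),
        ("article", 9.0),
        ("very short", 4.0),
        ("article", 8.5),
    ]
    print(recurring_failures(results))
```

A tag that shows up here is a candidate blind spot: one input type that breaks the prompt predictably rather than randomly.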

🚀 Try It This Week

Pick one prompt you use regularly. Something that drives real output for you. Write five test cases, score the results, and see what number comes back.

Make at least two of those test cases slightly awkward inputs. An unusually short input. A messy one with inconsistent formatting. The kind of thing a real user would actually send. That’s where you’ll learn the most.

There’s a good chance it surprises you. And that surprise is exactly the data you need to build something that actually works.

Frequently Asked Questions

Q: When is this evaluation process actually worth the time?

Use it for anything production-grade or run repeatedly, especially AI agents where output quality really matters. Quick experiments? Skip it. But if you’re building something you’ll rely on, the upfront work of defining test cases and scoring pays off because you’ll see exactly which prompt changes move the needle, not just which ones “feel better.”

Q: How do I keep my scoring consistent so a 7/10 means the same thing throughout?

Write your rubric and examples before you score anything. Be explicit about what separates a 7 from a 5. Pro tip: some people shuffle the outputs so they don’t know which prompt version produced the one they’re scoring, which avoids unconsciously favoring one prompt. Consistency matters more than getting the “right” score.
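
Randomized scoring can be sketched like this. It is only one possible approach: `blind_score` is a hypothetical helper, and `score` stands in for you or an AI scorer judging a single output without seeing which prompt version produced it.

```python
import random
from statistics import mean
from typing import Callable, Optional


def blind_score(
    outputs_a: list[str],
    outputs_b: list[str],
    score: Callable[[str], float],
    seed: Optional[int] = None,
) -> dict[str, float]:
    """Score outputs from two prompt versions in shuffled order.

    The scorer receives only the text; the version label is kept aside
    and reattached after scoring. Both lists must be non-empty.
    """
    items = [("A", text) for text in outputs_a] + [("B", text) for text in outputs_b]
    random.Random(seed).shuffle(items)  # hide which version comes next
    scores: dict[str, list[float]] = {"A": [], "B": []}
    for version, text in items:
        scores[version].append(score(text))
    return {version: mean(values) for version, values in scores.items()}
```

Passing a fixed `seed` makes the shuffle repeatable, which helps if you want to re-check a scoring session later.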

Q: Can I use AI to evaluate the outputs instead of doing it manually?

Yes. One commenter suggested having one AI model score another’s output. It’s faster and less tedious at scale, especially if you have a clear rubric. The trade-off: AI scoring can be less nuanced than human judgment, but with a clear rubric it often tracks manual scores closely enough to be useful for high-volume evaluation.

Q: Why do prompts usually fail the same cases repeatedly?

Because the failures usually trace back to one weak spot in your prompt. Look for the pattern: what do those failing cases have in common? Unusual phrasing, ambiguous input, an edge case, or a specific format often points to that weak spot. Once you identify it, rewrite that section instead of tweaking everywhere. That usually gives bigger jumps in your score.

I stopped guessing whether my prompting was any good and started scoring it
by u/Old_Organization1183 in PromptEngineering
