Most people testing Midjourney prompts pick their favorite image from a batch and call it validated. That’s not a test. That’s a highlight reel.
One person in the Prompt Engineering community figured out why their presets kept failing in production, and built a scoring system that actually tells you whether a prompt works before you bet on it.
Here’s what most people do: generate four images, grab the best one, move on. Here’s why that’s broken: MJ gives you four images every time. If three of them are garbage, your prompt is garbage, regardless of what that one lucky render looked like. You’re not testing consistency. You’re testing whether luck cooperates once. And luck stops cooperating the moment you need it to perform on a deadline, for a client, or across fifty variations of the same scene.
The deeper issue is that cherry-picking trains you to trust the wrong signal. You start believing your prompt is strong when it’s actually just noisy. Then you build on top of it, layer more complexity into a shaky foundation, and wonder why things fall apart three projects later. The failure wasn’t sudden. It was baked in from the first “test.”
The Scoring Framework 🎯
The unit of evaluation is the batch, not the image. But you score each image individually against a checklist, then judge the batch as a whole. The criteria for each image:
- Figure count matches exactly
- Role clarity is readable at a glance
- Silhouettes separate cleanly
- Wardrobe distinctions hold
- Contact and distance between figures are correct
- Scene intent stays intact
Each criterion is there for a reason. Figure count is binary: either everyone in the prompt showed up or they didn’t. Role clarity means a viewer who hasn’t read your prompt can still tell who is who in under two seconds. Silhouette separation matters because merged shapes signal that the model is struggling with spatial relationships, which compounds as scene complexity increases. Wardrobe distinctions ensure that the visual logic you built into the prompt is actually rendering and not getting flattened into generic clothing. Contact and distance catch spatial drift early, before it creates problems in multi-image sequences where continuity matters. Scene intent is the catch-all: does this image actually depict what you asked for, or did the model substitute something adjacent?
Pass threshold: 3 out of 4 images pass, with zero figure-count failures. Why zero tolerance on figure count? Because one missing figure is a different problem than wardrobe drift. You need to track failure type, not just failure rate. A prompt that passes 3 of 4 images but drops a figure in the fourth is not a 75% success. It’s a category failure with a cosmetic pass rate.
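If you want the rubric to be mechanical rather than a judgment call, the pass rule fits in a few lines of Python. This is a rough sketch, not the original poster's tooling. The criterion names and both function names are made up for illustration, and it assumes an image passes only when all six checks hold, which the checklist implies but does not state outright.

```python
# Hypothetical criterion names that mirror the six-item checklist above.
CRITERIA = (
    "figure_count",
    "role_clarity",
    "silhouette_separation",
    "wardrobe_distinction",
    "contact_distance",
    "scene_intent",
)


def image_passes(scores: dict[str, bool]) -> bool:
    """Assumption: an image passes only if every criterion is met."""
    return all(scores.get(name, False) for name in CRITERIA)


def batch_passes(batch: list[dict[str, bool]], min_passing: int = 3) -> bool:
    """A batch needs `min_passing` passing images and zero figure-count failures."""
    figure_failures = sum(1 for s in batch if not s.get("figure_count", False))
    passing = sum(1 for s in batch if image_passes(s))
    return figure_failures == 0 and passing >= min_passing


ok = dict.fromkeys(CRITERIA, True)
wardrobe_drift = {**ok, "wardrobe_distinction": False}
dropped_figure = {**ok, "figure_count": False}

print(batch_passes([ok, ok, ok, wardrobe_drift]))  # True: 3 of 4, no dropped figures
print(batch_passes([ok, ok, ok, dropped_figure]))  # False: category failure
```

Both example batches have the same 75% pass rate. Only the failure type separates them, which is the whole point of tracking it.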
Test in a Gray Box First 🧪
Before running a preset against any real-world environment, run it in a minimal gray studio. Plain background. No props. No lighting drama. Featureless.
This eliminates a ton of noise: backgrounds that pull in extra figures, lighting that hides separation, environments that confuse the model. If your prompt can’t hold in a clean box, it’s not ready for a real scene. Think of it as a stress test in the opposite direction: instead of adding pressure, you strip everything away. What’s left is just the core instruction set. If figures merge, drift, or disappear in a gray void, the problem is in the prompt logic, not the environment. That’s actually good news because prompt logic is fixable. Environmental interference is much harder to isolate after the fact.
Running gray box tests also builds a reference baseline. Once you have a passing score in neutral conditions, you know exactly how much a given environment degrades performance. Some scenes drop a prompt from 4-of-4 to 2-of-4. That tells you the prompt needs reinforcement before you use it in a complex setting. Without the baseline, you’re guessing.
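One way to keep that baseline honest is to log passing-image counts per environment and compare them to the gray box number. A minimal sketch follows. The environment names and counts are invented for illustration, and reusing the 3-of-4 threshold as the "needs reinforcement" cutoff is an assumption, not something the original post specifies.

```python
def environment_drop(baseline: int, by_environment: dict[str, int]) -> dict[str, int]:
    """Passing images lost in each environment relative to the gray box baseline."""
    return {env: baseline - passing for env, passing in by_environment.items()}


baseline = 4  # passing images out of 4 in the gray studio
observed = {"night market": 2, "empty beach": 4}  # hypothetical scenes and scores

drops = environment_drop(baseline, observed)
needs_reinforcement = [env for env, passing in observed.items() if passing < 3]

print(drops)                # {'night market': 2, 'empty beach': 0}
print(needs_reinforcement)  # ['night market']
```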
How to Apply This Today
1. Pick your highest-priority MJ preset, the one you actually use in production
2. Run it 3 times in a neutral gray studio environment
3. Score every image in every batch against the checklist above
4. Count: how many batches hit 3-of-4 with no figure-count drops?
5. If it’s under 2 out of 3 batches, your preset isn’t validated; it’s just lucky
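The batch-level check above can be sketched the same way. Here each batch is reduced to two numbers you write down while scoring: how many images passed, and how many dropped a figure. The function name and the tuple layout are assumptions made for this example.

```python
def preset_validated(batches: list[tuple[int, int]], min_batches: int = 2) -> bool:
    """Each batch is (passing_images, figure_count_failures).

    A batch counts only if at least 3 images passed and no image dropped a figure.
    The preset is validated when at least `min_batches` batches count.
    """
    good = sum(
        1
        for passing, figure_failures in batches
        if passing >= 3 and figure_failures == 0
    )
    return good >= min_batches


print(preset_validated([(4, 0), (3, 0), (2, 1)]))  # True: 2 of 3 batches hold
print(preset_validated([(3, 1), (3, 1), (3, 1)]))  # False: a figure drops every run
```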
If a preset fails step 5, don’t discard it immediately. Look at which criterion failed most consistently. Figure count failures usually mean the scene description is overcrowded or the figure roles are ambiguous. Wardrobe failures often mean the style reference and the character description are pulling in opposite directions. Silhouette failures usually come from proximity cues that the model reads as overlap. Each failure type has a fix. But you can only find the fix if you were scoring in the first place.
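Finding the most consistent failure is just a tally. A small sketch using the standard library's `collections.Counter`, with made-up scores for four images and only two of the six criteria shown to keep it short:

```python
from collections import Counter


def failure_tally(images: list[dict[str, bool]]) -> Counter:
    """Count how often each criterion failed across all scored images."""
    tally = Counter()
    for scores in images:
        tally.update(name for name, passed in scores.items() if not passed)
    return tally


# Hypothetical scores, for illustration only.
images = [
    {"figure_count": False, "wardrobe_distinction": True},
    {"figure_count": False, "wardrobe_distinction": False},
    {"figure_count": True, "wardrobe_distinction": True},
    {"figure_count": False, "wardrobe_distinction": True},
]

tally = failure_tally(images)
print(tally.most_common())  # [('figure_count', 3), ('wardrobe_distinction', 1)]
```

Whatever lands at the top of that list is the criterion to fix first.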
This framework retired one preset entirely for this creator. It failed the figure-count test every single run regardless of how the prompt was worded. That’s information you can’t get from picking your favorite image. The lucky render was always there, one per batch, just enough to feel like progress. The scoring rubric made the pattern visible. Three runs, twelve images, zero passing batches on figure count. That’s not a prompt to iterate on. That’s a prompt to rebuild from scratch.
Vibes-based testing feels fast. It’s not. It just moves the failure downstream where it costs more. Build the rubric once and stop re-learning the same lessons with every new project.
Source: “Most MJ prompt testing is just vibes. Here’s what a scoring system looks like.” by u/jeffbradshaw in PromptEngineering