Picking an AI for creative explanations comes down to one thing: can it map a complex concept into a world that has nothing to do with it?
That’s the real test. Not accuracy on benchmarks, not response speed. Not how confident the model sounds or how long its answer runs.
A Redditor named u/zemzemkoko from r/PromptEngineering ran exactly this kind of experiment, using a custom NPC prompt to ask three models to explain quantum computing to a medieval blacksmith. The results reveal something genuinely useful about how each model handles creative constraint.
The Setup: What Makes a Good Analogy
A useful AI analogy does two things. It maps the core mechanic of the concept, not just the vibe, and it stays inside the world of the person you’re explaining to. A blacksmith doesn’t know electrons or quantum states. But they know iron, heat, and that moment before the hammer falls when metal can still become anything.
Those are the criteria here. How well does each model use only the tools that world provides? A weak analogy reaches outside the metaphor and borrows modern concepts the character wouldn’t have. A strong one finds the underlying physics hiding inside an experience the character already lives.
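The "stays inside the world" half of that check can be roughed out in code. The sketch below flags words the character could not know. The term list and both sample sentences are hypothetical, made up for illustration, and not taken from the original thread. Whether the analogy maps the real mechanic still takes a human reader.

```python
import re

# Hypothetical list of terms a medieval blacksmith would not know.
# This list is an illustrative assumption; adjust it to the world you describe.
OUT_OF_WORLD_TERMS = {"electron", "computer", "algorithm", "particle", "transistor"}


def frame_breaks(analogy: str, banned=OUT_OF_WORLD_TERMS) -> list:
    """Return the banned terms found in the analogy, sorted alphabetically.

    Matching is on exact lowercase words only, so plurals and
    paraphrases slip through. Treat an empty result as "no obvious
    frame break", not as proof the analogy is good.
    """
    words = set(re.findall(r"[a-z]+", analogy.lower()))
    return sorted(words & set(banned))


# Hypothetical sample analogies, written for this sketch.
in_world = "Iron in the fire has not yet chosen its shape."
out_of_world = "The iron behaves like an electron inside a computer."

clean_result = frame_breaks(in_world)        # no banned words present
broken_result = frame_breaks(out_of_world)   # contains two banned words
```

A word list like this only catches the crudest frame breaks, which is why the comparison that follows still relies on reading each answer.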
The Comparison
- Gemini: ⚔ “A cursed forge where the iron is both sword AND horseshoe”
- Claude: “An anvil that is somehow both hot AND cold until you touch it”
- GPT: 🔥 “Qubit = heated metal before the strike”
Breaking it down:
Gemini went for atmosphere. “Cursed forge” is vivid and memorable, but it relies on magic rather than physics. Superposition becomes a curse instead of a quantum state. The imagery lands emotionally, but a blacksmith who heard this would nod politely and go back to work. No real understanding transfers, just a feeling that quantum computing is spooky. That works well for storytelling and engagement. It works less well when the goal is comprehension.
Claude went for precision. The anvil being both hot and cold captures the paradox of superposition while staying inside the blacksmith’s world. It’s accurate to the underlying physics and doesn’t cheat by borrowing modern concepts. The limitation is that it stays abstract. You sense there’s a paradox, but you don’t have a concrete mental model to hold onto. There’s no action, no moment in time, nothing the blacksmith can picture themselves doing.
GPT landed the most useful framing. “Heated metal before the strike” isn’t just poetic; it’s mechanically correct. A qubit before measurement is like iron in that suspended moment before the hammer falls: still in a state of possibility, not yet one thing or another. The moment you observe it (measure it, in physics terms), it collapses into a definite state. A blacksmith would actually understand this, because they live that exact moment every time they work.
One commenter put it bluntly: “the qubit explanation is actually the most accurate one by accident.” It wasn’t an accident. It was GPT staying close to the physics while working inside the metaphor.
The Recommendation
For pure creative reframing, use GPT. Its answer maps mechanics, not just mood.
For explanations that need to feel immersive or emotionally resonant, Gemini is strong. Just expect some fantasy-novel energy when the subject gets technical.
For precision-inside-metaphor, Claude performs well. It’s the right call when your audience needs to understand the logic, not just feel surprised by it.
How to Replicate This Test
- Write a 2-3 sentence description of the NPC’s world (what they know, what technology exists, what doesn’t exist yet)
- Tell the model to explain your target concept using only things from that world
- Judge the output on two axes: does it stay inside the metaphor, and does it map the actual underlying mechanic? If the model breaks the frame or smuggles in a modern reference, mark it down.
Run the same prompt across all models without changing anything between tests. That’s the only way to get a fair read.
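The steps above can be sketched as a small harness. This is a minimal sketch under stated assumptions: the prompt wording is invented for illustration and is not u/zemzemkoko’s NPC prompt, and each model is represented by a hypothetical callable that takes a prompt string and returns a reply, which you would replace with a real API client.

```python
from typing import Callable, Dict


def build_prompt(world: str, concept: str) -> str:
    """Combine the NPC world description and the target concept.

    The wording here is an assumption, not the original NPC prompt.
    """
    return (
        f"You are speaking to someone from this world: {world}\n"
        f"Explain {concept} using only things that exist in that world."
    )


def run_test(
    models: Dict[str, Callable[[str], str]], world: str, concept: str
) -> Dict[str, str]:
    """Send the identical prompt to every model and collect the replies."""
    prompt = build_prompt(world, concept)  # built once, so every model sees the same text
    return {name: ask(prompt) for name, ask in models.items()}


def score(stays_in_frame: bool, maps_mechanic: bool) -> int:
    """Two-axis score: one point per axis, each judged by a human reader."""
    return int(stays_in_frame) + int(maps_mechanic)


# Hypothetical stand-ins that return canned text instead of calling a real model.
stand_ins = {
    "model_a": lambda prompt: "canned reply A",
    "model_b": lambda prompt: "canned reply B",
}
replies = run_test(
    stand_ins,
    world="A village forge with iron, coal, and hammers. No electricity exists.",
    concept="quantum computing",
)
```

Building the prompt once inside `run_test` is the point of the design: no model can receive a slightly different wording, which keeps the comparison fair.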
u/zemzemkoko’s original NPC prompt is available on request, and the full comparison with actual conversation logs is linked in the original Reddit thread.
Why This Matters Beyond the Test
The ability to explain complex ideas inside unfamiliar frames is one of the most valuable things an AI can do. Think customer education, onboarding docs, sales materials, training content. All of these require translating technical concepts for people who don’t share your background. A SaaS company explaining machine learning to non-technical buyers faces the exact same challenge as explaining quantum physics to a medieval blacksmith.
If you’re choosing an AI for that kind of work, this test tells you more than most standard benchmarks. It shows you how each model handles constraint, how it balances creative flair with conceptual accuracy, and whether it can tell the difference between sounding right and actually being right.
The full conversation and comparison results are in the original thread over at r/PromptEngineering. The NPC prompt is worth trying with your own concepts.
Frequently Asked Questions
Q: How accurate are these medieval analogies for actually learning quantum computing?
They’re fun entry points, not technical explanations. One commenter described GPT’s qubit explanation as the most accurate one “by accident,” while Gemini went “full fantasy novel.” Use these analogies to spark curiosity, but pair them with real technical resources if you want to actually understand quantum computing.
Q: Why do AI models respond so differently to the same creative prompt?
Each model has different training, architecture, and optimization priorities. That’s why Gemini leaned poetic, Claude used physics intuition, and GPT stayed grounded. The diversity is actually useful: it shows you each model’s characteristic style and helps you figure out which one fits your needs.
Q: Is the “Blacksmith Test” actually a good way to evaluate AI?
Creative prompts are great for exploring how models explain concepts and show off personality, but they’re not a substitute for formal benchmarks. Readers flagged a real issue: the post was too brief for real depth, while the full comparison felt overwhelming. Good evaluation needs balance: enough detail to learn something without making people hunt through pages.
Q: When should I use creative prompts versus benchmarks to pick an AI model?
Creative prompts show you personality and lateral thinking, which matter if you care about explanation quality. But if you’re evaluating for accuracy or technical tasks, also run formal benchmarks. Think of creative tests as a way to understand how models think, not just whether they’re right.
Source: “I asked 3 AI models to explain quantum computing like I’m a medieval blacksmith” by u/zemzemkoko in r/PromptEngineering