AI-Generated Hands Fail: New Data on Structural Errors

New data from a controlled test of ~1,000 images: only 20-25% of AI-generated hands are structurally correct. That means at least 3 out of 4 fail inspection. And the failures aren’t random. The researcher behind this, u/Driftline-Research on r/PromptEngineering, built a repeatable test to find out whether the “AI can’t do hands” thing is actually measurable. It is. The scale of the failure rate is what separates this finding from anecdote. This isn’t “sometimes AI struggles with hands.” It’s a documented, patterned breakdown across hundreds of outputs.

How the test worked

The setup was intentionally stripped down. No complex scenes, no artistic prompts. Just “hand” and “hand isolated”, same model, same settings, run hundreds of times. The point was to remove as many variables as possible and see what the model produces by default, at volume.

Starting with chairs to test structural stability, the team moved into hands and stayed there. What started as hundreds of images is now approaching 1,000. And the failure patterns kept showing up the same way, every time:

Extra fingers appearing where they shouldn’t
Merged or fused fingers that blur together into a single mass
Multiple hands appearing in a single output
Hands that pass a quick glance but break apart under close inspection

That last category is the most dangerous for production use. A hand that looks fine at thumbnail resolution but falls apart zoomed in will make it through a fast review and into published content. The “near-correct but wrong” failure is harder to catch than an obvious six-fingered disaster.

The researcher’s read on what’s happening: the model appears to be “switching between competing internal hand representations.” In other words, it doesn’t settle on one coherent picture of what a hand looks like. That would explain why errors repeat in recognizable patterns instead of failing randomly. The output doesn’t look like noise; it looks like a structural problem in how the model represents hands internally. For now this is a hypothesis based on the output patterns.

The team is now scoring outputs and tracking failure types to see whether prompt structure actually shifts those distributions in a measurable way. That data is still coming.
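A minimal sketch of what that kind of scoring could look like, assuming each reviewed image gets exactly one label. The label names and the `summarize` helper are hypothetical, not the researcher’s actual scoring system, and the sample list is placeholder data, not results from the test.

```python
from collections import Counter

# Hypothetical labels; the four failure types mirror the patterns listed above.
FAILURE_TYPES = ("extra_fingers", "merged_fingers", "multiple_hands", "near_correct")
PASS = "correct"


def summarize(labels):
    """Return each label's share of the reviewed images for one prompt variant."""
    if not labels:
        raise ValueError("no labels to summarize")
    unknown = set(labels) - set(FAILURE_TYPES) - {PASS}
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    counts = Counter(labels)
    total = len(labels)
    # Counter returns 0 for labels that never occurred, so every type is reported.
    return {label: counts[label] / total for label in (PASS, *FAILURE_TYPES)}


# Placeholder review labels for illustration only, NOT real results.
sample = [
    "correct", "extra_fingers", "merged_fingers", "near_correct",
    "extra_fingers", "correct", "multiple_hands", "extra_fingers",
]
shares = summarize(sample)
for label, share in shares.items():
    print(f"{label:15s} {share:.1%}")
```

Running `summarize` once per prompt variant and comparing the resulting shares is one simple way to see whether a prompt change moves the distribution or just reshuffles which failure shows up.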

3 practical ways to apply this

🖐 Build a structural review step into your workflow. If you use AI-generated images of people in your content, a visual pass at thumbnail size isn’t enough. Zoom in. Count fingers. Check joints and knuckles. “Looks plausible” is the wrong bar; “structurally correct” is the right one. The failure rate data makes clear how far apart those two are. A fast review pass catches the obvious failures. A proper structural check catches the ones that slip through and damage credibility.

✋ Increase prompt specificity to reduce failure surface. The research so far suggests prompt structure matters, although the scoring data that would confirm it is still being collected. “A single right hand, palm facing forward, five fingers extended, isolated on white background” gives the model far less room to guess than “hand.” The idea is to cut down the number of competing representations the model can land on. That narrows the failure distribution; it won’t eliminate failures.

🤚 If you’re generating at volume, plan for a 75-80% rejection rate. That figure is the baseline from the minimal-prompt test, so treat it as a planning number rather than a guarantee for your own model and prompts. Build a rejection checklist: extra fingers, merged or fused fingers, multiple hands in frame, near-correct hands that fail on inspection. Assign one person or one step in your pipeline specifically to hand QC if you’re producing content at scale. Build this in now rather than discovering it when you’re on deadline. If the structural hypothesis holds, the baseline failure rate isn’t going away until the models themselves improve.
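To put a number on that planning step, here is a rough sketch of the arithmetic, assuming the 20-25% pass rate from the test carries over to your setup (it may not). The function name and the target of 50 usable images are illustrative assumptions.

```python
def images_needed(target_usable, pass_rate_pct):
    """Expected number of generations needed to get `target_usable` keepers.

    `pass_rate_pct` is the share of structurally correct outputs, as a whole
    percentage (e.g. 20 for 20%). Integer math avoids float rounding surprises.
    """
    if target_usable < 0:
        raise ValueError("target_usable must be non-negative")
    if not 0 < pass_rate_pct <= 100:
        raise ValueError("pass_rate_pct must be in (0, 100]")
    # Ceiling division: round up, since you can't generate a fraction of an image.
    return (target_usable * 100 + pass_rate_pct - 1) // pass_rate_pct


# Assumed target: 50 usable hand images.
best_case = images_needed(50, 25)   # 25% pass rate
worst_case = images_needed(50, 20)  # 20% pass rate
print(f"Plan to generate between {best_case} and {worst_case} images")
```

This is an expected value, not a guarantee: any single batch can come in above or below it, so leave some headroom in both generation budget and review time.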

Tips and pitfalls

What the methodology gets right: Minimal prompts surface failure patterns faster than complex scenes because there are fewer variables to blame. Testing at 100+ images shows pattern consistency that small samples hide. And categorizing failure types early (extra fingers, merged fingers, multiple hands) makes scoring systematic instead of subjective. That structure is worth copying even if you’re running smaller-scale tests. It also makes it easier to communicate the problem to stakeholders who might otherwise dismiss it as isolated incidents.

What to watch out for: Don’t assume “looks right” means “is right.” Audiences notice broken hands even when they can’t articulate what’s wrong. The uncanny valley hits differently when the finger count is off. And don’t expect prompt fixes alone to solve the problem. If the researcher’s hypothesis holds, this is a structural model limitation, not a prompting gap. Better prompts can shift the distribution. They probably can’t fix it entirely.

The most useful finding: Consistent, patterned failure is actually more actionable than random failure. If AI hands failed randomly, you’d have no way to predict or filter it. The fact that the same failure modes keep appearing means you can build a reliable rejection checklist, and that’s a workflow you can use today.
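One way to make that checklist concrete is a small per-image record that a reviewer fills in after zooming in. This is only a sketch: the `HandReview` class and its field names are hypothetical, chosen to match the four failure modes described in this article.

```python
from dataclasses import dataclass


@dataclass
class HandReview:
    """One reviewer's checklist for a single generated image (hypothetical)."""
    extra_fingers: bool = False
    merged_fingers: bool = False
    multiple_hands: bool = False
    fails_zoomed_inspection: bool = False  # looked fine at thumbnail size only

    def reasons(self):
        """Names of every checklist item that was flagged."""
        return [name for name, flagged in vars(self).items() if flagged]

    def accepted(self):
        """An image passes only if nothing on the checklist was flagged."""
        return not self.reasons()


clean = HandReview()
flawed = HandReview(merged_fingers=True, fails_zoomed_inspection=True)
print(clean.accepted())   # True
print(flawed.reasons())   # ['merged_fingers', 'fails_zoomed_inspection']
```

Recording the reasons, not just the verdict, keeps the rejection data usable later if you want to see which failure modes dominate your own outputs.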

Follow the research

The scoring system and failure-type tracking are still in progress, so the deeper analysis of whether specific prompt structures shift failure distributions is still to come. The baseline finding is solid enough to act on now. Head to the original Reddit thread if you want to follow along as the data develops, and if you’re running similar tests yourself, the comments are worth adding to.

We ran ~1000 minimal-prompt hand tests — here’s what showed up
by u/Driftline-Research in PromptEngineering
