Why LLMs Fail at Reasoning: Neurosymbolic AI Works

A new study from Tufts University adds fresh evidence that large language models struggle with reasoning and planning, and that hybrid approaches mixing neural networks with classical AI techniques perform dramatically better. Marcus on AI highlighted the paper, connecting it to last year’s influential Apple reasoning study that exposed LLM weaknesses on the Tower of Hanoi problem.

The original Apple paper, The Illusion of Thinking, showed that LLMs could handle the Tower of Hanoi with a small number of disks but fell apart as complexity grew. The Tufts research picks up that thread and pushes it further in three important ways.

What the Researchers Did

Timothy Duggan, Pierrick Lorang, Hong Lu, and Matthias Scheutz tested Vision-Language-Action models (VLAs), a newer LLM variant increasingly used in robotics, on the same Tower of Hanoi tasks. They then compared those results against a neurosymbolic hybrid that pairs a neural network for perception with a symbolic planner for reasoning.
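To make the division of labor concrete, here is a minimal sketch of a perception-plus-planner pipeline for the Tower of Hanoi. It is not the Tufts implementation: the perceive function is a hypothetical stand-in for a neural perception module, the detections format is an assumption, and breadth-first search is used only as a simple example of a symbolic planner.

```python
from collections import deque

def perceive(detections):
    """Hypothetical stand-in for the neural perception module.

    `detections` maps disk size -> peg index (an assumed format).
    Returns a symbolic state: one tuple per peg, disks listed bottom to top.
    """
    pegs = [[], [], []]
    for disk in sorted(detections, reverse=True):  # larger disks sit lower
        pegs[detections[disk]].append(disk)
    return tuple(tuple(p) for p in pegs)

def legal_moves(state):
    """Yield (src, dst) peg pairs that obey the Tower of Hanoi rules."""
    for src, src_peg in enumerate(state):
        if not src_peg:
            continue
        disk = src_peg[-1]  # only the top disk can move
        for dst, dst_peg in enumerate(state):
            if dst != src and (not dst_peg or dst_peg[-1] > disk):
                yield src, dst

def apply_move(state, move):
    """Return the state reached by moving the top disk from src to dst."""
    src, dst = move
    pegs = [list(p) for p in state]
    pegs[dst].append(pegs[src].pop())
    return tuple(tuple(p) for p in pegs)

def plan(start, goal):
    """Breadth-first search; returns a shortest list of (src, dst) moves."""
    frontier = deque([start])
    parent = {start: None}  # state -> (previous state, move taken)
    while frontier:
        state = frontier.popleft()
        if state == goal:
            moves = []
            while parent[state] is not None:
                state, move = parent[state]
                moves.append(move)
            return moves[::-1]
        for move in legal_moves(state):
            nxt = apply_move(state, move)
            if nxt not in parent:
                parent[nxt] = (state, move)
                frontier.append(nxt)
    return None  # goal unreachable

start = perceive({1: 0, 2: 0, 3: 0})  # all three disks on the first peg
goal = perceive({1: 2, 2: 2, 3: 2})   # all three disks on the last peg
moves = plan(start, goal)
print(len(moves))  # 7
```

Because the planner searches over rules rather than remembered solutions, the same code handles a four-disk start state without any change; only the perception step has to report one more disk.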

The Results

The numbers tell a clear story:

3-block task: The neurosymbolic model hit 95% success. The best VLA managed just 34%.
4-block task (unseen during training): The neurosymbolic model generalized to 78% success. Both VLAs failed completely.
Energy efficiency: The hybrid approach used nearly two orders of magnitude less energy than the VLA models.

That last point matters more than it might seem. If AI systems are going to run in robotics and real-world applications at scale, energy costs aren’t academic. A gap of roughly 100x is the difference between practical deployment and burning through compute budgets.

Why This Matters

The study reinforces what the Apple paper argued last year: LLMs can fool you into thinking they’ve solved a problem when they’ve only memorized patterns for simple cases. Humans who learn the Tower of Hanoi develop generalizable strategies. LLMs don’t.
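The kind of generalizable strategy meant here can be written down in a few lines. The sketch below is the standard textbook recursion for the Tower of Hanoi, not code from either paper; the function name and peg numbering are illustrative choices.

```python
def hanoi(n, src=0, dst=2, aux=1):
    """Yield (src, dst) moves that transfer n disks from src to dst."""
    if n == 0:
        return
    yield from hanoi(n - 1, src, aux, dst)  # clear the n-1 smaller disks
    yield (src, dst)                        # move the largest disk
    yield from hanoi(n - 1, aux, dst, src)  # restack the smaller disks

for n in (3, 4, 10):
    print(n, sum(1 for _ in hanoi(n)))  # 7, 15, and 1023 moves
```

The same three-line rule solves every size, producing 2^n - 1 moves for n disks. A system that has only memorized move sequences for small cases has nothing comparable to fall back on when n grows.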

As Marcus on AI puts it, “LLMs are an efficient way to pattern recognition where perfect results are not required, but an inefficient way to reason a plan. Different tools for different jobs.”

This isn’t just a theoretical debate. If you’re building AI systems that need reliable planning, reasoning, or generalization, pure LLM approaches have a documented ceiling. The research suggests combining neural networks (great at perception and pattern matching) with symbolic planners (great at logical reasoning) produces systems that are both more capable and more efficient.

The Caveats

Marcus on AI is careful to note that neurosymbolic AI isn’t a magic fix. The Tufts model is purpose-built for this specific task. What the field really needs is a general-purpose system that can figure out which tools to deploy for any given problem. That doesn’t exist yet.

Marcus also points out that even an impressive current system like Claude Code 4.6 “still makes plenty of mistakes, and still can’t be trusted” as anything more than a tool. The honest assessment from practitioners he trusts: genuinely useful, but not a complete solution.

What Practitioners Should Take Away

Don’t assume LLM success on simple cases means it generalizes. Test at scale before trusting.
For planning-heavy applications, explore hybrid architectures that pair neural perception with symbolic reasoning.
Watch the energy math. An efficiency difference of roughly 100x compounds fast in production systems.
The research direction is clear. Multiple independent studies now point the same way: pure LLMs hit walls on reasoning tasks that hybrid approaches handle better.
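
The first takeaway can be turned into a small test harness. The sketch below is an illustration, not the evaluation protocol from the Tufts paper: memorized_solver is a hypothetical stand-in for a system that has only learned the 3-disk solution, and solves is a rule checker that replays a proposed move list.

```python
def solves(n, moves):
    """Replay moves on an n-disk puzzle; True only if legal and complete."""
    pegs = [list(range(n, 0, -1)), [], []]  # disks listed bottom to top
    for src, dst in moves:
        if not pegs[src]:
            return False  # nothing to pick up
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # larger disk placed on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n, 0, -1))

# A correct 3-disk solution, replayed regardless of the actual puzzle size.
MEMORIZED = [(0, 2), (0, 1), (2, 1), (0, 2), (1, 0), (1, 2), (0, 2)]

def memorized_solver(n):
    """Hypothetical solver that ignores n and returns the memorized moves."""
    return MEMORIZED

def success_by_size(solver, sizes):
    """Map each puzzle size to whether the solver's output solves it."""
    return {n: solves(n, solver(n)) for n in sizes}

print(success_by_size(memorized_solver, [3, 4, 5]))
# {3: True, 4: False, 5: False}
```

Evaluated only at three disks, this solver looks perfect. Sweeping the size parameter beyond the cases seen in training is what exposes the gap.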

The broader question Marcus on AI raises is about where the industry’s next trillion dollars should go. More scaling of LLMs, or investment in hybrid approaches that show stronger generalization? The evidence keeps tilting toward the latter. You can find the full Tufts paper and Marcus’s analysis at the original source.
