METR Benchmark: Why AI Breakthrough Hype Isn't Yet Real

A new METR benchmark dropped this week and the AI Twitter crowd lost its mind. The chart showed Anthropic’s upcoming Claude Mythos Preview hitting a 50% success rate on software tasks that would take a human roughly 16 hours, with a confidence interval stretching up to 55 hours. According to Marcus on AI, the reaction was a wave of panic posts claiming the graph had been “broken” and that superintelligence was around the corner. Gary Marcus says the panic is misplaced, and his argument is worth examining.

What METR actually measured

METR’s “time horizon” graph tracks the length of software engineering tasks that frontier models can complete, measured by how long the same tasks take human engineers. The progression looks dramatic on paper: one minute, then two, then four, then eight, doubling its way up to sixteen hours.

But Marcus flags two asterisks the panic crowd skipped past:

The 50% bar. The graph measures a 50% success rate, not 90, 95, or 99. The 80% version of the same chart looks far less impressive.
The narrow domain. These are software development tasks. Not general reasoning. Not watching a movie and discussing the plot. Not running a project that lasts months.

“The key problem with GenAI has been reliability,” Marcus writes, “and a graph that demands only 50% success does not address reliable performance. At all.”
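To see why the success threshold matters so much, here is a minimal Python sketch of how per-task success compounds across a multi-task workflow. It assumes each task succeeds independently with the same probability, which is a simplification not taken from METR or Marcus, and the function name `chain_success` is hypothetical.

```python
def chain_success(per_task_success: float, n_tasks: int) -> float:
    """Probability that n_tasks consecutive tasks all succeed.

    Assumes every task succeeds independently with the same
    probability (an illustrative simplification, not a METR claim).
    """
    if not 0.0 <= per_task_success <= 1.0:
        raise ValueError("per_task_success must be between 0 and 1")
    if n_tasks < 0:
        raise ValueError("n_tasks must be non-negative")
    return per_task_success ** n_tasks


if __name__ == "__main__":
    # Compare a 50% bar with stricter reliability levels.
    for p in (0.5, 0.8, 0.99):
        row = [round(chain_success(p, n), 4) for n in (1, 3, 5)]
        print(f"per-task success {p:.2f}: chains of 1, 3, 5 tasks -> {row}")
```

Under these assumptions, a model that clears each task half the time finishes three in a row only one time in eight, which is why a 50% chart says little about dependable end-to-end work.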

The trillion pound baby fallacy

Marcus coins a useful phrase for what’s happening here: the trillion pound baby fallacy. A baby doubles its weight in four months. Nobody assumes it will keep doubling until college. Yet that’s exactly the logic behind extrapolating METR’s curve into AGI, or projecting Anthropic to $2 trillion in revenue by 2030.

Exponential curves bend. They always do. The question is just when and why.
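The arithmetic behind the baby analogy is easy to check. The sketch below takes the four-month doubling time from the text; the birth weight of 7.5 pounds and the function names are assumptions added purely for illustration.

```python
import math

BIRTH_WEIGHT_LB = 7.5   # assumed typical birth weight, not from the article
DOUBLING_MONTHS = 4     # doubling time stated in the text
TRILLION_LB = 1e12


def weight_after(months: float) -> float:
    """Weight in pounds if doubling every DOUBLING_MONTHS never stopped."""
    return BIRTH_WEIGHT_LB * 2 ** (months / DOUBLING_MONTHS)


def months_to_reach(target_lb: float) -> float:
    """Months of uninterrupted doubling needed to reach target_lb."""
    return DOUBLING_MONTHS * math.log2(target_lb / BIRTH_WEIGHT_LB)


if __name__ == "__main__":
    months = months_to_reach(TRILLION_LB)
    print(f"Months to reach a trillion pounds: {months:.1f}")
    print(f"Weight at age 18 if the trend held: {weight_after(18 * 12):.3e} lb")
```

With the assumed birth weight, naive extrapolation crosses a trillion pounds a little after the child's twelfth birthday, well before college. The point is not the exact figure but how quickly a constant doubling time produces absurd numbers.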

Why the gains are happening

This is the part most takes miss. The recent jumps in benchmark performance aren’t coming from pure model scaling. Marcus argues they’re coming from symbolic tooling bolted onto the LLMs: code interpreters, verifiers, harnesses, formal checks.

That matters for two reasons:

It’s a quiet vindication of neurosymbolic AI, not proof that scaling alone keeps working.
These tools work best where formal verification applies cleanly. Coding and math. Not world modeling, not hallucination reduction, not the messier parts of human reasoning.

Ramez Naam’s separate analysis backs this up. On the broader ECI benchmark, Mythos sits roughly on trend, only slightly above GPT 5.4. No acceleration. The breakthrough vibe lives mostly inside the narrow METR frame.

What practitioners should take from this

A few practical reads for anyone building with or around frontier models:

Treat single-benchmark leaps with suspicion. Ask what success threshold was used and what domain it covers.
Reliability is still the wall. A 50% pass rate on long tasks means you still need humans in the loop, verification layers, or both.
Domain matters more than headline numbers. Coding agents will keep improving fast. General-purpose reasoning is a different curve entirely.
Budget for the bend. If your business model assumes capability doubling every few months for the next five years, you’re underwriting a baby that weighs a trillion pounds.
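The "verification layers" point can be made concrete with a rough sketch. It assumes a perfect verifier that always recognizes a correct result and that repeated attempts are independent; neither assumption comes from the article, and real systems will do worse on both counts. The function names are hypothetical.

```python
import math


def success_with_retries(p: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries succeeds."""
    return 1 - (1 - p) ** attempts


def attempts_needed(p: float, target: float) -> int:
    """Smallest number of independent tries so that at least one succeeds
    with probability >= target, assuming a perfect verifier picks it out."""
    if not 0 < p < 1 or not 0 < target < 1:
        raise ValueError("p and target must be strictly between 0 and 1")
    return math.ceil(math.log(1 - target) / math.log(1 - p))


if __name__ == "__main__":
    for target in (0.9, 0.99):
        n = attempts_needed(0.5, target)
        print(f"target {target}: {n} attempts, "
              f"achieved {success_with_retries(0.5, n):.4f}")
```

Treat the result as an optimistic lower bound on cost. Failures on long tasks tend to be correlated, and a verifier that misses bad output or a human reviewer with limited time both push the real number of attempts higher.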

Marcus has been making the bear case on pure scaling for years. His “deep learning is hitting a wall” essay gets cited often, and usually unfairly: it was about the limits of scaling, not the limits of capability. The distinction holds up here. Tools and hybrids are doing real work. Scaling alone is not.

The headline read of the METR graph is that AI is about to swallow software engineering whole. The careful read is that benchmarks measuring 50% success on narrow tasks tell us less than the chart suggests, and that the path forward likely runs through hybrid systems rather than ever-bigger models.

Full breakdown at the original source on Marcus on AI.
