New data: the best AI models in the world score less than 1% on ARC AGI 3, while humans score 100%. That’s not a typo. The gap between human and machine intelligence has never been this stark on a single test.
This isn’t just another leaderboard refresh. It’s a fundamental shift in how we measure progress toward AGI. The creator of this video, Matthew Berman, breaks down exactly what ARC AGI 3 is, why it matters, and why every frontier model basically faceplants on it.
So what’s ARC AGI? It stands for Abstraction and Reasoning Corpus for Artificial General Intelligence. The video calls it the only major benchmark that AI models haven’t saturated, and while the earliest version is now close to that point, the newest one is nowhere near it. The whole point is to test generalization, the “G” in AGI. Can a system take a tiny bit of learning and apply it broadly? That’s the question.
📊 A quick history: ARC AGI 1 and 2
The original ARC AGI 1 was straightforward. You’d see a few visual pattern examples, figure out the rule, and apply it to a new case. Think colored squares where you spot what’s missing and fill in the blank. Easy for humans. Hard for AI. But over time, models caught up. The leaderboard shows top models hitting 93-94% now, nearly saturated.
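To make the "see a few examples, find the rule, apply it" loop concrete, here is a minimal Python sketch. It is a toy illustration and not the benchmark's actual tasks or any model's method: the grids, the three candidate rules, and the `infer_rule` helper are all made up for this example. Real ARC puzzles draw on a far larger and more varied space of rules, which is exactly why a fixed menu like this one would not get far.

```python
from typing import Callable, Dict, List, Optional, Tuple

Grid = List[List[int]]  # each integer stands for a cell color

def flip_horizontal(grid: Grid) -> Grid:
    """Mirror every row left to right."""
    return [list(reversed(row)) for row in grid]

def flip_vertical(grid: Grid) -> Grid:
    """Reverse the order of the rows."""
    return [list(row) for row in reversed(grid)]

def transpose(grid: Grid) -> Grid:
    """Swap rows and columns."""
    return [list(col) for col in zip(*grid)]

# Hypothetical, deliberately tiny hypothesis space.
CANDIDATE_RULES: Dict[str, Callable[[Grid], Grid]] = {
    "flip_horizontal": flip_horizontal,
    "flip_vertical": flip_vertical,
    "transpose": transpose,
}

def infer_rule(examples: List[Tuple[Grid, Grid]]) -> Optional[str]:
    """Return the first candidate rule consistent with every example pair."""
    for name, rule in CANDIDATE_RULES.items():
        if all(rule(inp) == out for inp, out in examples):
            return name
    return None

# Two demonstration pairs (input grid, output grid).
examples = [
    ([[1, 0], [2, 3]], [[0, 1], [3, 2]]),
    ([[4, 4, 0], [0, 5, 6]], [[0, 4, 4], [6, 5, 0]]),
]

rule_name = infer_rule(examples)
test_input = [[7, 0, 0], [0, 8, 9]]
prediction = CANDIDATE_RULES[rule_name](test_input)
print(rule_name, prediction)  # flip_horizontal [[0, 0, 7], [9, 8, 0]]
```

The point of the sketch is the shape of the problem: very few examples, a rule that has to be inferred rather than looked up, and a single new case to get right.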
ARC AGI 2 cranked up the difficulty. Same concept but way more complex patterns. The best performer there is GPT 5.4 Pro Extra High at 72%, costing $39 per task. Claude Opus 4.6 sits at 68%, Gemini 3.1 Pro at 69%. Solid numbers, but still nowhere near 100%. And those costs add up fast.
Then came version 3. And everything changed.
📊 ARC AGI 3: the interactive twist
Here’s where it gets wild. ARC AGI 3 ditches static puzzles entirely. Instead, you get dropped into a mini video game. No instructions. No hints. No examples. You just have to figure out what’s going on and solve it within a limited number of moves.
The video walks through an actual playthrough. You see a grid-based maze with a character, a yellow bar (turns out it’s a move counter), a small reference image in the corner, and various interactive elements. The player has to deduce through trial and error that the goal involves navigating to a specific spot, but first interacting with a plus symbol to change the orientation of something on screen. It took a human about a minute to figure it out.
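A rough way to picture that kind of game in code is below. This is a hypothetical toy, not the real ARC AGI 3 environment or its API: the `CorridorGame` class, its cell layout, and both policies are invented for illustration. It keeps only the two ingredients described above, a move counter and a hidden rule that the goal only counts after the "plus symbol" has been touched.

```python
from dataclasses import dataclass

@dataclass
class CorridorGame:
    """Hypothetical 1-D corridor with an unstated rule: touch the switch first."""
    length: int = 6
    start: int = 2
    switch_cell: int = 0   # stands in for the plus symbol
    goal_cell: int = 5
    max_moves: int = 12    # stands in for the yellow move-counter bar

    def __post_init__(self) -> None:
        self.position = self.start
        self.switch_on = False
        self.moves_left = self.max_moves
        self.solved = False

    def observe(self) -> dict:
        """What the player can see; the rule itself is never shown."""
        return {
            "position": self.position,
            "switch_on": self.switch_on,
            "moves_left": self.moves_left,
            "solved": self.solved,
        }

    def step(self, action: str) -> dict:
        """Apply 'left' or 'right', spend one move, and return the new view."""
        if self.solved or self.moves_left == 0:
            return self.observe()
        delta = {"left": -1, "right": 1}[action]
        self.position = min(max(self.position + delta, 0), self.length - 1)
        self.moves_left -= 1
        if self.position == self.switch_cell:
            self.switch_on = True
        if self.position == self.goal_cell and self.switch_on:
            self.solved = True
        return self.observe()

def run(policy, game: CorridorGame) -> dict:
    """Play until the game is solved or the move budget is spent."""
    obs = game.observe()
    while not obs["solved"] and obs["moves_left"] > 0:
        obs = game.step(policy(obs))
    return obs

def always_right(obs: dict) -> str:
    """Heads straight for the goal and never tries anything else."""
    return "right"

def react_to_change(obs: dict) -> str:
    """Explores left until something visibly changes, then heads for the goal."""
    return "right" if obs["switch_on"] else "left"

stuck = run(always_right, CorridorGame())
explored = run(react_to_change, CorridorGame())
print(stuck["solved"], explored["solved"])  # False True
```

The first policy burns its whole move budget pushing against the goal. The second one is hardly clever, but because it goes looking for a change in the observation and responds to it, it finishes with moves to spare.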
Now here’s the brutal part. The frontier model results:
- GPT 5.4: 0%
- Gemini 3.1 Pro Preview: 0%
- Grok 4.2: 0%
- Claude Opus 4.6: 0%
- Humans: 100%
The top-scoring model overall is GPT 5.4 High at 0.3%, and it costs over $5,000 per task to get there. The video shows GPT 5.4 attempting one of these games, and it just keeps repeating the same moves, never thinking to interact with the plus symbol. It lacks the intuitive leap that feels obvious to any human who’s ever played a video game.
📊 Three practical takeaways from this benchmark
- Cost efficiency matters as much as accuracy. ARC AGI has always tracked cost per task alongside score. Throwing unlimited compute at a problem isn’t intelligence. The benchmark rewards systems that can learn efficiently from minimal information, exactly what real-world AGI would need to do.
- Average humans beat elite AI here. Most benchmarks pit AI against the world’s best coders, mathematicians, and scientists. ARC AGI flips that. Any regular person can solve these puzzles. The fact that AI can’t tells us something important about the gap between pattern matching and genuine reasoning.
- Interactivity exposes a core weakness. Static puzzles let models use brute-force search and chain-of-thought prompting. But interactive environments require real-time hypothesis testing, adapting on the fly, and learning from feedback within a single session. Current architectures just aren’t built for that.
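To put rough numbers on the first takeaway, the sketch below divides cost per task by score, using only figures quoted in this post. The "dollars per percentage point" ratio and the `dollars_per_point` helper are assumptions made up for this illustration, not a metric the ARC AGI team publishes, and since the ARC AGI 3 cost is given as "over $5,000", that result is a lower bound.

```python
def dollars_per_point(cost_per_task_usd: float, score_percent: float) -> float:
    """Hypothetical ratio: cost per task divided by score in percentage points."""
    if score_percent <= 0:
        return float("inf")  # no score at any price
    return cost_per_task_usd / score_percent

# ARC AGI 2: best performer at 72% for $39 per task (figures from this post).
arc2 = dollars_per_point(39, 72)

# ARC AGI 3: top score of 0.3% at over $5,000 per task (lower bound).
arc3 = dollars_per_point(5000, 0.3)

print(f"ARC AGI 2: ${arc2:.2f} per point")
print(f"ARC AGI 3: at least ${arc3:,.0f} per point")
```

Comparing across two different benchmarks this way is crude, but it shows why score alone hides most of the story.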
Tips and pitfalls to keep in mind
If you’re building AI products or evaluating model capabilities, ARC AGI 3 is a reality check. Don’t confuse high scores on coding or math benchmarks with general intelligence. Models that ace SWE-bench might completely fail at tasks requiring spatial reasoning and real-time adaptation.
Also worth noting: there’s a $2 million prize for anyone who can saturate this benchmark. That’s serious incentive, and it signals that the ARC AGI team believes this problem won’t be solved easily or soon.
The pitfall? Assuming that because AI handles your specific use case well, it’s close to AGI. ARC AGI 3 shows that the kind of flexible, intuitive reasoning humans do effortlessly is still miles away for current systems. That’s not a reason to panic. It’s a reason to be honest about where we actually are.
Why this benchmark stands out
Most major benchmarks follow the same pattern: AI struggles, then catches up, then surpasses humans. ARC AGI has resisted that cycle longer than most, and version 3 just widened the gap dramatically. The interactive format tests something fundamentally different from what transformers are optimized for, and that’s exactly the point.
The games themselves are all unique. No two are the same. You can’t memorize patterns from a training set. You just drop in and figure it out. That’s what generalization actually looks like.
If you want to try the puzzles yourself or dig into the research paper, check out the full video for all the links and details. It’s one of the clearest explanations of why this benchmark matters for the future of AI.