Anthropic's Fable Model: Benchmarks, Quirks & Workflows

So yesterday Anthropic shipped the model they once said was too dangerous to release. They built it, sat on it since January, and then quietly put it out on June 9th. I stumbled on a breakdown of it from Matthew Berman, an AI creator who got early access to the thing, and honestly his hands-on notes shifted how I think about where coding agents are headed.

Here’s the quick map before we dive in. The new family is called Fable, and it sits next to Haiku, Sonnet, and Opus. It’s in the same class as the “Mythos” model. The only real difference: Mythos has its guardrails stripped off (it’s handed to security researchers to harden software), while Fable keeps the guardrails on for everyone else. Same 10-trillion-parameter brain underneath.

What’s new

The creator walked through the benchmarks first, and the gaps are wide. A few that stood out from his rundown:

🧪 SWE-Bench Pro: Fable/Mythos hit ~80%, versus Opus 4.8 at 69% and GPT 5.5 at 58%.
Frontier Code (Diamond): 29.3%, roughly double Opus 4.8 and miles past GPT 5.5’s 5.7%.
GDPval (real-world knowledge work): edged out everything, with strong spatial reasoning where Claude used to lag.
Computer use: 85%, plus jumps on tool use, legal agent tasks, and Terminal Bench.

He’s honest that benchmarks can mislead. But his lived experience backed them up: he couldn’t even invent a test hard enough to trip it. The last time a model truly surprised him was Gemini 2.5 Pro solving a scrambled Rubik’s cube. Fable cleared that bar without breaking a sweat.

The twist

Now the part that’s actually unexpected. Berman says the model is almost too eager. Give it a tiny task and it treats it like a massive expedition. It wanted to crawl his entire codebase, weigh every angle, even peek at projects he hadn’t touched in years. Small prompts stopped feeling small the second he hit enter.

Two quirks really got him:

It’s painfully dense. The output packs in so much information that he had to slow his reading way down. He kept editing his config file begging it to “explain it like I’m five.” He admitted it made him feel dumb, which he’d never felt with another model.
It won’t stop asking questions. One prompt would spawn three to five clarifying questions, then a summary to confirm, then “can I write a spec?”, then “is the spec correct?”, then “should I run agents in parallel or sequentially?” By the time it built anything, he was ready to scream. But when it finally built, the result was great.

That density point sparked a wild tangent from him: if models keep packing more meaning into fewer words, future AI might invent its own hyper-dense language only it can read. Huge efficiency win, scary if we can’t follow along. I found that genuinely fun to sit with.

The workflow that unlocks it

The real takeaway from his testing isn’t the raw model. It’s how you drive it. Here’s the mini-playbook I pulled from his notes:

⚙️ Start on the lowest effort setting. He found even “medium” was usually overkill. Extra effort was just slow and excessive.
Lean on Workflows mode. This spins up a planning agent that delegates to potentially hundreds of sub-agents in parallel. On his fluid-sim test, 63 agents ran at once, each burning 20-30k tokens.
Wrap it in a loop. He references the “loops” idea (the layer above agentic coding) where the model keeps burning tokens toward a goal until it’s done.
Route your tasks. Fable costs $10 per million input tokens and $50 per million output. Save it for your hardest problems. Send everything else to Sonnet or Haiku.
Watch the slow start, then buckle up. He’d see it sit near 1,500 tokens for five to eight minutes, then explode to 1.5 million in 30 seconds.
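
To make the loop idea from the playbook concrete, here’s a minimal Python sketch. It’s my own illustration, not code from his video: `run_until_done`, `step`, and `token_budget` are hypothetical names, and `fake_step` is a stand-in for whatever would actually call the model. The one design choice worth noting is the hard token budget, since a loop that keeps burning tokens needs a stop condition you control.

```python
from typing import Callable, Tuple


def run_until_done(
    step: Callable[[int], Tuple[bool, int]],
    token_budget: int,
    max_iterations: int = 50,
) -> Tuple[bool, int]:
    """Call `step` repeatedly until it reports success or the budget runs out.

    `step` is a hypothetical callable: it receives the iteration index and
    returns (done, tokens_used) for that iteration.
    Returns (finished, total_tokens_spent).
    """
    spent = 0
    for i in range(max_iterations):
        done, used = step(i)
        spent += used
        if done:
            return True, spent
        if spent >= token_budget:
            # Budget exhausted before the goal was reached: stop the loop.
            break
    return False, spent


def fake_step(i: int) -> Tuple[bool, int]:
    """Stand-in for a model call: succeeds on the third iteration."""
    return (i == 2, 1_500)


finished, spent = run_until_done(fake_step, token_budget=10_000)
print(finished, spent)
```

In a real setup, `step` would wrap one agent turn plus a check against your goal (tests passing, a spec satisfied), and the budget would come from whatever you’re willing to spend on that task.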

The receipts back the spend. During early testing, Stripe reportedly used it to run a codebase-wide migration on 50 million lines of Ruby in a single day, work that would’ve taken a team over two months by hand. At that scale, $50 per million output tokens is a bargain.
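
For a sense of scale on the smaller end, here’s the arithmetic for the 63-agent fluid-sim run at the prices quoted above. His notes don’t say how those tokens split between input and output, so this sketch only brackets the bill: the floor assumes every token is billed as input at the low count, the ceiling assumes every token is billed as output at the high count. The function names are mine.

```python
INPUT_USD_PER_MILLION = 10.0   # Fable input price quoted in the playbook
OUTPUT_USD_PER_MILLION = 50.0  # Fable output price quoted in the playbook


def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a run given its input and output token counts."""
    return (
        input_tokens * INPUT_USD_PER_MILLION
        + output_tokens * OUTPUT_USD_PER_MILLION
    ) / 1_000_000


def fleet_cost_bounds(agents: int, low_tokens: int, high_tokens: int):
    """Bracket the bill for a parallel run.

    Assumption: the input/output split is unknown, so the floor treats all
    tokens as input and the ceiling treats all tokens as output.
    """
    floor = run_cost(agents * low_tokens, 0)
    ceiling = run_cost(0, agents * high_tokens)
    return floor, ceiling


low, high = fleet_cost_bounds(agents=63, low_tokens=20_000, high_tokens=30_000)
print(f"${low:.2f} to ${high:.2f}")
```

That works out to somewhere between about $12.60 and $94.50 for the whole fleet, which is pocket change next to two months of engineering time but adds up fast if every small prompt triggers a run like this.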

Pro tips

💡 Token budget has no obvious ceiling. The creator points to research suggesting you can keep throwing thinking tokens at a problem and the output keeps improving. Fable is especially hungry to use them, so cost control is on you.
Distillation gets blocked. Try to extract Fable’s capabilities to train a competitor and your requests fall back to the older Opus 4.8. He found that genuinely funny.
Safeguards are light in practice. Anthropic says filters trigger in under 5% of sessions; he didn’t hit a single false positive.
Don’t sleep on routing. Knowing which task goes to which model is the skill that saves you from those eye-watering API bills companies are already getting.
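
Routing can be as simple as a function that picks a model tier from a difficulty score. This is a sketch under my own assumptions: the 1-to-10 `difficulty` score, the thresholds, and the lowercase tier labels are all hypothetical (they are not real API model identifiers), and how you score a task is the hard part this example skips.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    description: str
    difficulty: int  # hypothetical 1-10 score you assign yourself


def route(task: Task) -> str:
    """Pick a model tier label for a task.

    Thresholds are illustrative assumptions, not recommendations from
    the video: only the hardest work goes to the most expensive tier.
    """
    if task.difficulty >= 8:
        return "fable"
    if task.difficulty >= 4:
        return "sonnet"
    return "haiku"


tasks = [
    Task("rename a variable", 1),
    Task("add pagination to an endpoint", 5),
    Task("migrate the whole codebase", 9),
]
for t in tasks:
    print(f"{t.description} -> {route(t)}")
```

Tune the thresholds against your own bills; the point is that the decision lives in one place instead of being made ad hoc per prompt.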

His bigger point landed hard for me: there’s a real “model overhang” here. The tools are already more powerful than almost anyone knows how to use, and software factories built from loops plus workflows plus a model this autonomous are closer than people think.

Worth a watch

If you build with agents at all, the full breakdown is worth your time. The creator demos a gorgeous 3D Rubik’s cube and a real-time fluid simulation, plus Fable beating Pokémon FireRed using vision alone. Go check out his video for the live tests and the full benchmark walk-through. 🚀
