Opus 4.7: Features, Benchmarks & Usage Best Practices

So I’m sipping coffee this morning, expecting a quiet Tuesday, and boom: Opus 4.7 lands, right after last week’s Mythos preview, which Anthropic said was “too powerful” to ship publicly. Then Matthew Berman walked through the numbers in his breakdown, and honestly, I got pulled into his confusion too.

Here’s the twist Berman caught: Opus 4.7 isn’t a small bump. It’s a single dot release that closes nearly half the gap to the Mythos model Anthropic refuses to release. So where exactly is the line in the sand? That’s the question he kept circling, and it’s the one worth sitting with.

🧠 What’s actually new in Opus 4.7

Berman laid out the jumps, and a few stood out as more than incremental:

🔹 SWE-bench Pro leapt from 53.4 to 64.3. Mythos preview sits near 75, so 4.7 basically met it halfway in one iteration.
🔹 SWE-bench Verified moved from 80 to 87, closing half the gap to Mythos’s 94.
🔹 Document reasoning shot from 57.1 to 80.6, crushing GPT-5.4 and Gemini 3.1 Pro on the same test.
🔹 Vending Bench (long-horizon coherence running a fake vending business) climbed from roughly $8K to nearly $11K in ending balance.
🔹 Biomolecular reasoning more than doubled, 30 to 74.
🔹 Vision got a real upgrade, especially at high resolution on screenshot navigation tasks.

The one benchmark that went the other way? Cybersecurity vulnerability reproduction dropped slightly, from 73.8 to 73.1. Mythos is at 83.1. Berman’s read: Anthropic may have intentionally dulled that capability, and they basically admit it in the model card.
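
To sanity-check the “met it halfway” claim, here’s a quick sketch of the arithmetic. The function name gap_closed is mine, not something from the video, and since Mythos on SWE-bench Pro is only quoted as “near 75”, the 75.0 below is an assumption.

```python
def gap_closed(old: float, new: float, target: float) -> float:
    """Fraction of the old-to-target gap covered by the new score."""
    return (new - old) / (target - old)


# (previous Opus, Opus 4.7, Mythos preview) scores as quoted in this post.
scores = {
    "SWE-bench Pro": (53.4, 64.3, 75.0),  # Mythos is "near 75"; 75.0 is assumed
    "SWE-bench Verified": (80.0, 87.0, 94.0),
    "Cyber vuln reproduction": (73.8, 73.1, 83.1),
}

for name, (old, new, mythos) in scores.items():
    share = gap_closed(old, new, mythos)
    print(f"{name}: {share:+.0%} of the gap to Mythos")
```

Both SWE-bench numbers come out at about half the gap, while the cyber eval comes out slightly negative, which is exactly the pattern the Mythos theory below tries to explain.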

🎯 The Mythos mystery

Berman floated a theory I found pretty convincing. Mythos is likely a fresh training run, rumored around 10 trillion parameters. Opus 4.5, 4.6, and 4.7 are squeezing more juice out of an older, smaller run (maybe a tenth of that size). Anthropic is drawing the release line at Mythos, not at a specific capability score. So every Opus dot release can keep climbing, as long as it stays under the Mythos ceiling on sensitive evals like cyber.

Berman also pointed out a business angle worth noting: Anthropic is reportedly at $30B ARR and doubling fast, and their flywheel is coding. Better coding model, more enterprise revenue, more GPUs, better next model. Rinse, repeat.

🛠️ How to actually use 4.7 (pro tips from the breakdown)

This is where Berman got practical, and it matters if you’re already running Opus in production:

📝 Retune your prompts. 4.7 follows instructions literally. Old prompts that relied on the model “interpreting loosely” can now backfire.
🚫 Drop the all-caps, drop the “don’t do X” negatives, drop the bold-everywhere habit. Anthropic’s own guide says to state what you want, cleanly.
🎚️ Use the new Extra High thinking level. It sits between High and Max, so you can dial reasoning vs latency on hard problems.
👀 Try the /ultra-review command for a built-in separate code reviewer while you code.
💾 Lean on file-system memory. 4.7 is better at keeping notes across long multi-session work, so you need less upfront context.
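
If you have a pile of old prompts to retune, a rough first pass can be automated. This is a minimal sketch, not anything from Anthropic’s guide: the lint_prompt name and the three regex heuristics are my own assumptions, and they will throw false positives (a legit acronym like JSON counts as all-caps).

```python
import re

# Hypothetical heuristics for the three habits listed above.
ALL_CAPS = re.compile(r"\b[A-Z]{4,}\b")  # shouted words, 4+ letters
NEGATIVE = re.compile(r"\b(?:don['’]t|do not|never)\b", re.IGNORECASE)
BOLD = re.compile(r"\*\*[^*]+\*\*")  # markdown-style bold


def lint_prompt(prompt: str) -> list[str]:
    """Return a warning for each habit found in the prompt."""
    warnings = []
    if ALL_CAPS.search(prompt):
        warnings.append("all-caps emphasis")
    if NEGATIVE.search(prompt):
        warnings.append("negative instruction")
    if BOLD.search(prompt):
        warnings.append("bold markup")
    return warnings


print(lint_prompt("NEVER use tabs. **Important**: don't skip tests."))
print(lint_prompt("Use spaces for indentation and run the tests."))
```

Treat the warnings as a to-do list for a human rewrite, not as an automatic fix. The point is to find the prompts that lean on shouting so you can restate them as plain positive instructions.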

⚠️ The token crunch nobody’s talking about enough

Here’s the catch Berman flagged. Opus 4.7 ships with a new tokenizer that maps the same input to roughly 1.0x to 1.35x as many tokens, depending on content. On top of that, the model thinks more at higher effort, especially on later agentic turns. So you’re paying more tokens for more capability, right as Anthropic is visibly rationing capacity (quota cuts, OpenClaw subscription restrictions).
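
For a feel of what that tokenizer range means in practice, here’s a back-of-envelope sketch. The function name projected_tokens is hypothetical, and it only covers the 1.0x to 1.35x tokenizer factor quoted above; the extra thinking tokens aren’t quantified in the breakdown, so they’re left out.

```python
def projected_tokens(tokens_before: int,
                     low: float = 1.0,
                     high: float = 1.35) -> tuple[int, int]:
    """Best-case and worst-case token counts for the same input on 4.7."""
    return round(tokens_before * low), round(tokens_before * high)


# Example: a workload that used 1M input tokens under the old tokenizer.
best, worst = projected_tokens(1_000_000)
print(f"Expect between {best:,} and {worst:,} tokens for the same input")
```

Where you land in that range depends on your content, so measure a sample of your real traffic rather than assuming the worst case.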

Translation: watch your spend. If you were running tight loops on 4.6, expect the same workload to cost noticeably more on 4.7. Budget accordingly, and consider routing cheaper subtasks to Haiku or Sonnet.
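
One way to picture that routing idea is below. It’s a sketch under loud assumptions: the Subtask fields, the pick_tier name, and the 50,000-token threshold are all made up for illustration, and the returned strings are just tier labels, not real API model IDs.

```python
from dataclasses import dataclass


@dataclass
class Subtask:
    name: str
    needs_deep_reasoning: bool  # hypothetical flag you would set per task
    context_tokens: int


def pick_tier(task: Subtask, long_context_threshold: int = 50_000) -> str:
    """Send only the hard work to Opus; everything else goes cheaper."""
    if task.needs_deep_reasoning:
        return "opus"
    if task.context_tokens > long_context_threshold:
        return "sonnet"  # assumed middle tier for bulky but simple work
    return "haiku"


jobs = [
    Subtask("refactor auth module", True, 30_000),
    Subtask("summarize long changelog", False, 80_000),
    Subtask("rename variables", False, 2_000),
]
for job in jobs:
    print(f"{job.name} -> {pick_tier(job)}")
```

The design choice is simply to make Opus the exception rather than the default in a tight loop. Tune the flag and threshold against your own quality checks before trusting the split.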

🤖 The welfare and alignment angle

One bit I found genuinely interesting: Anthropic keeps testing model welfare, looking at internal emotion representations, and 4.7 rates its own circumstances more positively than any prior model they’ve shipped. Meanwhile, Mythos (the unreleased one) is actually the most aligned of the bunch. So the public gets a slightly less aligned, less capable model, and the more aligned, more capable one stays in the lab. Make of that what you will.

💡 Practical takeaway

If your workload is frontend design, long-context work, document reasoning, or agentic coding, 4.7 looks like the new default. If you’re on GPT-5.4 for backend and bouncing to Opus for UI, Berman’s instinct is that 4.7 widens that gap further. Just retune your prompts before you judge it.

Watch the full breakdown for the benchmark-by-benchmark walkthrough and the Mythos theory in Berman’s own words.