GPT-5.4 vs Claude: Unified AI Model Benchmark Analysis

Most AI labs ship specialists. One model for code, another for creativity, another for reasoning. OpenAI just shipped a generalist that beats the specialists at their own game.

I’ve been following AI model releases closely, and when I saw this breakdown, I had to stop and go back over it a few times. Matthew Berman, the creator of the video, has had early access to GPT-5.4 for the past week, and what he found is genuinely worth your attention.

The core idea: stop choosing between models

Until now, OpenAI had a fragmentation problem. GPT-5.2 was the personality and creativity model. GPT-5.3 Codex was the coding powerhouse. Want both? Pick one and accept the trade-off. Meanwhile, Anthropic’s Claude Opus 4.6 had already solved this: world knowledge, logic, personality, and code in a single package.

GPT-5.4 is OpenAI’s answer. As Berman describes it, they basically took 5.2 and 5.3 Codex and had them converge into one. The result is a unified flagship model that handles coding, creative writing, agentic workflows, browser use, computer use, and document processing, all without switching tools.

What the benchmarks actually show

Berman ran through the key numbers, and the comparisons are telling:

🖥️ OSWorld (computer use): GPT-5.4 Thinking scores 75%, vs. 74% for GPT-5.3 Codex and 72.7% for Opus 4.6
SWE-bench Pro: 57.7% for GPT-5.4 Thinking, edging out the coding-specialist GPT-5.3 Codex at 56.8%
📊 GDPval (real-world knowledge work): 83% for GPT-5.4 Thinking, 13 points above GPT-5.3 Codex and 5 points above Opus 4.6’s 78%
FrontierMath: GPT-5.4 also comes out ahead across the board
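
To make the gaps easier to compare, here is a minimal Python sketch that computes GPT-5.4 Thinking's lead on each benchmark from the figures quoted above. The GPT-5.3 Codex GDPval score of 70% is not quoted directly; I derived it from the stated 13-point gap, so treat it as an assumption. FrontierMath is left out because no numbers are given for it.

```python
# Scores in percent, as quoted in the list above.
SCORES = {
    "OSWorld": {"GPT-5.4 Thinking": 75.0, "GPT-5.3 Codex": 74.0, "Opus 4.6": 72.7},
    "SWE-bench Pro": {"GPT-5.4 Thinking": 57.7, "GPT-5.3 Codex": 56.8},
    # Codex value is derived (83 - 13), not quoted directly.
    "GDPval": {"GPT-5.4 Thinking": 83.0, "GPT-5.3 Codex": 70.0, "Opus 4.6": 78.0},
}


def margins(scores, leader="GPT-5.4 Thinking"):
    """Return the leader's lead, in percentage points, over each other model."""
    result = {}
    for benchmark, row in scores.items():
        result[benchmark] = {
            model: round(row[leader] - score, 1)
            for model, score in row.items()
            if model != leader
        }
    return result


for benchmark, leads in margins(SCORES).items():
    for model, lead in leads.items():
        print(f"{benchmark}: +{lead} points over {model}")
```

The leads over the coding specialist are narrow (about one point on OSWorld and SWE-bench Pro), while the GDPval gap is much wider.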

One detail from the OSWorld efficiency chart stands out. GPT-5.2 topped out at around 50% accuracy using 42 tool calls. GPT-5.4 hits 75% accuracy with only 15 tool calls. Higher accuracy, roughly a third of the tool calls, and likely lower cost per task. That’s not a minor tweak.
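
The arithmetic behind that efficiency claim is easy to check. The sketch below uses only the numbers quoted above; "accuracy points per tool call" is my own illustrative ratio, not a metric from the chart.

```python
# Figures from the efficiency chart as quoted above (the GPT-5.2 accuracy is approximate).
runs = {
    "GPT-5.2": {"accuracy_pct": 50.0, "tool_calls": 42},
    "GPT-5.4": {"accuracy_pct": 75.0, "tool_calls": 15},
}


def points_per_call(run):
    """Accuracy percentage points per tool call (illustrative, not an official metric)."""
    return run["accuracy_pct"] / run["tool_calls"]


old, new = runs["GPT-5.2"], runs["GPT-5.4"]
accuracy_gain = new["accuracy_pct"] - old["accuracy_pct"]
call_reduction_pct = 100 * (old["tool_calls"] - new["tool_calls"]) / old["tool_calls"]
efficiency_ratio = points_per_call(new) / points_per_call(old)

print(f"Accuracy gain: {accuracy_gain:.0f} points")
print(f"Tool-call reduction: {call_reduction_pct:.1f}%")
print(f"Accuracy per tool call: {efficiency_ratio:.1f}x")
```

That works out to 25 more accuracy points from about 64% fewer tool calls.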

Old way vs. new way

Here’s the contrast that matters most. Before 5.4, the workflow for someone building serious agentic systems looked like this: use Codex for coding tasks, switch to GPT-5.2 for writing or reasoning, manage two model configurations, accept gaps where each one fell short.
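
In code, that two-model workflow amounts to a routing table. The sketch below is purely illustrative: the task categories are hypothetical, and the model names are display labels, not real API identifiers.

```python
# Hypothetical task categories mapped to model display names (not API identifiers).
OLD_ROUTES = {
    "coding": "GPT-5.3 Codex",
    "writing": "GPT-5.2",
    "reasoning": "GPT-5.2",
}

# With a unified model, every task type maps to the same entry.
UNIFIED_ROUTES = {task: "GPT-5.4" for task in OLD_ROUTES}


def pick_model(task_type, routes):
    """Look up the model for a task type, failing loudly on unconfigured types."""
    try:
        return routes[task_type]
    except KeyError:
        raise ValueError(f"no model configured for task type {task_type!r}") from None


def configs_to_maintain(routes):
    """Count the distinct model configurations a routing table implies."""
    return len(set(routes.values()))


print(pick_model("coding", OLD_ROUTES), "/", configs_to_maintain(OLD_ROUTES), "configs")
print(pick_model("coding", UNIFIED_ROUTES), "/", configs_to_maintain(UNIFIED_ROUTES), "config")
```

The point is less the lookup itself than what disappears: one configuration to tune instead of two, and no seams between them.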

With GPT-5.4, Berman walks through how it handles tasks that previously required switching models:

Reading and extracting data from PDFs and structured documents
Writing and sending emails through Gmail, with visible cursor actions at what appears to be real-time, unaccelerated speed
Bulk data entry extracted from JSON objects, again at real-time pace
Building a fully functional theme park simulation from a single loosely specified prompt, complete with guest happiness logic, funds tracking, park rating, and placeable attractions
Generating a 2D RPG game with full character assets and turn-based mechanics

The simulation demo is the one Berman calls possibly the best demo he’s ever seen, and it’s easy to see why. Both games were built from minimal prompts, which suggests the model carries a lot of implicit design judgment without needing detailed hand-holding.

Two variants, one big context window

OpenAI shipped two versions: GPT-5.4 Thinking and GPT-5.4 Pro. Both now include a 1-million-token context window, matching Claude’s context length for the first time. Thinking also introduces an upfront planning mode, similar to how Cursor lets you plan before building. The model outlines its approach before burning tokens on execution, which gives you a checkpoint to redirect it before it heads in the wrong direction.

The pricing reality

This is where things get uncomfortable. Berman lays it out directly:

GPT-5.4 input: $2.50 per million tokens (up from $1.75 for GPT-5.2)
GPT-5.4 Pro input: $30 per million tokens (up from $21 for 5.2 Pro)
GPT-5.4 Pro output: $180 per million tokens

Caching helps on input costs, but output remains expensive regardless. Frontier intelligence is getting more capable and more expensive at the same time.
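
To put those prices in perspective, here is a rough Python sketch that computes the size of the input price increases and estimates the cost of a single GPT-5.4 Pro request. The prices are the ones quoted above; the request size (50,000 input tokens, 5,000 output tokens) is a made-up example, and caching discounts are ignored because no rates are given.

```python
# Input prices in USD per million tokens, as quoted above.
INPUT_PRICE = {
    "GPT-5.2": 1.75,
    "GPT-5.4": 2.50,
    "GPT-5.2 Pro": 21.00,
    "GPT-5.4 Pro": 30.00,
}
PRO_OUTPUT_PRICE = 180.00  # GPT-5.4 Pro output, USD per million tokens


def pct_increase(old, new):
    """Percentage increase from old to new."""
    return 100 * (new - old) / old


def pro_request_cost(input_tokens, output_tokens):
    """Estimated USD cost of one GPT-5.4 Pro request, ignoring caching discounts."""
    input_cost = input_tokens * INPUT_PRICE["GPT-5.4 Pro"]
    output_cost = output_tokens * PRO_OUTPUT_PRICE
    return (input_cost + output_cost) / 1_000_000


base_increase = pct_increase(INPUT_PRICE["GPT-5.2"], INPUT_PRICE["GPT-5.4"])
pro_increase = pct_increase(INPUT_PRICE["GPT-5.2 Pro"], INPUT_PRICE["GPT-5.4 Pro"])
example_cost = pro_request_cost(50_000, 5_000)  # hypothetical request size

print(f"Standard input price increase: {base_increase:.1f}%")
print(f"Pro input price increase: {pro_increase:.1f}%")
print(f"Example Pro request: ${example_cost:.2f}")
```

Both tiers went up by the same proportion, roughly 43%. In the example request, output is a tenth of the input volume but accounts for $0.90 of the $2.40 total, which is why output pricing dominates for generation-heavy workloads.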

Known weaknesses from early testers

Berman also includes reactions from other early access users, and they flag real limitations worth knowing before you commit:

Front-end design taste lags behind both Opus 4.6 and Gemini 3.1 Pro
It can miss obvious real-world context. One tester had it plan a trip itinerary that looked perfect on paper but placed the group at locations that would be packed with spring breakers
Inside agentic environments, it reportedly stops before finishing tasks

Sam Altman responded publicly that these issues are being fixed immediately. Whether that applies to the current release or a near-term patch is unclear.

Practical steps if you want to try it

Pull OpenAI’s official GPT-5.4 prompting guide, already published and available
Either rewrite your existing prompts for 5.4 or maintain two separate prompt sets for 5.4 and Opus
Treat this as a hard requirement: prompting GPT-5.4 is meaningfully different from prompting Claude models, and your existing Claude prompts will not transfer cleanly

Berman flags this as one of the most important practical details for anyone switching between model families.
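
If you go the two-prompt-set route, one simple way to organize it is a lookup keyed by model family and task. The sketch below is an assumption about how you might structure this, not anything from the video or from either vendor. The family keys, task name, and bracketed prompt text are all placeholders; the real wording should come from each vendor's prompting guide.

```python
from string import Template

# Placeholder prompt sets, one per model family. The bracketed text stands in
# for prompts you would write and tune separately for each family.
PROMPT_SETS = {
    "openai": {
        "summarize": Template("[GPT-5.4 version of the summarize prompt]\n\n$document"),
    },
    "anthropic": {
        "summarize": Template("[Claude version of the summarize prompt]\n\n$document"),
    },
}


def render_prompt(family, task, **fields):
    """Fill in the prompt template for a given model family and task."""
    try:
        template = PROMPT_SETS[family][task]
    except KeyError:
        raise KeyError(f"no prompt for family={family!r}, task={task!r}") from None
    return template.substitute(**fields)


gpt_prompt = render_prompt("openai", "summarize", document="Quarterly report text")
claude_prompt = render_prompt("anthropic", "summarize", document="Quarterly report text")
print(gpt_prompt)
print(claude_prompt)
```

Keeping both sets behind one function means the calling code never changes when you switch families, and a missing prompt fails loudly instead of silently reusing the wrong one.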

Worth your time

The broader picture is that both OpenAI and Anthropic have clearly figured out their pre-training cycles. Models are shipping faster than ever, and they’re converging on the same goal: one model that handles everything a knowledge worker actually needs. GPT-5.4 is the clearest signal yet that OpenAI is back in that race in a serious way.

Check out the full video for the live demos and benchmark deep-dives. The simulation game alone is worth a few minutes of your time.