Open AI Models: Unpacking Benchmarks & New Releases

Every major open frontier lab dropped a new model this month, and the U.S. government’s Center for AI Standards and Innovation (CAISI) used the moment to claim the gap between open and closed models is widening. According to Interconnects, CAISI’s DeepSeek V4 assessment ran nine benchmarks through an Item Response Theory (IRT) framework to calculate Elo scores, and the headline finding was bleak for open models. But that finding rests heavily on a few methodological choices, and that’s where this gets interesting.

What CAISI actually measured

CAISI’s report leans on three benchmarks that swing the Elo math hard: CTF-Archive-Diamond (run on a subset and extrapolated via IRT for DeepSeek V4), PortBench (a CAISI-private benchmark), and ARC-AGI-2 (scored differently than the public leaderboard). Interconnects reports that these three are the main reason DeepSeek V4 looks so far behind. Swap CAISI’s setup for Epoch AI’s Epoch Capabilities Index (ECI), which also uses IRT but with a different set of benchmarks, and the open-to-closed gap has held at roughly 3 to 7 months since DeepSeek R1.
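To make the IRT step concrete, here is a minimal Python sketch, assuming a two-parameter logistic (2PL) model and a Rasch-style conversion to an Elo-like scale. The article does not say which IRT variant or scaling CAISI or Epoch AI use, so the item parameters, the grid-search fit, the 1500 anchor, and every function name below are illustrative assumptions, not their method. The sketch fits an ability estimate on a subset of items and extrapolates it to an expected score on a full benchmark, which is why the choice of items can move the final number so much.

```python
import math

def p_correct(theta, a, b):
    """2PL IRT: probability that a model with ability theta solves an item
    with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def estimate_ability(responses, items, lo=-6.0, hi=6.0, steps=1201):
    """Grid-search maximum-likelihood estimate of theta.

    responses: list of 0/1 outcomes; items: list of (a, b) pairs.
    """
    best_theta, best_ll = lo, -math.inf
    for i in range(steps):
        theta = lo + (hi - lo) * i / (steps - 1)
        ll = 0.0
        for y, (a, b) in zip(responses, items):
            p = p_correct(theta, a, b)
            ll += math.log(p) if y else math.log(1.0 - p)
        if ll > best_ll:
            best_theta, best_ll = theta, ll
    return best_theta

def expected_score(theta, items):
    """Predicted fraction of items solved at ability theta."""
    return sum(p_correct(theta, a, b) for a, b in items) / len(items)

def theta_to_elo(theta, anchor=1500.0):
    """Assumed mapping: one logit of ability equals 400 / ln(10) Elo points."""
    return anchor + 400.0 / math.log(10) * theta

# Hypothetical item bank: (discrimination a, difficulty b) per task.
full_benchmark = [(1.0, -2.0), (1.2, -1.0), (1.0, 0.0),
                  (1.5, 1.0), (1.0, 2.0), (1.3, 2.5)]
subset = full_benchmark[:4]   # only these items were actually run
observed = [1, 1, 1, 0]       # hypothetical pass/fail outcomes on the subset

theta = estimate_ability(observed, subset)
print(f"ability: {theta:.2f}, Elo-like: {theta_to_elo(theta):.0f}")
print(f"extrapolated full-benchmark score: {expected_score(theta, full_benchmark):.2f}")
```

Under this kind of model, the extrapolated score depends entirely on the item parameters assigned to the tasks that were never run, so a small or unrepresentative subset can pull the estimate in either direction.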

What stands out here is the harness problem. Both CAISI and ECI evaluate coding tasks with bash access and a fixed token budget. Neither uses Claude Code or OpenCode, the kind of agent harnesses these models are actually trained against. The result: the benchmarks imply that porting applications between languages is out of reach, while in the real world Bun was ported from Zig to Rust across one million lines of code. Florian at Interconnects argues true open-model performance is closer to closed alternatives than the numbers suggest. Nathan thinks the gap is real but agrees the benchmarks are imperfect.

The November release wave

Interconnects catalogs a stacked month of open releases:

Gemma 4 (Google): 4B, 9B, and 31B dense models plus a 26B-A4B MoE. Big news: Google moved Gemma 4 to Apache 2.0, killing the legal ambiguity of its old custom license.
DeepSeek V4: Two sizes, Pro (1.6T-A49B MoE) and Flash (284B-A13B). Early consensus says Flash is the real winner; Pro underdelivers for its size. The tech report details architecture changes that make long context cheaper.
Kimi K2.6 (Moonshot AI): Stronger across the board with a focus on long-horizon tasks, important for autoresearch-style agents.
MiMo V2.5 Pro (Xiaomi): Apache 2.0, neck and neck with Kimi K2.6 and GLM-5.1 in benchmarks and real-world use.
GLM-5.1 (Zhipu): Long-horizon improvements across the board.
Qwen3.6-35B-A3B (Alibaba): An update targeting one of the most popular sizes for local deployment.
Laguna-XS.2 (Poolside): First public Poolside coding model, 33B-A3B, with a blog post worth reading on reward hacking in coding evals.
Trinity-Large-Thinking (Arcee): Reasoning variant that’s been topping OpenRouter charts.
LFM2.5-350M (Liquid AI): 28 trillion tokens trained into 350M parameters, likely the most overtrained model out there.
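The size labels in this list follow a total-then-active convention for MoE models, so 26B-A4B means 26B total parameters with about 4B active per token. As a rough sketch of the arithmetic behind these labels and behind the "overtrained" claim, the Python below parses a few of them and computes two ratios. The helper names are made up for illustration, and the comparison point of roughly 20 tokens per parameter is the commonly cited Chinchilla rule of thumb, which comes from outside this article.

```python
import re

_UNITS = {"M": 1e6, "B": 1e9, "T": 1e12}

def parse_count(text):
    """Turn a string like '350M', '49B', or '1.6T' into a float."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([MBT])", text)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return float(match.group(1)) * _UNITS[match.group(2)]

def parse_moe(label):
    """Split a 'total-Aactive' label such as '1.6T-A49B' into (total, active)."""
    total, active = label.split("-A")
    return parse_count(total), parse_count(active)

def active_fraction(label):
    """Share of the model's parameters used for each token."""
    total, active = parse_moe(label)
    return active / total

def tokens_per_parameter(tokens, params):
    """Training tokens divided by parameter count."""
    return parse_count(tokens) / parse_count(params)

# Labels taken from the release list above.
for label in ["26B-A4B", "33B-A3B", "35B-A3B", "1.6T-A49B"]:
    print(f"{label}: {active_fraction(label):.1%} of weights active per token")

ratio = tokens_per_parameter("28T", "350M")
print(f"LFM2.5-350M: {ratio:,.0f} tokens per parameter")
```

For LFM2.5-350M this works out to 80,000 tokens per parameter, several thousand times the rule-of-thumb figure, which is the sense in which the model is heavily overtrained.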

Why this matters for practitioners

The practical takeaway is that benchmark Elo scores aren’t telling you what these models can do in your stack. If you’re evaluating open models, use their preferred harnesses and model-specific prompting before deciding they’re behind. MoE models in the 26B-A4B to 35B-A3B range are clustering around real local-use sweet spots, and Apache 2.0 licensing on Gemma 4 and MiMo removes a big procurement headache.
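One way to act on this is to score the same model on the same tasks under two harnesses and check whether the difference is larger than noise. The Python below is a minimal sketch of that comparison using a paired bootstrap. The pass/fail lists are invented for illustration and are not results from any real model or harness, and the function names are hypothetical.

```python
import random

def pass_rate(results):
    """Fraction of tasks solved, given a list of 0/1 outcomes."""
    return sum(results) / len(results)

def paired_bootstrap_ci(baseline, candidate, n_resamples=2000, seed=0):
    """95% percentile interval for the pass-rate difference (candidate - baseline).

    Resamples tasks with replacement, keeping each task's pair of outcomes
    together, so task difficulty cancels out of the comparison.
    """
    if len(baseline) != len(candidate):
        raise ValueError("both harnesses must be scored on the same tasks")
    rng = random.Random(seed)
    deltas = [c - b for b, c in zip(baseline, candidate)]
    n = len(deltas)
    means = sorted(
        sum(rng.choice(deltas) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int(0.025 * n_resamples)]
    hi = means[int(0.975 * n_resamples) - 1]
    return lo, hi

# Hypothetical outcomes for one model on the same 20 tasks.
generic_bash =   [1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0]
native_harness = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0]

diff = pass_rate(native_harness) - pass_rate(generic_bash)
lo, hi = paired_bootstrap_ci(generic_bash, native_harness)
print(f"generic: {pass_rate(generic_bash):.2f}, native: {pass_rate(native_harness):.2f}")
print(f"difference: {diff:.2f} (95% interval {lo:.2f} to {hi:.2f})")
```

If the interval excludes zero on your own tasks, the harness is changing the outcome, and a verdict reached under the generic setup probably should not be carried over to the model as a whole.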

The deeper story is that government assessments built on standardized harnesses risk underselling what open-weight models can actually do once you wire them into real agentic loops. Expect more debate as Florian and Nathan keep digging into this disagreement.

Full breakdown at the original Interconnects post.
