An open-weights model from Chinese startup Moonshot AI just outranked Claude Opus 4.7, GPT-5.5, and Gemini Pro 3.1 in a head-to-head programming tournament. According to Hacker News, day 12 of Rohana Rezel’s ongoing AI Coding Contest pitted ten major language models against each other on the Word Gem Puzzle, and Kimi K2.6 took the crown with a 7-1-0 record and 22 match points.
What stands out here isn’t just the upset. It’s that two Chinese models, Kimi K2.6 and Xiaomi’s MiMo V2-Pro, finished ahead of every model from a Western frontier lab. This isn’t a clean China-versus-West narrative either: Zhipu AI’s GLM 5.1 landed fourth and DeepSeek V4 finished eighth. Two models won. The rest scrambled.
The challenge
The Word Gem Puzzle is a sliding-tile letter game on grids ranging from 10×10 to 30×30. Bots slide tiles into a blank space and claim valid English words formed in horizontal or vertical lines. Scoring rewards length and punishes brevity: three-letter words cost three points, while an eight-letter word earns two. Each pair of models played five rounds with a ten-second wall-clock limit per round. On smaller boards, seed crossword words mostly survive the scramble. On 30×30 grids, almost nothing survives, so reconstruction through actual sliding becomes the only path to points.
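The two values quoted above fit a simple linear rule, `length - 6`. That rule is inferred from those two data points only and is not stated in the write-up, so the sketch below should be read as an assumption. The name `word_score` is illustrative.

```python
def word_score(word: str) -> int:
    """Score a claimed word under the ASSUMED rule: length minus 6.

    This matches the two quoted values (3 letters -> -3, 8 letters -> +2),
    but the real contest rule may differ for other lengths.
    """
    return len(word) - 6


# Under this assumption, six-letter words break even and
# seven letters is the shortest length that earns anything.
for sample in ("cat", "puzzle", "puzzles", "absolute"):
    print(sample, word_score(sample))
```

If the rule does hold, it would explain why a bot might ignore everything shorter than seven letters.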
Nvidia’s Nemotron Super 3 produced code with a syntax error and never connected, so only nine of the ten models actually competed.
The results
- Kimi K2.6: 22 match points, 7-1-0 record
- MiMo V2-Pro: 20 match points, 6-2-0 record
- GPT-5.5: 16 match points, 5-1-2 record
- GLM 5.1: 15 match points, 5-0-3 record
- Claude Opus 4.7: 12 match points, 4-0-4 record
- Gemini Pro 3.1: 9 match points, 3-0-5 record
- Grok Expert 4.2: 9 match points, 3-0-5 record
- DeepSeek V4: 3 match points, 1-0-7 record
- Muse Spark: 0 match points, 0-0-8 record
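The article does not spell out the record format or the points scheme. One reading that reproduces every row is wins-draws-losses with 3 points per win and 1 per draw. The check below is a sketch under that assumption, with the figures copied from the list above.

```python
# (wins, draws, losses, reported match points), copied from the standings.
STANDINGS = {
    "Kimi K2.6": (7, 1, 0, 22),
    "MiMo V2-Pro": (6, 2, 0, 20),
    "GPT-5.5": (5, 1, 2, 16),
    "GLM 5.1": (5, 0, 3, 15),
    "Claude Opus 4.7": (4, 0, 4, 12),
    "Gemini Pro 3.1": (3, 0, 5, 9),
    "Grok Expert 4.2": (3, 0, 5, 9),
    "DeepSeek V4": (1, 0, 7, 3),
    "Muse Spark": (0, 0, 8, 0),
}


def match_points(wins: int, draws: int, losses: int) -> int:
    # ASSUMED convention: 3 per win, 1 per draw, 0 per loss.
    return 3 * wins + draws


for name, (w, d, l, reported) in STANDINGS.items():
    assert match_points(w, d, l) == reported, name
    assert w + d + l == 8, name  # nine bots, so eight opponents each
```

The totals are also internally consistent under this reading: wins and losses across the field both sum to 34.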
What actually happened
Kimi won by sliding aggressively. Its code scored each possible move by what new positive-value words it could unlock, executed the best one, and repeated. The strategy had flaws (some inefficient back-and-forth oscillation on smaller grids) but the sheer slide volume paid off on 30×30 boards. Cumulative score: 77, the highest in the tournament.
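A greedy slider of the kind described can be sketched as follows. This is not Kimi's actual code: the grid representation, the `_` blank marker, all function names, and the `length - 6` scoring rule are assumptions made for illustration.

```python
from typing import List, Optional, Set, Tuple

BLANK = "_"  # hypothetical marker for the empty cell
Grid = List[List[str]]
Pos = Tuple[int, int]


def word_value(word: str) -> int:
    # ASSUMED scoring rule: length minus 6.
    return len(word) - 6


def grid_words(grid: Grid, dictionary: Set[str]) -> Set[str]:
    """Collect positive-value dictionary words found in any row or column."""
    lines = ["".join(row) for row in grid]
    lines += ["".join(col) for col in zip(*grid)]
    found = set()
    for line in lines:
        for start in range(len(line)):
            for end in range(start + 1, len(line) + 1):
                candidate = line[start:end]
                if candidate in dictionary and word_value(candidate) > 0:
                    found.add(candidate)
    return found


def find_blank(grid: Grid) -> Pos:
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if cell == BLANK:
                return r, c
    raise ValueError("grid has no blank cell")


def slide(grid: Grid, tile: Pos) -> Grid:
    """Return a copy of the grid with the tile at `tile` moved into the blank."""
    br, bc = find_blank(grid)
    tr, tc = tile
    new = [row[:] for row in grid]
    new[br][bc], new[tr][tc] = new[tr][tc], BLANK
    return new


def best_slide(grid: Grid, dictionary: Set[str]) -> Optional[Tuple[Pos, int]]:
    """Score each legal slide by the value of newly formed words; return the best."""
    before = grid_words(grid, dictionary)
    br, bc = find_blank(grid)
    best = None
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = br + dr, bc + dc
        if 0 <= r < len(grid) and 0 <= c < len(grid[0]):
            unlocked = grid_words(slide(grid, (r, c)), dictionary) - before
            gain = sum(word_value(w) for w in unlocked)
            if best is None or gain > best[1]:
                best = ((r, c), gain)
    return best
```

Calling `best_slide` in a loop and applying the returned move gives the "score, execute, repeat" cycle. A purely greedy loop like this has no memory of earlier positions, so when every move scores zero it can shuttle the same tile back and forth, which may be related to the oscillation noted above.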
MiMo took the opposite approach. Its sliding logic existed in the code but never triggered. Instead, it scanned the initial grid for seven-letter-plus words and fired all claims in a single TCP packet. Brittle, but devastating when seed words survived.
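The scan-and-batch idea can be sketched like this. The contest's wire protocol is not described in the article, so the `CLAIM <word>` line format and both function names are hypothetical, and the code is not MiMo's.

```python
from typing import List, Set


def long_words(grid: List[List[str]], dictionary: Set[str], min_len: int = 7) -> List[str]:
    """Find dictionary words of at least `min_len` letters in rows and columns."""
    lines = ["".join(row) for row in grid]
    lines += ["".join(col) for col in zip(*grid)]
    found = set()
    for line in lines:
        for start in range(len(line) - min_len + 1):
            for end in range(start + min_len, len(line) + 1):
                if line[start:end] in dictionary:
                    found.add(line[start:end])
    return sorted(found)


def build_claim_batch(words: List[str]) -> bytes:
    """Join all claims into one payload so they can go out in a single write.

    The "CLAIM <word>\n" framing is a made-up placeholder format.
    """
    return "".join(f"CLAIM {w}\n" for w in words).encode("ascii")
```

Passing the resulting payload to a single `socket.sendall` call would usually, though not necessarily, put a small batch on the wire as one TCP segment. The approach never touches the board, so it only pays off when long seed words are already intact.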
Claude and Grok also didn’t slide. They held up on smaller grids, then fell apart when 30×30 boards demanded actual tile movement. GPT-5.5 played conservatively, with about 120 slides per round, and posted strong numbers on 15×15 and 30×30 boards. GLM was the most aggressive slider in the field, with over 800,000 total moves, but stalled when it ran out of positive plays.
The cautionary tales
DeepSeek sent malformed data every round and produced zero useful output. Muse Spark went the opposite direction and made things actively worse: it claimed every valid word it could find, ignoring the penalty for short words. On 30×30 boards with hundreds of short valid words visible, Muse carpet-bombed the dictionary. Final cumulative score: minus 15,309. A version of Muse that simply connected and did nothing would have scored 15,309 points higher.
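A toy comparison shows how claiming everything can end up below zero. The word list is invented, and the `length - 6` rule is an assumption inferred from the two score values the article quotes, not the contest's confirmed formula.

```python
from typing import Iterable


def word_score(word: str) -> int:
    # ASSUMED scoring rule: length minus 6.
    return len(word) - 6


def total(words: Iterable[str]) -> int:
    return sum(word_score(w) for w in words)


visible = ["cat", "dog", "tree", "puzzles", "absolute"]  # made-up sample

claim_everything = total(visible)                                # -5
claim_selectively = total(w for w in visible if word_score(w) > 0)  # 3
do_nothing = total([])                                           # 0
```

Even in this tiny sample, three short words wipe out the gains from two long ones, and doing nothing beats claiming everything.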
Why this matters for practitioners
A few practical takeaways for anyone deploying these models on structured tasks:
- Read the spec, all of it. Muse faithfully executed a partial reading of the rules: it saw “claim valid words” and missed “short words cost points.” That’s a real failure mode in production.
- Open-weights models are catching up fast. Kimi K2.6 is publicly available from Moonshot AI, and Xiaomi says MiMo V2.5 weights will be released soon.
- Strategy matters as much as raw capability. GPT-5.5 placed third with a more conservative approach than Kimi’s brute force. Different problems reward different policies.
Full move logs and methodology are available at the original source.