Trying to keep up with AI news feels like drinking from a firehose, right? Well, get ready for a tidal wave. I just finished watching the livestream from the xAI team, and they officially unveiled Grok 4. It’s not just an update; it’s a whole new level of intelligence.
The presenters explained that the leap from Grok 2 to Grok 4 involved a staggering 100x increase in training compute. But here’s the kicker: the secret sauce for Grok 4 was a massive focus on post-training with reinforcement learning, using what they call “verifiable rewards.”
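If “verifiable rewards” sounds abstract, here is a minimal sketch of the general idea: the model only gets rewarded when its answer can be checked automatically against a known-correct one. This is an illustration, not xAI's actual setup; the function name `verifiable_reward` and the normalization rules are assumptions made up for the example.

```python
def verifiable_reward(model_answer: str, reference_answer: str) -> float:
    """Hypothetical reward: 1.0 if the answer matches the reference, else 0.0."""

    def normalize(text: str) -> str:
        # Ignore surrounding whitespace, letter case, and a trailing period.
        return text.strip().lower().rstrip(".")

    return 1.0 if normalize(model_answer) == normalize(reference_answer) else 0.0


if __name__ == "__main__":
    print(verifiable_reward(" 42. ", "42"))  # matching answer
    print(verifiable_reward("41", "42"))     # wrong answer
```

The point is that the reward comes from a check anyone can rerun, rather than from a human's opinion of the answer.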
🧠 Superhuman Reasoning on Display
To prove its smarts, the team at xAI threw Grok 4 at the notoriously difficult “Humanity’s Last Exam” (HLE), a benchmark so hard that top AI models scored only in the single digits when it launched. These are PhD-level problems that most human experts can’t solve.
Here’s how Grok 4 did:
- Without tools: It solved over a quarter of the problems (~27%). Impressive.
- With tools: It got even better.
- Grok 4 Heavy: This is where it gets wild. The team explained that this is a multi-agent version where several instances of Grok tackle a problem in parallel, compare notes, and settle on the best solution. This version solved over 50% of the HLE problems. Insane.
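To make the multi-agent idea concrete, here is a rough sketch of one simple way several solvers could “compare notes”: each one answers independently, and the most common answer wins. xAI did not describe the real mechanism at this level of detail, so the majority vote, the function `solve_with_agents`, and the toy agents below are all assumptions for illustration.

```python
from collections import Counter
from typing import Callable, Sequence


def solve_with_agents(problem: str, agents: Sequence[Callable[[str], str]]) -> str:
    """Hypothetical aggregator: ask every agent, return the most common answer."""
    if not agents:
        raise ValueError("at least one agent is required")
    answers = [agent(problem) for agent in agents]
    # most_common(1) returns the answer with the highest vote count;
    # ties go to whichever answer was seen first.
    winner, _votes = Counter(answers).most_common(1)[0]
    return winner


if __name__ == "__main__":
    # Toy stand-ins for model instances: two agree, one disagrees.
    toy_agents = [lambda p: "4", lambda p: "4", lambda p: "5"]
    print(solve_with_agents("What is 2 + 2?", toy_agents))
```

A real system would presumably let the agents exchange reasoning rather than only vote, but even this simple version shows why several independent attempts can beat a single one.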
⚙️ It’s Not Just Book-Smart
The team didn’t just stop at benchmarks. They showed off some real-world applications that blew me away.
- 📌 Advanced Tool Use: The team demonstrated Grok 4 building a visualization of two black holes colliding by searching the web and reading academic texts. It even predicted World Series odds by analyzing betting markets to find its own “alpha.”
- 📌 The Vending Machine Challenge: They had folks from Andon Labs test Grok 4 on a simulation where it had to run a vending machine business. The result? It generated double the net worth of the next-best model because it could stick to a long-term strategy.
- 📌 Real-Time X Integration: One of Grok’s biggest advantages is its native access to X (Twitter). A presenter asked it to find the “xAI employee that has the weirdest profile photo,” and it picked Greg Yang. It can work with fuzzy concepts like “weird”!
✨ New Features & API Access
xAI is making this power accessible right away.
- ✅ API Available Now: Developers can start building with Grok 4 immediately. It has a 256k context window and real-time data search.
- ✅ Improved Voice Mode: They showcased new, incredibly natural-sounding voices with much lower latency, making conversations feel snappy and smooth.
- ✅ Dominant Benchmarks: On top of the HLE, Grok 4 scored a perfect 100% on the AIME math exam and solidly beat competitors like Claude 4 Opus and Gemini 2.5 Pro on several other key reasoning benchmarks.
🚀 What’s Next?
The team was clear that this is just the beginning. They openly admitted that image understanding is a current weakness, but said their next foundation model (version 7) is already training to fix that. They even dropped a roadmap:
- August: New Coding Model
- September: Multimodal Agent
- October: Video Generation
I think the ability for the “Heavy” version to use multiple agents is a true game-changer for complex problem-solving. This is one of the most exciting AI releases I’ve seen all year.
For the full deep-dive and to see all the demos for yourself, make sure to watch xAI’s original livestream!