AI Consistency Solved: The LLM Batch Size Breakthrough

It looks like one of the most frustrating problems with large language models might have a solution. You know how you can ask an LLM the exact same question twice and get two different answers? That’s called non-determinism, and it makes models difficult to debug, audit, or rely on for consistent tasks. Now a team at Thinking Machines has published research that pinpoints the real cause, and it’s not what most people thought.

I was blown away when I read their findings! For a long time, the common belief was that this randomness came from floating-point rounding errors combined with the way GPUs run many calculations in parallel. While those things play a part, the real culprit, according to the team, is something much simpler: the “batch size,” which keeps changing from one request to the next.

Think of it like a carpool. When you send a prompt to an AI, it gets bundled with other people’s prompts into a “batch” so they can be processed efficiently. The size of that carpool constantly changes depending on how busy the system is. The team found that this changing batch size quietly alters the order of the tiny mathematical calculations inside the model. Because floating-point addition rounds at every step, adding the same numbers in a different order can give a slightly different result, so the same prompt can produce different outputs from one run to the next.
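To make the carpool idea concrete, here’s a small Python sketch of why order matters. It is only an illustration, not the team’s code: the `chunked_sum` helper, the sample values, and the chunk sizes are all made up for this example. It adds the same four numbers twice, grouping them differently each time, the way a computation might be split up differently for different batch sizes.

```python
def chunked_sum(values, chunk_size):
    """Sum each chunk left to right, then add up the partial sums."""
    partials = []
    for start in range(0, len(values), chunk_size):
        partial = 0.0
        for v in values[start:start + chunk_size]:
            partial += v  # each addition rounds to the nearest float
        partials.append(partial)

    total = 0.0
    for p in partials:
        total += p
    return total


# Hypothetical values chosen so the rounding is easy to see.
values = [1e17, 1.0, -1e17, 1.0]

one_chunk = chunked_sum(values, chunk_size=4)   # all four numbers in one pass
two_chunks = chunked_sum(values, chunk_size=2)  # two pairs, then combined

print(one_chunk)   # 1.0
print(two_chunks)  # 0.0
print(one_chunk == two_chunks)  # False
```

Same numbers, different grouping, different answer. Neither result matches the exact sum of 2.0, and that’s the point: the rounding depends on the order. Inside a model the differences are far smaller, but they can still be enough to change which token comes out on top.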

So, what did the team do about it? They developed a way to make the process consistent, no matter the batch size. Here’s a breakdown of their findings.

💡 The Big Breakthroughs

The Problem is the Batch: The core discovery is that varying batch sizes change the order of operations. A bigger batch means a different sequence of internal math compared to a smaller one, which can ultimately change the next word the model predicts. It’s a tiny ripple that causes a huge wave of inconsistency.
The Fix is Consistency: The solution proposed by the team is to make the processing “batch invariant.” This means forcing the model to handle its internal math in the exact same order, every single time, regardless of whether the carpool has 2 prompts or 200. It can cost some speed, but the consistency it creates is invaluable.
The Results are Perfect: Did it work? Absolutely. The team ran a test using a Qwen 3 model. They generated 1,000 completions for the same prompt with the temperature (the randomness setting) at zero. Without their fix, they got 80 unique answers. With their fix enabled, all 1,000 completions were identical. That’s perfect reproducibility!
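Here’s a toy Python sketch of what “batch invariant” means in practice. Treat it as an illustration under made-up assumptions, not the team’s actual implementation: the function names, the rule that ties chunk size to batch size, and the fixed chunk size of 2 are all hypothetical. It runs the same calculation under several batch sizes and counts how many distinct results come back, a miniature version of counting unique completions.

```python
def ordered_sum(values):
    """Add values strictly left to right."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, chunk_size):
    """Sum fixed-size chunks, then add the partial sums in order."""
    partials = [
        ordered_sum(values[i:i + chunk_size])
        for i in range(0, len(values), chunk_size)
    ]
    return ordered_sum(partials)


def batch_dependent_sum(row, batch_size):
    # Hypothetical heuristic: how the work is split depends on the batch size.
    chunk_size = min(len(row), batch_size)
    return chunked_sum(row, chunk_size)


FIXED_CHUNK_SIZE = 2  # assumption for this sketch


def batch_invariant_sum(row, batch_size):
    # batch_size is deliberately ignored: the split is always the same.
    return chunked_sum(row, FIXED_CHUNK_SIZE)


row = [1e17, 1.0, -1e17, 1.0]  # made-up values that expose rounding
batch_sizes = [1, 2, 4, 8]

dependent_results = {batch_dependent_sum(row, b) for b in batch_sizes}
invariant_results = {batch_invariant_sum(row, b) for b in batch_sizes}

print(len(dependent_results))  # 2 distinct results across batch sizes
print(len(invariant_results))  # 1 result, whatever the batch size
```

The invariant version isn’t more accurate; it’s just the same every time. The trade-off is that it gives up the freedom to pick whichever split is fastest for the current batch, which is where the speed cost comes from.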

This is a massive step forward for making AI more dependable and trustworthy, especially for scientific research, software development, and any field where getting the same output for the same input is critical.

For a much deeper dive into the math and methodology, you should definitely check out the full research post from Thinking Machines.
