Why small AI models cramp their own thinking

Small language models start out handicapped before training even begins, and a new paper flagged on Hacker News proposes a fix. The researchers identify a geometric flaw they call “embedding condensation,” in which a small model’s token representations collapse into a narrow, cone-shaped slice of the available space. The paper, which climbed to 169 points in the Research category on Hacker News, then introduces a training objective called “dispersion loss” to push those representations back apart. The payoff: smaller models start behaving more like their larger siblings without adding a single parameter.

Here’s the core idea. Inside a Transformer, every token gets turned into a vector. As those vectors pass through layer after layer, you want them pointing in varied directions so the model can tell tokens apart. In small models, they don’t. They bunch up toward near-parallel directions, which wastes the representation space and dulls the model’s expressivity.
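One simple way to put a number on this bunching is the average cosine similarity between token vectors: values near 1 mean the vectors point almost the same way, while values near or below 0 mean they are spread out. The sketch below is a minimal illustration in plain Python, not the paper's measurement code. The function names and the toy 2-D vectors are assumptions made up for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all distinct pairs of vectors."""
    n = len(vectors)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

# Toy 2-D "token vectors", invented purely for illustration.
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
condensed = [[1.0, 0.0], [1.0, 0.1], [1.0, -0.1], [1.0, 0.2]]

print(f"spread:    {mean_pairwise_cosine(spread):.3f}")
print(f"condensed: {mean_pairwise_cosine(condensed):.3f}")
```

In a real model the vectors would be hidden states taken from a given layer, and the same statistic computed layer by layer would show whether similarity drifts upward with depth.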

📊 What the researchers found

The team ran cosine-similarity heatmaps across several model families and saw a clean pattern:

  • Smaller models condense more. GPT2 and Qwen3-0.6B showed token similarities creeping positive in deeper layers.
  • Larger models resist it. GPT2-xl and Qwen3-32B held their representations apart.
  • The trend held across datasets. Wikitext, PubMedQA, IMDB, and SQuAD all showed the same pattern, with the relationship between model size and condensation measured by Spearman correlation and Kendall’s Tau.
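For readers unfamiliar with the two rank statistics named above, here is a rough sketch of how they could be computed for a size-versus-condensation comparison. The numbers are placeholders invented for this example, not results from the paper, and the helper names are hypothetical. Both functions assume there are no tied values.

```python
def ranks(values):
    """Rank values from 1 to n (assumes no ties, true for the toy data)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(x, y):
    """Spearman's rho via the no-ties formula 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            score += (s > 0) - (s < 0)
    return score / (n * (n - 1) / 2)

# PLACEHOLDER values, not taken from the paper: model size in millions of
# parameters and a made-up mean cosine similarity for each model.
params_millions = [100, 300, 800, 1500]
mean_cosine = [0.40, 0.25, 0.30, 0.10]

print(f"Spearman rho: {spearman(params_millions, mean_cosine):.3f}")
print(f"Kendall tau:  {kendall_tau(params_millions, mean_cosine):.3f}")
```

A strongly negative value from either statistic would indicate that condensation tends to fall as model size rises. In practice a library routine that handles ties would be a safer choice than these hand-rolled versions.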

To rule out coincidence, they pre-trained four GPT2-like models that differed only in MLP dimension, keeping layers, embedding size, dataset, and training config identical. The bigger the MLP, the less condensation. Model size itself was driving the effect.

Two findings stand out as counterintuitive. First, condensation shows up right at model initialization, before any training. Checkpoints of Olmo-3-1025-7B revealed that pre-training actually eases the problem over time rather than worsening it. Second, knowledge distillation from a bigger teacher model doesn’t transfer the resistance. You can’t simply copy a large model’s good behavior into a small one.

🔧 The fix: dispersion loss

Dispersion loss is a regularizer that spreads token embeddings out across the unit hypersphere, enforcing more uniform angular spacing. It’s inspired by the “Diffuse and Disperse” paper, with practical tweaks that improve numerical stability (a log-sum-exp trick) and stop embeddings from expanding without limit. The team also tested alternative formulations: decorrelation loss, an L2-repel loss, and an orthogonalization loss that only nudges vectors sitting at acute angles.
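The article does not give the exact formula, so the following is only one plausible shape for such a regularizer, not the paper's definition. It assumes the loss is the log of the mean of exp(cosine / tau) over all pairs of unit-normalized vectors, computed with a log-sum-exp shift for stability. Normalizing to unit length is this sketch's own way of keeping vector norms out of the objective. The temperature `tau`, the function names, and the acute-angle variant are all assumptions for illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit length so only its direction matters."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def pairwise_cosines(vectors):
    """Cosine similarity for every distinct pair of vectors."""
    unit = [normalize(v) for v in vectors]
    return [sum(a * b for a, b in zip(unit[i], unit[j]))
            for i in range(len(unit)) for j in range(i + 1, len(unit))]

def dispersion_loss(vectors, tau=0.5):
    """Assumed form: log(mean(exp(cos_ij / tau))) over distinct pairs.

    Subtracting the maximum before exponentiating (log-sum-exp) avoids
    overflow when tau is small.
    """
    scaled = [c / tau for c in pairwise_cosines(vectors)]
    m = max(scaled)
    return m + math.log(sum(math.exp(s - m) for s in scaled) / len(scaled))

def acute_angle_penalty(vectors):
    """Assumed orthogonalization-style variant: mean of max(0, cos)^2.

    Pairs at right or obtuse angles contribute nothing.
    """
    cosines = pairwise_cosines(vectors)
    return sum(max(0.0, c) ** 2 for c in cosines) / len(cosines)

# Toy 2-D vectors, invented for illustration.
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
condensed = [[1.0, 0.0], [1.0, 0.1], [1.0, -0.1], [1.0, 0.2]]

for name, vecs in (("spread", spread), ("condensed", condensed)):
    print(f"{name}: dispersion={dispersion_loss(vecs):.3f}, "
          f"acute={acute_angle_penalty(vecs):.3f}")
```

Under these assumptions the condensed set scores higher on both terms, so minimizing either one would push near-parallel vectors apart. In a training loop the term would typically be added to the task loss with a small weight, which is another choice the article does not specify.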

Applied during mid-training and pre-training, dispersion loss reversed the collapse. Starting from already-condensed embeddings, standard mid-training barely moved the needle. Adding dispersion loss as a regularizer substantially opened the representations back up.
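To make the "opening back up" concrete, the toy below runs plain gradient descent on a dispersion-style objective, starting from three nearly parallel vectors. It is a stand-in for the idea only: it optimizes the vectors directly rather than training a language model, it uses an assumed loss of the form log(mean(exp(cosine / tau))), and the step size, step count, and finite-difference gradient are arbitrary choices made for this sketch.

```python
import math

def pair_cosines(flat, dim):
    """Cosines between all distinct pairs of vectors stored in a flat list."""
    vecs = [flat[k:k + dim] for k in range(0, len(flat), dim)]
    unit = []
    for v in vecs:
        norm = math.sqrt(sum(a * a for a in v))
        unit.append([a / norm for a in v])
    return [sum(a * b for a, b in zip(unit[i], unit[j]))
            for i in range(len(unit)) for j in range(i + 1, len(unit))]

def dispersion_loss(flat, dim, tau=0.5):
    """Assumed loss: log(mean(exp(cos / tau))), stabilized with log-sum-exp."""
    scaled = [c / tau for c in pair_cosines(flat, dim)]
    m = max(scaled)
    return m + math.log(sum(math.exp(s - m) for s in scaled) / len(scaled))

def numeric_grad(f, x, eps=1e-5):
    """Central finite-difference gradient; fine for a handful of parameters."""
    grad = []
    for k in range(len(x)):
        up = x[:k] + [x[k] + eps] + x[k + 1:]
        down = x[:k] + [x[k] - eps] + x[k + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

def mean(values):
    return sum(values) / len(values)

DIM = 2
# Three nearly parallel 2-D vectors, flattened; invented for illustration.
start = [1.0, 0.2, 1.0, -0.1, 1.0, 0.05]

def objective(p):
    return dispersion_loss(p, DIM)

params = list(start)
for _ in range(200):  # arbitrary number of steps
    grad = numeric_grad(objective, params)
    params = [p - 0.1 * g for p, g in zip(params, grad)]  # arbitrary step size

print(f"mean cosine before: {mean(pair_cosines(start, DIM)):.3f}")
print(f"mean cosine after:  {mean(pair_cosines(params, DIM)):.3f}")
```

The mean pairwise cosine falls as the vectors fan out. Whether the same dynamics carry over to a full model, where the regularizer competes with the language-modeling loss, is exactly what the paper's experiments are meant to test.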

💡 Why it matters for practitioners

This reframes a comfortable assumption. Bigger models aren’t better only because they have more parameters. Part of their edge comes from how they organize their representation space, and that organization can be encouraged directly.

For anyone training or fine-tuning small models, the practical takeaways are concrete:

  • A cheap regularizer may narrow the gap. Dispersion loss targets a specific structural weakness without growing the parameter count, which matters if you’re deploying on tight compute or edge hardware.
  • Don’t lean on distillation alone for this. The paper shows it won’t fix condensation, so pair it with a dispersion-style objective if representation quality is the goal.
  • Watch initialization, not just training curves. Since the problem is present from step zero, the way you set up a small model may deserve as much attention as how you train it.

A fair caveat: the researchers frame this as an observation-driven improvement, and the headline evidence sits on the geometry of embeddings plus qualitative before-and-after results. The heavier quantitative gains live in the full paper, and how much dispersion loss lifts real downstream benchmarks across model families is the number practitioners will want to pin down next.

The deeper message is worth sitting with. Scale buys better representation geometry, but geometry might be trainable on its own terms. If that holds up, the ceiling for small models could be higher than we assumed. Full methodology and the quantitative tables are available at the original source.
