Sparse Attention: Subquadratic's LLM Cost Breakthrough

A startup called Subquadratic says it has broken through one of the biggest bottlenecks holding back large language models, according to MIT Tech Review. The company’s system, SubQ, won’t replace today’s top models across the board. But MIT Tech Review reports it could deliver huge gains in speed at a fraction of the usual cost for certain tasks, and its founders believe the approach could eventually reshape how LLMs get built.

“We hope we’re kicking off a new age of efficiency,” cofounder and CEO Justin Dangel told MIT Tech Review. “We don’t think anybody will be building on transformers in a few years.” That’s a bold claim. What stands out is the target: the core math that makes modern AI so expensive to run.

The bottleneck, explained

Nearly every LLM today runs on a neural network called a transformer, the architecture introduced in Google’s 2017 paper “Attention Is All You Need.” The engine inside it is a process called dense attention, and it’s a major reason these models burn so much compute and power on long inputs.

Here’s how it works. When a transformer reads a chunk of text, it splits the text into tokens (words or pieces of words) and turns each one into a list of numbers called a vector. To capture meaning, it then compares every token’s vector against every other token’s. The math piles up fast:

A 10,000-word passage contains close to 50 million distinct word pairs, and each pair has to be scored.
Double the length of the text, and you roughly quadruple the computation.
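To make the doubling rule concrete, here is a minimal sketch of dense attention in plain Python. It is a toy under stated assumptions, not any production model: the function names and the made-up vectors are hypothetical, and each token's vector stands in for both its query and its key, where a real transformer would first pass vectors through learned projections. The point is only to count how many comparisons the all-against-all step requires.

```python
import math


def toy_vectors(n, dim=4):
    """Made-up, deterministic token vectors (an assumption for illustration only)."""
    return [[math.sin(i + j) for j in range(dim)] for i in range(n)]


def dense_attention_weights(vectors):
    """Score every token against every token and count the dot products.

    Simplification: the same vector is used as query and key.
    """
    dim = len(vectors[0])
    dot_products = 0
    weights = []
    for query in vectors:
        scores = []
        for key in vectors:
            # Scaled dot product between one query and one key.
            scores.append(sum(a * b for a, b in zip(query, key)) / math.sqrt(dim))
            dot_products += 1
        # Softmax turns the raw scores into weights that sum to 1.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights, dot_products


for n in (8, 16, 32):
    _, count = dense_attention_weights(toy_vectors(n))
    print(n, count)  # 8 64, then 16 256, then 32 1024
```

Each doubling of the token count multiplies the number of dot products by four. This sketch counts ordered pairs, including each token against itself, so its totals run to about twice the pair figures quoted above.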

That pattern is called quadratic scaling, and it’s why long documents are so costly to process. “If you want to summarize The Great Gatsby, you have to look at the first word and the last word together, and then you have to look at every other combination,” Dangel said in MIT Tech Review’s report.

Picture dots around a circle, each one a token, with lines connecting every pair. Five dots make 10 lines. Ten dots make 45. Twenty dots make 190. The cost climbs much faster than the text grows.
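The dot-and-line picture is the handshake formula: n dots have n(n-1)/2 connecting lines. A short sketch (the function name `pair_count` is just an illustrative choice) reproduces the numbers above and the 10,000-word figure from earlier:

```python
def pair_count(n):
    """Number of distinct pairs among n tokens: n * (n - 1) / 2."""
    return n * (n - 1) // 2


for n in (5, 10, 20, 10_000):
    print(n, pair_count(n))
# 5 10
# 10 45
# 20 190
# 10000 49995000
```

Going from 10,000 to 20,000 tokens raises the count from 49,995,000 to 199,990,000, almost exactly four times as many pairs for twice the text.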

What Subquadratic changes

The fix is in the company’s name. Instead of dense attention, SubQ uses sparse attention, which cuts the number of computations by skipping most of them. Rather than comparing every token with every other token, it picks out only some of the relationships to calculate.

The bet behind it is simple: not every word in a passage actually matters to every other word. On that view, most of those 50 million calculations are noise. If you can reliably keep the connections that carry meaning and drop the rest, you get nearly the same useful output for a small slice of the compute.
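The article does not say how SubQ decides which relationships to keep, so the sketch below swaps in one simple, well-known sparse pattern purely for illustration: a fixed local window, where each token is compared only with neighbors at most `window` positions away. The function names and the window size of 64 are assumptions, not details of SubQ.

```python
def all_pairs(n):
    """Distinct token pairs under dense attention."""
    return n * (n - 1) // 2


def windowed_pairs(n, window):
    """Distinct token pairs whose positions differ by at most `window`.

    Illustrative local-window pattern only; not SubQ's selection rule.
    """
    # Token i pairs with up to `window` later tokens, fewer near the end.
    return sum(min(window, n - 1 - i) for i in range(n))


for n in (10_000, 20_000):
    dense = all_pairs(n)
    sparse = windowed_pairs(n, window=64)
    print(n, dense, sparse, f"{sparse / dense:.2%}")
```

With this pattern the work grows roughly in step with the text: doubling the length about doubles the sparse pair count instead of quadrupling it. The catch is that a fixed window can never connect the first word of a book to the last, which is why the hard part of any sparse scheme is choosing which pairs to keep.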

Sparse attention isn’t a brand-new idea in AI research. What Subquadratic is claiming is a version efficient enough to matter in production, fast enough and cheap enough to challenge the dense-attention default that has ruled since 2017.

Why it matters

Compute cost is the tax on everything in AI right now. It drives the price of every API call, the size of data center buildouts, and the energy bills that have utilities scrambling. Anything that genuinely lowers the cost curve for long-context work hits the industry where it spends the most.

A few things worth watching:

Speed and cost for specific jobs. Subquadratic isn’t pitching SubQ as a universal replacement. The early wins are expected on certain tasks, likely long-document work where quadratic math hurts most.
The architecture question. Dangel’s claim that nobody will build on transformers in a few years is the real headline. Plenty of researchers have hunted for transformer successors, from state-space models to other sparse approaches, and the transformer keeps winning.
Proof at scale. Sparse attention often trades a little accuracy for a lot of speed. The test is whether SubQ holds quality on real workloads, not just benchmarks.

My take: treat the “end of transformers” framing as ambition, not a forecast. The efficiency angle is the part to take seriously. If Subquadratic can show steep cost cuts on long-context tasks without wrecking output quality, it gives teams a real reason to look past dense attention for the first time in years.

For now, the claim is the news, and the proof is still ahead. You can read the full breakdown at MIT Tech Review.
