New data: Google’s TPU stack hits 16B tokens/min

Token volume on Google’s Gemini Enterprise platform jumped from 10 billion per minute in January to 16 billion per minute now (a 60% increase), with enterprise users climbing 40% quarter over quarter. That stat dropped at the end of a sit-down interview with Google Cloud CEO Thomas Kurian, and it reframes the whole “who’s compute-constrained” conversation.

Matthew Berman ran the interview, and Kurian laid out exactly why Google isn’t sweating capacity the way other frontier labs are. The short version: when you own the silicon, the data centers, the energy contracts, and the model, the unit economics stay friendly even when demand goes vertical.

📊 The capacity edge, broken down

Kurian walked through the long game Google has been playing for over a decade:

  • 8 generations of TPU shipped (about 11 years of in-house silicon work)
  • Cloud is roughly half of Alphabet’s capex, with Gemini driving a big chunk of growth
  • Data centers are now manufactured in modular units, not built on-site, which slashes deploy time
  • Locked-in real estate, diversified energy sources, and behind-the-meter power generation
  • Best-in-class PUE (power usage effectiveness), so less of the power drawn is lost to cooling and other overhead before it reaches compute
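
Quick aside on that last bullet: PUE is total facility energy divided by the energy that actually reaches IT equipment, so 1.0 would mean zero overhead. Here is a minimal sketch of the arithmetic. The function names and the kWh figures are made up for illustration and do not come from the interview.

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power usage effectiveness: total facility energy / IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh


def overhead_share(pue_value: float) -> float:
    """Fraction of total facility energy that never reaches IT equipment."""
    return 1.0 - 1.0 / pue_value


# Illustrative figures only (assumptions, not numbers from the interview).
efficient = pue(total_facility_kwh=1_100, it_equipment_kwh=1_000)
typical = pue(total_facility_kwh=1_500, it_equipment_kwh=1_000)

print(f"efficient site: PUE {efficient:.2f}, overhead {overhead_share(efficient):.0%}")
print(f"typical site:   PUE {typical:.2f}, overhead {overhead_share(typical):.0%}")
```

The closer the ratio gets to 1.0, the more of every power contract ends up running chips instead of chillers, which is why it belongs on a capacity list.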

Kurian’s line that stuck with me: “It’s better to have your own chips and demand than not having your own chips.”

🛠 Three practical applications worth stealing

The conversation surfaced concrete enterprise wins:

  1. Signal Iduna (German health insurer): Gemini agents cut research time on patient eligibility questions from 23 minutes to a few seconds, with zero layoffs.
  2. American Society of Clinical Oncology: AI surfaces standard-of-care guidelines for 51,000 oncologists, handling overlapping rules (like “can’t prescribe this chemo if patient is also diabetic”) with no hallucinations allowed.
  3. Citi wealth advisor: Gemini’s reasoning and task management give average earners access to advice that used to be locked behind private banking.

⚙️ The 8th gen TPU split

For the first time, Google split the chip family: 8T for training, 8i for inference. Why? Because inference workloads are exploding (especially agents that run for hours and need persistent KV cache), and the optimization targets are different. The 8i can even run air-cooled so it deploys in more locations, which matters for latency-sensitive agent work.

💡 Tips and pitfalls from the interview

  • Don’t measure engineer productivity in lines of code. Senior engineers ship less code, more function. Kurian’s team uses an internal coding harness called “jet ski” and pairs it with peer review plus AI security scans.
  • Watch the agent VM cost trap. Consumers can’t afford virtual machines running indefinitely, so the next big bottleneck is activating and deactivating VMs cleanly per task.
  • Open source libraries are the first attack surface. Adversaries with capable models scan popular repos first. Continuous red-teaming agents (not monthly audits) are the new baseline.
  • On Mythos and 10-trillion-parameter models: Google says disaggregated serving on TPU handles the largest dense models efficiently, so size isn’t the bottleneck people think it is.
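
On the agent VM cost trap above, the core idea is to tie a VM's lifetime to a single task so it can never idle on the meter. Here is a rough sketch of that pattern. `FakeVM`, `vm_for_task`, and the hourly price are all hypothetical stand-ins; nothing here is Google's implementation or a real cloud API.

```python
import time
from contextlib import contextmanager


class FakeVM:
    """Hypothetical stand-in for a cloud VM; no real provider API is called."""

    def __init__(self, price_per_hour: float):
        self.price_per_hour = price_per_hour
        self.running = False
        self.billed_seconds = 0.0
        self._started_at = 0.0

    def start(self) -> None:
        self.running = True
        self._started_at = time.monotonic()

    def stop(self) -> None:
        # Only bill time for which the VM was actually active.
        if self.running:
            self.billed_seconds += time.monotonic() - self._started_at
            self.running = False

    @property
    def cost(self) -> float:
        return self.billed_seconds / 3600 * self.price_per_hour


@contextmanager
def vm_for_task(vm: FakeVM):
    """Activate the VM for one task and deactivate it even if the task fails."""
    vm.start()
    try:
        yield vm
    finally:
        vm.stop()


vm = FakeVM(price_per_hour=0.50)  # assumed price, for illustration only
with vm_for_task(vm):
    time.sleep(0.01)  # placeholder for the agent's actual work

print(f"running={vm.running}, billed={vm.billed_seconds:.3f}s, cost=${vm.cost:.6f}")
```

The `finally` block is the whole point: a crashed or abandoned task still releases the machine, so cost tracks work done rather than wall-clock time.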

One more nugget I loved: Citadel and the Department of Energy are now using TPUs for non-AI workloads like algorithmic trading, because traditional CPU compute hit the Moore’s law wall.

Watch the full conversation for the deeper dive on Anthropic as a customer-competitor, the cybersecurity strategy with Wiz and Mandiant, and what the next bottleneck looks like.
