AI forecasters have crossed a line the “extending-lines-on-graphs” crowd has been pointing at for years: bots that, on some scoreboards, beat human experts at predicting the future. According to Astral Codex Ten, this year’s prediction market conference wasn’t about the industry going legit or the President’s son joining as an advisor. All eyes were on AI superforecasters, and the numbers people shared were wild. One founder said his AI turned $35 into $2 million on Kalshi in seven months. Another claimed a market-neutral portfolio beating the stock market by 25%.
What stands out here is how ordinary the milestone looks in practice. No dramatic paper, no manifesto. Just AIs quietly making money on prediction markets and outrunning benchmarks by a comfortable margin. That’s what “machines beat humans at forecasting” actually looks like when it arrives.
What an AI superforecaster actually is
Strip away the hype and it’s a frontier model (ChatGPT, Claude, Gemini) wrapped in a “scaffold.” The scaffold hand-holds the model through a long research process: spawning subagents, reading sources, weighing evidence, and pushing it toward a calibrated probability. It feels like using any AI, just slower and pricier because it’s doing far more work.
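To make the scaffold idea concrete, here is a minimal Python sketch. It assumes a design in which each subagent is simply a function that researches a question and returns its sources plus a probability. The names `SubagentReport` and `run_scaffold`, and the median-then-clamp pooling step, are hypothetical illustrations. They are not how FutureSearch or any other vendor actually builds its product, and a real scaffold would make model API calls where the stubs sit.

```python
from dataclasses import dataclass
from statistics import median
from typing import Callable, List, Tuple


@dataclass
class SubagentReport:
    """What one (hypothetical) research subagent hands back."""
    sources: List[str]   # citations the subagent read
    probability: float   # its own estimate, between 0 and 1


# A subagent takes the question text and returns a report.
# In a real scaffold this would wrap a model call; here it is any callable.
Subagent = Callable[[str], SubagentReport]


def run_scaffold(
    question: str,
    subagents: List[Subagent],
    floor: float = 0.01,
    ceiling: float = 0.99,
) -> Tuple[float, List[str]]:
    """Fan the question out to subagents, pool their estimates, keep the source trail."""
    reports = [agent(question) for agent in subagents]

    # Pool with the median so one overconfident subagent cannot drag the answer far.
    pooled = median(r.probability for r in reports)

    # Clamp away from 0 and 1: "impossible" or "certain" is rarely a calibrated answer.
    pooled = min(max(pooled, floor), ceiling)

    # Merge citations, dropping duplicates but keeping first-seen order.
    seen = set()
    trail = []
    for r in reports:
        for src in r.sources:
            if src not in seen:
                seen.add(src)
                trail.append(src)
    return pooled, trail


if __name__ == "__main__":
    # Three stubbed subagents with placeholder sources and made-up estimates.
    stubs = [
        lambda q: SubagentReport(["source-a", "source-b"], 0.05),
        lambda q: SubagentReport(["source-b", "source-c"], 0.09),
        lambda q: SubagentReport(["source-d"], 0.07),
    ]
    p, sources = run_scaffold("Will US colds be cut in half by 2040?", stubs)
    print(p, sources)
```

The median is one defensible pooling choice among several; a production scaffold would more likely have a final model pass that reads all the reports and weighs the evidence itself.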
Astral Codex Ten ran a live test with FutureSearch, one of the companies claiming to beat the market. The question: what’s the chance US colds get cut in half by 2040? In five minutes, the AI deployed three subagents, read 16 websites, cited 212 sources, and landed on 7%. It cost $8 in credits.
The reasoning was the impressive part. It laid out a conjunctive chain where everything has to go right:
- The biology is brutal (200+ cold viruses, 50 years of failed vaccine efforts)
- The lead project’s own timeline runs 5 to 7+ years out on a tight budget
- Adoption is a wildcard for a mild illness people won’t medicate against
- Measurement may be impossible, since the US has no routine cold surveillance
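The arithmetic behind a conjunctive chain is multiplication. If every link has to hold, the joint probability is the product of the links, so even moderately likely steps compound into a small number. The per-step probabilities below are made-up placeholders chosen for illustration, not the figures FutureSearch assigned. The sketch also assumes the steps are independent, which a real forecaster would adjust for.

```python
import math

# Hypothetical probability that each link in the chain holds (illustrative only).
links = {
    "biology problem solved (a broad cold vaccine or antiviral works)": 0.30,
    "lead project delivers on schedule": 0.50,
    "people actually adopt it for a mild illness": 0.60,
    "the reduction can be measured and confirmed": 0.70,
}


def conjunctive_probability(step_probs):
    """P(all steps succeed), assuming the steps are independent."""
    return math.prod(step_probs)


joint = conjunctive_probability(links.values())
weakest = min(links, key=links.get)

print(f"joint probability: {joint:.1%}")
print(f"weakest link: {weakest}")
```

Even with no single step below 30%, the product lands in the single digits. Every extra link can only push it lower.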
The sanity check
One data point proves nothing, so the piece cross-checked. Preseen, a rival superforecaster, returned 8.8%. A human superforecaster gave 5 to 10%. Three independent forecasters, two of them machines, clustered in the same narrow band. That convergence is the real signal, not any single number.
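One simple way to quantify that convergence is to pool the estimates and check their spread. This sketch uses the two machine forecasts from the piece and, as an assumption, 7.5% for the human (the midpoint of the stated 5 to 10% range). It pools them with the geometric mean of odds, one common aggregation method among several. The 5-point "agreement" threshold is a hypothetical rule of thumb, not a standard.

```python
import math

forecasts = {
    "FutureSearch (AI)": 0.07,
    "Preseen (AI)": 0.088,
    "human superforecaster": 0.075,  # assumed midpoint of the stated 5-10% range
}


def pool_geometric_odds(probs):
    """Pool probabilities by averaging their log-odds, then converting back."""
    log_odds = [math.log(p / (1 - p)) for p in probs]
    mean_log_odds = sum(log_odds) / len(log_odds)
    odds = math.exp(mean_log_odds)
    return odds / (1 + odds)


def spread(probs):
    """Gap between the highest and lowest forecast, in probability units."""
    return max(probs) - min(probs)


probs = list(forecasts.values())
pooled = pool_geometric_odds(probs)
gap = spread(probs)
AGREEMENT_THRESHOLD = 0.05  # hypothetical: more than 5 points apart means "dig deeper"

print(f"pooled: {pooled:.1%}, spread: {gap:.1%}")
print("forecasters agree" if gap <= AGREEMENT_THRESHOLD else "forecasters diverge: investigate")
```

Pooling in log-odds space treats 1% versus 2% as a much bigger gap than 50% versus 51%, which matches how small probabilities behave. For three numbers this close together, a plain average or median lands in the same 7 to 9% band.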
Measuring forecasting skill at scale is genuinely hard. You can’t grade it on “percent correct,” because that depends entirely on question difficulty. The workable method is to score forecasters against each other on the same questions, so difficulty hits everyone equally. On Metaculus, which pits AIs against humans on a shared metric, AI was closing in on the “wisdom of crowds” Community Prediction as of May 2026, with Gemini 3.1 leading. Impressive, but still short of elite human pros, at least on that public graph.
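Here is roughly what head-to-head scoring can look like: score each forecaster on the same resolved questions with a proper scoring rule, then compare. This sketch uses the Brier score (mean squared error between the probability and the 0/1 outcome; lower is better) as an assumed stand-in. Metaculus uses its own scoring system, and the forecasts and outcomes below are invented.

```python
from typing import Sequence


def brier_score(forecasts: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between probabilities and 0/1 outcomes (lower is better)."""
    if len(forecasts) != len(outcomes):
        raise ValueError("each forecast needs exactly one resolved outcome")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(outcomes)


def head_to_head(a: Sequence[float], b: Sequence[float], outcomes: Sequence[int]) -> float:
    """Brier score of A minus Brier score of B on the same questions.

    Negative means A was more accurate. Because both answered the same
    questions, question difficulty affects both scores alike.
    """
    return brier_score(a, outcomes) - brier_score(b, outcomes)


# Invented example: three resolved yes/no questions (1 = it happened).
outcomes = [1, 0, 1]
ai_bot = [0.9, 0.2, 0.7]
crowd = [0.6, 0.5, 0.5]

print(f"AI Brier:       {brier_score(ai_bot, outcomes):.3f}")
print(f"Crowd Brier:    {brier_score(crowd, outcomes):.3f}")
print(f"AI minus crowd: {head_to_head(ai_bot, crowd, outcomes):+.3f}")
```

Three questions prove nothing, of course; leaderboards like Metaculus only become meaningful after hundreds of shared questions resolve.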
Why this matters now
Forecasting sits underneath almost every business decision: pricing, hiring, inventory, risk, capital allocation. If an $8 query can match a professional superforecaster, the cost of a calibrated prediction is about to collapse. That reshapes who gets to make good bets, and how often.
Here’s the thing worth watching over the next one to three years. Prediction markets are a rare domain with an unforgiving, real-money scoreboard. You can’t fake alpha on Kalshi. So this becomes one of the cleanest public tests of whether AI reasoning is actually improving or just sounding smarter. The graph goes up or the money disappears.
Practical takeaways
- Treat probabilities, not answers, as the output. The value is calibration, not a confident yes or no. Ask for the reasoning chain and the sources, like the 212-source trail above.
- Cross-check across models. Convergence between two AIs and a human is stronger evidence than any single forecast. Divergence is a flag to dig deeper.
- Pilot on decisions you already make. Demand forecasting, churn, deal close rates. Score the AI against your own track record before trusting it.
- Watch the money, not the marketing. A one-off return of roughly 57,000x ($35 to $2 million) could be luck. Sustained market-beating margins across Kalshi, Polymarket, and equities are harder to fake.
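For the pilot idea, a basic calibration check is a reasonable first test: group past forecasts into probability buckets and compare the average forecast in each bucket with how often those events actually happened. The bucket count, the deal-close framing, and the sample data below are assumptions for illustration. With real data you would want many more resolved forecasts per bucket before drawing conclusions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def calibration_table(
    forecasts: Sequence[float], outcomes: Sequence[int], n_buckets: int = 5
) -> Dict[int, Tuple[float, float, int]]:
    """Map bucket index -> (mean forecast, observed frequency, count).

    In a well-calibrated record, the mean forecast sits close to the
    observed frequency in every bucket.
    """
    buckets: Dict[int, List[Tuple[float, int]]] = defaultdict(list)
    for p, o in zip(forecasts, outcomes):
        # min() keeps p == 1.0 in the top bucket instead of creating an extra one.
        idx = min(int(p * n_buckets), n_buckets - 1)
        buckets[idx].append((p, o))

    table = {}
    for idx in sorted(buckets):
        rows = buckets[idx]
        mean_p = sum(p for p, _ in rows) / len(rows)
        freq = sum(o for _, o in rows) / len(rows)
        table[idx] = (mean_p, freq, len(rows))
    return table


# Invented track record for "will this deal close this quarter?" (1 = closed).
ai_forecasts = [0.1, 0.15, 0.3, 0.35, 0.7, 0.75, 0.9, 0.95]
closed = [0, 0, 1, 0, 1, 1, 1, 1]

for idx, (mean_p, freq, n) in calibration_table(ai_forecasts, closed).items():
    print(f"bucket {idx}: forecast {mean_p:.0%}, happened {freq:.0%} (n={n})")
```

Running the same table over your team's own past forecasts gives the baseline the AI has to beat.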
One caution on the eye-popping returns: turning $35 into $2 million can be skill or a lottery ticket that paid off. The market-neutral results are the more credible tell. Either way, the direction is set. For the full walkthrough and the Metaculus data, the original piece at Astral Codex Ten is worth your time.