Leanstral 1.5: Open-source AI for Formal Verification

A new open-source model is aiming to make formal math verification something regular developers can actually use. Leanstral 1.5 launched today, and according to Hacker News, it’s a free, Apache-2.0 licensed model built for proof engineering in Lean 4. The headline number that got people talking: 119 billion total parameters, but only 6 billion active at any time. That’s a lot of capability without the usual compute bill.

What stands out here is the cost angle. Formal verification has always been powerful but mostly locked away behind expensive, research-grade setups. Leanstral 1.5 is trying to change who gets to play.

What it does

Leanstral 1.5 is built to prove mathematical theorems and verify code properties. Per Hacker News, the release covers:

Saturates miniF2F at 100% on both validation and test sets, a cross-system benchmark spanning elementary math up to IMO-level problems.
Solves 587 of 672 PutnamBench problems, the brutal Putnam competition set that demands long proof chains.
Sets a new state of the art on FATE-H (87%) and FATE-X (34%), graduate and PhD-level abstract algebra benchmarks covering group, ring, and module theory.
Finds real bugs. Across 57 open-source repositories, it flagged 47 violated properties, 11 of which were genuine bugs. Five had never been reported on GitHub.

The cost story

This is where the launch gets interesting. On PutnamBench, Leanstral edges out Seed-Prover 1.5 by seven problems, but the price gap is enormous. Leanstral runs at about $4 per problem. Seed-Prover’s high setting is estimated at $300 or more, because it burns a budget of 10 H20-days per problem. Another competitor, Aleph Prover, runs $54 to $68 per problem.

Same class of results, a fraction of the spend. That’s the pitch, and it’s a strong one.
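
To put those per-problem figures side by side, here is a small arithmetic sketch. The dollar amounts are the ones quoted above; the dictionary layout and the function name are illustrative assumptions, and Seed-Prover's $300 is treated as a lower bound because the estimate is "$300 or more".

```python
# Per-problem cost ranges in USD as quoted in the article: (low, high).
# Seed-Prover's figure is "$300 or more", so 300 is only a lower bound here.
COST_PER_PROBLEM_USD = {
    "Leanstral 1.5": (4.0, 4.0),
    "Aleph Prover": (54.0, 68.0),
    "Seed-Prover 1.5 (high)": (300.0, 300.0),
}


def cost_multiple(system: str, baseline: str = "Leanstral 1.5") -> tuple[float, float]:
    """Return the (low, high) cost of `system` as a multiple of the baseline's cost."""
    base_low, base_high = COST_PER_PROBLEM_USD[baseline]
    low, high = COST_PER_PROBLEM_USD[system]
    # Smallest multiple: cheapest run of `system` against the priciest baseline run.
    # Largest multiple: priciest run of `system` against the cheapest baseline run.
    return low / base_high, high / base_low


for name in COST_PER_PROBLEM_USD:
    low, high = cost_multiple(name)
    print(f"{name}: {low:.1f}x to {high:.1f}x the per-problem cost of Leanstral 1.5")
```

On these figures, Aleph Prover comes out at roughly 13.5 to 17 times Leanstral's per-problem cost, and Seed-Prover's high setting at 75 times or more.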

Leanstral also shows unusual test-time scaling. As the team raised the token budget per attempt from 50k to 4 million, performance climbed the whole way: 44 problems solved at 50k, 244 at 200k, 493 at 1M, and 587 at 4M. Instead of quitting when a proof gets long, the model keeps reasoning and revising across millions of tokens.
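
The scaling curve is easier to read as solve rates. The sketch below converts the four counts above into fractions of PutnamBench. It assumes every count is measured against the full 672-problem set, which the article states only for the final figure; the variable and function names are illustrative.

```python
# PutnamBench size and solved counts per token budget, as quoted in the article.
# Assumption: all four counts are out of the same 672 problems.
PUTNAMBENCH_TOTAL = 672
SOLVED_BY_TOKEN_BUDGET = {
    50_000: 44,
    200_000: 244,
    1_000_000: 493,
    4_000_000: 587,
}


def solve_rates(solved_by_budget: dict[int, int], total: int = PUTNAMBENCH_TOTAL) -> dict[int, float]:
    """Map each per-attempt token budget to the fraction of problems solved."""
    return {budget: solved / total for budget, solved in solved_by_budget.items()}


rates = solve_rates(SOLVED_BY_TOKEN_BUDGET)
for budget, rate in rates.items():
    print(f"{budget:>9,} tokens per attempt: {rate:.1%} solved")
```

Under that assumption, the solve rate rises from about 6.5% at 50k tokens to about 87% at 4M.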

How it was trained

Leanstral 1.5 went through mid-training, supervised fine-tuning, and reinforcement learning with a method called CISPO. It learned inside two environments. In the first, it’s handed a theorem and told to prove or disprove it, submitting proofs and refining based on Lean compiler feedback until it solves the problem or runs out of budget.
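
The first environment amounts to a propose, check, and refine loop. Here is a minimal sketch of that control flow, not the team's actual training harness. Everything in it is hypothetical: `propose` stands in for the model, `check` stands in for the Lean compiler, the budget is simplified to a number of attempts, and the toy stubs do not invoke Lean at all.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CheckResult:
    ok: bool        # True if the candidate proof was accepted
    feedback: str   # checker output to feed into the next attempt


def prove_with_feedback(
    theorem: str,
    propose: Callable[[str, str], str],
    check: Callable[[str], CheckResult],
    max_attempts: int,
) -> Optional[str]:
    """Submit candidate proofs until one is accepted or the budget runs out."""
    feedback = ""
    for _ in range(max_attempts):
        candidate = propose(theorem, feedback)
        result = check(candidate)
        if result.ok:
            return candidate
        # Carry the checker's message into the next proposal.
        feedback = result.feedback
    return None  # budget exhausted without an accepted proof


# Toy stand-ins (hypothetical): a "model" that improves once it sees feedback,
# and a "checker" that accepts exactly one string.
def toy_propose(theorem: str, feedback: str) -> str:
    return "rfl" if feedback else "sorry"


def toy_check(candidate: str) -> CheckResult:
    if candidate == "rfl":
        return CheckResult(ok=True, feedback="")
    return CheckResult(ok=False, feedback="declaration uses 'sorry'")


statement = "theorem t (n : Nat) : n + 0 = n"
print(prove_with_feedback(statement, toy_propose, toy_check, max_attempts=3))
```

With the toy stubs, the first attempt is rejected, the feedback changes the second proposal, and the loop returns the accepted proof. With `max_attempts=1` it returns `None`, the out-of-budget case.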

The second is more like watching a developer work. Leanstral edits files, runs bash commands, and uses the Lean language server to inspect goals and errors in real time. That lets it handle long-horizon jobs like completing partial proofs in a repository and building helper lemmas over many rounds.

Two case studies worth noting

Hacker News highlights two concrete demonstrations. In one, Leanstral proved the O(log n) time complexity guarantees for a real AVL tree implementation. That run went over 2.7 million tokens across 22 context compactions, working through structural induction and exhaustive case analysis for rebalancing.

In the other, the team wired up an automated bug hunter. Aeneas translates Rust code into Lean, Leanstral infers the intended behavior, then tries to prove correctness properties or their negations. One find: a sign function in the datrs/varinteger library overflowed on input U64.MAX, a flaw that hadn’t been reported.
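
To make "prove correctness properties or their negations" concrete, here is a minimal Lean 4 sketch. The function `double` and both theorems are hypothetical toy examples, not code from datrs/varinteger or output from Aeneas. The sketch assumes a recent Lean 4 toolchain and uses only core library lemmas.

```lean
-- Hypothetical toy function for illustration; not taken from datrs/varinteger.
def double (n : Nat) : Nat := n + n

-- An intended property that does hold, so it is proved directly.
theorem double_eq_two_mul (n : Nat) : double n = 2 * n :=
  (Nat.two_mul n).symm

-- A plausible-looking property that fails at n = 0, so its negation is proved
-- by exhibiting that counterexample.
theorem double_not_always_pos : ¬ (∀ n : Nat, 0 < double n) :=
  fun h => Nat.lt_irrefl 0 (h 0)
```

A proved negation is the bug report: it comes with a concrete input on which the claimed property fails, much as U64.MAX did in the case above.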

Where it lands

A few honest caveats. The provers ranked above Leanstral operate under different conditions, some with natural-language proof guidance, so it’s not a clean apples-to-apples sweep at the very top. And FATE-X at 34% shows there’s real headroom left in the hardest graduate-level math.

Still, the direction matters. A free, openly licensed model that matches expensive systems at a fraction of the cost, and that also catches real bugs in real code, pushes formal methods closer to everyday engineering. The team also fully open-sourced FLTEval, its benchmark based on real pull requests from the Fermat’s Last Theorem repository, where Leanstral beat Opus 4.6 at one-seventh the cost.

For anyone watching AI move from writing code to proving it correct, this is a release to keep an eye on. Full details are available at the original source.
