Stop AI Hallucinations: Probably's Deterministic LLM Validation

A new startup just raised real money to attack one of AI’s most stubborn problems: models that confidently make things up. Probably, founded by Peter Elias, closed a $9 million seed round from Andreessen Horowitz, according to TechCrunch AI. The pitch is simple to say and hard to do: stop hallucinations and basic factual errors from ever reaching the user.

TechCrunch AI reports that Probably is chasing 99.99% accuracy, the kind of reliability you expect from deterministic software but rarely get from a large language model. That target alone tells you the ambition here. Most AI tools today aim to be helpful and mostly right. Probably wants to be right almost all of the time.
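To put the target in numbers, here is the arithmetic. It assumes accuracy is counted per answer, which this article does not specify. Going from 99% to 99.99% cuts the error rate a hundredfold:

```latex
\begin{aligned}
\text{error rate at } 99.99\% &= 1 - 0.9999 = 10^{-4} \;\Rightarrow\; \text{about 1 wrong answer per } 10{,}000,\\
\text{error rate at } 99\% &= 1 - 0.99 = 10^{-2} \;\Rightarrow\; \text{about 100 wrong answers per } 10{,}000.
\end{aligned}
```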

What they actually built

The first product is a data science tool. You point it at a complex dataset, ask a question, and get a quick answer. Each result ships with a citation and an audit trail showing how it was produced. That part is becoming standard across serious AI tools.
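The article doesn't describe the format of these citations or audit trails. As a rough illustration only, here is one way an answer could carry its provenance in Python. `Citation`, `AuditedAnswer`, and the `sales.csv` data are hypothetical names, not Probably's schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Citation:
    """Hypothetical pointer to the exact slice of data an answer came from."""
    source: str              # e.g. a file or table name
    rows: tuple[int, ...]    # row indices that fed the result
    column: str


@dataclass
class AuditedAnswer:
    """Hypothetical shape of an answer that ships with its provenance."""
    question: str
    value: float
    citation: Citation
    audit_trail: list[str] = field(default_factory=list)

    def log(self, step: str) -> None:
        # Record each processing step in the order it happened.
        self.audit_trail.append(step)


# Illustrative data only.
data = {"revenue": [120.0, 80.0, 100.0]}
answer = AuditedAnswer(
    question="What is total revenue?",
    value=sum(data["revenue"]),
    citation=Citation(source="sales.csv", rows=(0, 1, 2), column="revenue"),
)
answer.log("loaded sales.csv")
answer.log("summed column 'revenue' over rows 0-2")
print(answer.value, answer.audit_trail)
```

Keeping the citation as structured fields rather than free text means a reader, or a program, can go back and re-run the exact computation.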

The interesting part is what sits underneath. Elias calls it a ‘data science mech suit.’ Here’s how it works:

The LLM produces a first-pass answer.
A separate deterministic validator checks that answer against the actual dataset.
Anything that doesn’t match gets bounced back before it reaches you.
The model itself was trained against that validator, so the whole loop is tuned for fast, accurate results.
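The loop above can be sketched in a few lines of Python. This is a minimal illustration of the generate, check, and bounce-back pattern, not Probably's code. The dataset, the `propose` callable standing in for the LLM, the operation names, and the feedback message are all hypothetical. The property that matters is that `validate` recomputes the figure from the data instead of asking another model.

```python
from statistics import mean
from typing import Callable, Optional

# Hypothetical dataset the answers must agree with.
DATASET = {"latency_ms": [12, 15, 11, 18, 14]}

# Deterministic operations the validator knows how to recompute.
OPS = {"mean": mean, "max": max, "min": min, "sum": sum}


def validate(claim: dict, data: dict) -> Optional[str]:
    """Recompute the claimed figure from the data.

    Returns None if the claim matches, otherwise a reason string.
    The same claim and data always produce the same verdict; no model is involved.
    """
    op, column = claim.get("op"), claim.get("column")
    if op not in OPS:
        return f"unknown operation {op!r}"
    if column not in data:
        return f"unknown column {column!r}"
    expected = OPS[op](data[column])
    if abs(expected - claim["value"]) > 1e-9:
        return f"{op}({column}) is {expected}, not {claim['value']}"
    return None


def answer_with_validation(propose: Callable, data: dict, max_attempts: int = 3) -> dict:
    """Ask the model, check its answer, and bounce failures back with feedback."""
    feedback = None
    for _ in range(max_attempts):
        claim = propose(feedback)          # first-pass answer from the (stand-in) LLM
        error = validate(claim, data)      # hard, deterministic check
        if error is None:
            return claim                   # only validated answers reach the user
        feedback = error                   # bounced back before the user sees it
    raise RuntimeError(f"no valid answer after {max_attempts} attempts: {feedback}")


def fake_model(feedback):
    # Hypothetical stand-in for an LLM: wrong on the first try, corrected after feedback.
    if feedback is None:
        return {"op": "mean", "column": "latency_ms", "value": 15.0}
    return {"op": "mean", "column": "latency_ms", "value": 14.0}


print(answer_with_validation(fake_model, DATASET))
```

The article doesn't explain how the model was trained against the validator. One common approach is to use a checker's pass/fail result as a training reward, and a function like `validate` could supply that signal unchanged, but that is an assumption here.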

This is the bit worth paying attention to. The validator is not AI guessing whether the AI is right. It’s deterministic code doing a hard check.

Why the small-model angle matters

Elias makes a claim that runs against the industry’s usual logic. ‘The better your harness engineering is, the weaker the model can be,’ he told TechCrunch AI. His reasoning: if you refine the context tightly enough, the model barely has to work to land on the right answer. It’s mostly an exercise in killing ambiguity.
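What "refining the context" looks like inside Probably's harness isn't described. One plausible reading, sketched here under that assumption, is to shrink the model's job: show it only the relevant columns, list the operations it may use, demand a fixed output shape, and reject any reply that strays outside it. `build_prompt`, `parse_reply`, and `ALLOWED_OPS` are hypothetical names.

```python
import json

# Hypothetical whitelist of operations the model is allowed to choose from.
ALLOWED_OPS = ("sum", "mean", "max", "min")


def build_prompt(question: str, schema: dict[str, str]) -> str:
    """Pack only what the model needs: the columns, the operations, the output shape."""
    columns = "\n".join(f"- {name} ({dtype})" for name, dtype in schema.items())
    return (
        f"Question: {question}\n"
        f"Columns you may use:\n{columns}\n"
        f"Operations you may use: {', '.join(ALLOWED_OPS)}\n"
        'Reply with JSON only: {"op": ..., "column": ...}'
    )


def parse_reply(reply: str, schema: dict[str, str]) -> dict:
    """Reject anything outside the narrow space the prompt allowed."""
    plan = json.loads(reply)  # raises ValueError on malformed JSON
    if not isinstance(plan, dict) or set(plan) != {"op", "column"}:
        raise ValueError("reply must be a JSON object with exactly 'op' and 'column'")
    if plan["op"] not in ALLOWED_OPS:
        raise ValueError(f"operation not allowed: {plan['op']!r}")
    if plan["column"] not in schema:
        raise ValueError(f"no such column: {plan['column']!r}")
    return plan


schema = {"region": "str", "revenue": "float"}
print(build_prompt("What is total revenue?", schema))
print(parse_reply('{"op": "sum", "column": "revenue"}', schema))
```

In this sketch the model only chooses an operation and a column; the arithmetic itself is left to deterministic code, which is one way a weaker model could still land on the right answer.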

The practical payoff is cost. Probably’s tool runs on a model Elias describes as ‘four classes weaker than the frontier models.’ That’s weak enough to run on a desktop instead of a data center. No frontier API bills, no per-token meter running every time you ask a question.

The timing is sharp. Token costs are climbing, and plenty of companies are taking a hard look at their AI budgets right now. A tool that runs locally on cheap hardware and still hits high accuracy is an easy sell to a finance team that’s tired of surprise invoices.

Why this matters for the industry

Most of the field has been trying to fix hallucinations from the inside, making models bigger, smarter, and better at policing themselves. Probably is betting the smarter move is engineering around the model with hard validation, not throwing more compute at it.

Elias takes a direct shot at the big labs on this point. ‘I think it’s really interesting that the big AI labs have not even attempted to do this,’ he said. ‘They’re incentivized not to, because they make money the more times you have to correct the model.’ Read that how you want, but the incentive he’s pointing at is real. Usage-based pricing rewards more calls, not fewer.
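A simplified model makes the incentive concrete. Assume, purely for illustration, that each call returns a correct answer independently with probability p, that every wrong answer triggers another call, and that each call costs c. The number of calls N until a correct answer is then geometrically distributed:

```latex
\Pr[N = k] = (1-p)^{k-1}\,p, \qquad
\mathbb{E}[N] = \sum_{k=1}^{\infty} k\,(1-p)^{k-1}\,p = \frac{1}{p}, \qquad
\mathbb{E}[\text{bill}] = \frac{c}{p}.
```

At p = 0.5 that averages 2 calls per correct answer; at p = 0.9, about 1.11. Under per-call pricing, a less reliable model bills more for the same result.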

What stands out to me is the direction. The expensive path is a giant model that tries to be right. The cheap path is a small model wrapped in a tight validation harness. If Probably’s approach holds up, it challenges the assumption that accuracy and frontier-scale compute have to go together.

What to watch next

The data science tool is the starting point, not the endgame. Elias says the same engine can extend to accounting, medical services, and what he calls ‘any precision-sensitive use case.’ Those are exactly the fields where a single hallucination can do real damage, and where deterministic checking earns its keep.
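The article doesn't say what checks Probably would run in accounting. As a hypothetical illustration of the kind of hard check these fields allow, here is a validator for an invoice a model has extracted: the line items must sum exactly to the stated total. It uses `Decimal` because binary floats cannot represent most cent amounts exactly, so exact-equality checks on money are unreliable with `float`. `check_invoice` is a made-up name.

```python
from decimal import Decimal, InvalidOperation


def check_invoice(line_items: list[str], stated_total: str) -> list[str]:
    """Return a list of problems; an empty list means the extraction passes."""
    try:
        amounts = [Decimal(x) for x in line_items]
        total = Decimal(stated_total)
    except InvalidOperation:  # raised for strings that are not valid numbers
        return ["non-numeric amount"]
    computed = sum(amounts, Decimal("0"))
    if computed != total:
        return [f"line items sum to {computed}, invoice says {total}"]
    return []


print(check_invoice(["19.99", "5.01", "75.00"], "100.00"))  # []
print(check_invoice(["19.99", "5.01", "75.00"], "101.00"))
```

A model that misreads a single digit fails this check every time, which is the property that matters where one wrong number does real damage.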

The open question is whether the harness approach scales beyond clean, structured datasets into messier problems. Validation is easy when there’s a dataset to check against. It gets harder when the ground truth is fuzzy. For now, Probably has the funding and a contrarian thesis worth tracking. More details are available in the original TechCrunch AI report.
