AI Data Extraction: Stop Hallucinations with 'Do Not Infer'

Ask an AI to “extract all items from this receipt” and watch it confidently invent numbers it doesn’t know.

One developer building an AI shopping app ran into this exact wall. The receipts were real, the model was capable, and the pipeline looked solid on paper. But the output kept drifting. Quantities were off. Prices got rounded in ways the original receipt never showed. Items that appeared once got duplicated. The developer tried every obvious fix: rephrasing the prompt, switching models, adding examples. The thing that actually moved the needle wasn’t a better model or a longer prompt. It was three words.

“Do not infer.”

The old approach vs. what actually worked

The vague version looked like this: “Extract all items from this receipt.” Result: inconsistent JSON structure, missing items, and totals showing up as line items. This is classic hallucination behavior: the model fills gaps with plausible-sounding data.

Here’s what that actually looks like in practice. A receipt shows “Organic Milk x2” with no unit price listed, only the line total. The model sees $5.98, does the math, and returns a unit_price of $2.99 that was never on the paper. That calculated value then flows into your database as extracted fact. Your downstream logic trusts it. Your reports are now built on a number the AI made up, not a number the receipt contained. At scale, across thousands of receipts, those confident guesses add up fast.
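One way to see the problem concretely is to check whether each extracted price literally appears in the receipt text. The sketch below is only an illustration: the helper names (printed_amounts, ungrounded_fields) and the one-line receipt string are assumptions, and a real receipt would need more careful parsing than a single regex.

```python
import re

def printed_amounts(receipt_text: str) -> set[str]:
    """Collect every two-decimal amount that literally appears in the text."""
    return set(re.findall(r"\d+\.\d{2}", receipt_text))

def ungrounded_fields(item: dict, receipt_text: str) -> list[str]:
    """Return the price fields whose values never appear on the receipt."""
    seen = printed_amounts(receipt_text)
    flagged = []
    for field in ("unit_price", "total_price"):
        value = item.get(field)
        # None means "not on the receipt", which is the honest answer.
        if value is not None and f"{value:.2f}" not in seen:
            flagged.append(field)
    return flagged

# Hypothetical receipt line: only the line total is printed.
receipt = "Organic Milk x2    5.98"
inferred = {"name": "Organic Milk", "qty": 2, "unit_price": 2.99, "total_price": 5.98}
honest = {"name": "Organic Milk", "qty": 2, "unit_price": None, "total_price": 5.98}

print(ungrounded_fields(inferred, receipt))  # ['unit_price']
print(ungrounded_fields(honest, receipt))    # []
```

The 2.99 gets flagged because it was computed, not read. The null version passes because it claims nothing the paper doesn’t show.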

The fix wasn’t more context. It was tighter constraints.

The working prompt defined every output field explicitly:

🏷️ name: product name as printed, no interpretation
📦 qty: numeric quantity only
💰 unit_price / total_price: price per unit and line total, separately
⚖️ unit_type: one of (each | kg | lb | L | oz | pack)

And then the rule that changed everything: If a field is not present on the receipt, return null. Do not infer or calculate missing values.

By the developer’s account, accuracy jumped roughly 40%. Not from a model upgrade. Not from more tokens. From one sentence that told the model exactly what it was not allowed to do.
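The original post’s exact prompt text isn’t reproduced here, so treat the following as a minimal sketch of how the field definitions and the null rule could be assembled. The field names and the null rule come from the description above; the surrounding wording, the build_extraction_prompt helper, and the sample receipt line are assumptions.

```python
# Allowed unit types, as listed in the field definitions.
UNIT_TYPES = ["each", "kg", "lb", "L", "oz", "pack"]

# One explicit rule per output field (wording is illustrative).
FIELD_RULES = {
    "name": "product name as printed, no interpretation",
    "qty": "numeric quantity only",
    "unit_price": "price per unit, numeric only",
    "total_price": "line total, numeric only",
    "unit_type": "one of: " + " | ".join(UNIT_TYPES),
}

NULL_RULE = (
    "If a field is not present on the receipt, return null. "
    "Do not infer or calculate missing values."
)

def build_extraction_prompt(receipt_text: str) -> str:
    """Assemble the field definitions, the null rule, and the receipt text."""
    lines = [
        "Extract each line item from the receipt as a JSON array of objects.",
        "Fields:",
    ]
    lines += [f"- {name}: {rule}" for name, rule in FIELD_RULES.items()]
    lines.append(NULL_RULE)  # placed right after the field definitions
    lines.append("Receipt:")
    lines.append(receipt_text)
    return "\n".join(lines)

prompt = build_extraction_prompt("Organic Milk x2    5.98")
print(prompt)
```

Keeping the rules in a dictionary means the prompt and any downstream validation can share one definition of the schema.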

Why this works

Language models are trained to be helpful. So when data is missing, they fill in what seems right. A price is blank? The model calculates from context. Quantity not listed? It assumes one. Unit type ambiguous? It picks the most common option for that product category. Every “helpful” guess is a potential error in your pipeline.

“Do not infer” short-circuits that instinct. You’re not asking the model to be smarter. You’re asking it to be honest about what it doesn’t know.

This is fundamentally different from how most people think about prompt engineering. The default instinct is to give the model more information, more examples, more context. But for structured extraction, the bigger problem is usually over-generation, not under-generation. The model is producing too much, not too little. Constraining what it cannot do is often more effective than expanding what it can.

There’s a second fix buried in the original post that also matters: stop asking for everything in one call. When you request line items and totals together, the model starts treating calculated totals as extracted items. A subtotal becomes a line item. Tax gets listed alongside products. The structure collapses because the model is trying to satisfy two different tasks at once. Keep extraction and aggregation separate. Let the first call pull raw data. Let the second call compute derived values. That boundary alone removes an entire category of errors.
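Here is a rough sketch of that boundary. The post describes the second step as another call; this example swaps in plain Python for the aggregation instead, which is an assumption, not the original author’s method. The aggregate function and the sample items are hypothetical.

```python
def aggregate(items: list[dict]) -> dict:
    """Compute derived values from line items that were already extracted."""
    totals = [it["total_price"] for it in items if it.get("total_price") is not None]
    missing = sum(1 for it in items if it.get("total_price") is None)
    return {
        "item_count": len(items),
        "subtotal": round(sum(totals), 2),
        # Report gaps instead of guessing values for them.
        "items_missing_total": missing,
    }

# Output of the extraction step: raw fields only, null where nothing was printed.
extracted = [
    {"name": "Organic Milk", "qty": 2, "unit_price": None, "total_price": 5.98},
    {"name": "Bananas", "qty": None, "unit_price": None, "total_price": 1.27},
    {"name": "Bread", "qty": 1, "unit_price": None, "total_price": None},
]

summary = aggregate(extracted)
print(summary)
```

Because the extraction step never sees a request for totals, a subtotal has no reason to show up as a line item, and every derived number can be traced back to code you control.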

How to apply this to any extraction prompt

Define every output field with a name and exact format description. No ambiguity. If a field should be a number, say “numeric only.” If it should match printed text exactly, say “as printed, no normalization.”
Specify value constraints explicitly: numeric only, enum options, text as-printed. For enums, list every valid option. If the model sees something that doesn’t match, it should return null, not pick the closest option.
Add an explicit null rule: “If the field is missing, return null. Do not calculate or estimate.” Put this at the end of your field definitions where it’s impossible to miss.
Split extraction from aggregation into separate prompts or calls. Never ask the model to both read and compute in the same pass.

The cleaner the schema, the less room the model has to improvise. And the more specific your null rule, the less your pipeline has to defend against confident wrong answers.
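The same constraints can also be enforced on the pipeline side, after the model responds. This is a minimal sketch, assuming the model returns a JSON array of objects with the field names used earlier; the clean_item helper and the sample output string are made up for illustration.

```python
import json

UNIT_TYPES = {"each", "kg", "lb", "L", "oz", "pack"}
FIELDS = ("name", "qty", "unit_price", "total_price", "unit_type")

def clean_item(raw: dict) -> dict:
    """Keep only known fields; replace constraint-breaking values with None."""
    item = {field: raw.get(field) for field in FIELDS}
    for field in ("qty", "unit_price", "total_price"):
        value = item[field]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            item[field] = None
    unit = item["unit_type"]
    # Anything outside the enum becomes None rather than the closest match.
    if not isinstance(unit, str) or unit not in UNIT_TYPES:
        item["unit_type"] = None
    if not isinstance(item["name"], str):
        item["name"] = None
    return item

# Hypothetical model output with a non-numeric qty and an invalid unit_type.
model_output = (
    '[{"name": "Organic Milk", "qty": "two", '
    '"total_price": 5.98, "unit_type": "bottle"}]'
)
items = [clean_item(raw) for raw in json.loads(model_output)]
print(items[0])
```

Values that break the schema are nulled out rather than repaired, which keeps the pipeline consistent with the rule the prompt gives the model.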

If you’re building any structured extraction pipeline, lead with the constraint layer, not just the instruction layer. Tell the model what it cannot do as clearly as you tell it what to do. That’s the prompt discipline most people skip. It’s also the one that makes the biggest difference once your pipeline hits real-world data at any meaningful volume.

Prompt structure that improved receipt data extraction accuracy by ~40% — sharing what worked
by u/AdEfficient8374 in PromptEngineering
