Citation Bugs Cost One Dev 70% of His Build Time. Here Are the 6 Failure Modes He Mapped.

Building the retrieval pipeline for a legal AI assistant took 30% of the project. Getting the LLM to cite sources correctly took the other 70%.

That’s the ratio a developer shared after shipping an AI research tool for a German law firm. The pipeline was the easy part. The hard part was convincing the model to stop inventing its own citation format. He went through six distinct failure modes before he had something a lawyer would actually trust. Each one was a different flavor of the same problem: the model knew how to find information but not how to attribute it with precision.

Why This Problem Hits Different in Legal Work

In law, vague attribution is worse than no attribution. “According to legal guidelines” is worthless. “Pursuant to Article 32(1)(a) DSGVO as interpreted by the EuGH in C-300/21” is what a lawyer actually needs. The specificity is not pedantry. It is how they verify the answer before they stake their name on it.

This matters because lawyers do not use AI output as a final answer. They use it as a starting point for verification. If the citation is malformed, incomplete, or fabricated, the verification step fails entirely and the lawyer has to redo the research from scratch. A bad citation is not just unhelpful; it actively wastes time. Most developers treat citation as a cosmetic layer added at the end. Legal work reveals it is one of the hardest structural problems in applied LLM development.

The 6 Failure Modes (and What Fixed Each One)

🔎 Failure 1: Vague category citations. Instead of naming a document, the model writes “according to professional literature.” It is citing a metadata label, not a source. The model latched onto the category name in the context because that was the most salient text near the relevant passage. Fix: tell the model explicitly, “NEVER paraphrase the category name as a source reference,” and back the rule with concrete examples of the bad pattern. Showing the wrong output alongside the correct output in the prompt was more effective than describing the rule in abstract terms.
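
A rule like this can be kept as data and rendered into the prompt, so every new bad pattern becomes one more entry. The sketch below is an illustration only: the original prompt was not published, and `CitationExample`, `build_citation_rule`, and the bracketed placeholder citation are hypothetical names and wording.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CitationExample:
    bad: str   # output pattern the model should stop producing
    good: str  # what the same sentence should look like instead


def build_citation_rule(examples: list[CitationExample]) -> str:
    """Render one rule plus contrastive WRONG/RIGHT pairs as prompt text."""
    lines = [
        "NEVER paraphrase the category name as a source reference.",
        "Cite the specific document instead.",
        "",
    ]
    for ex in examples:
        lines.append(f"WRONG: {ex.bad}")
        lines.append(f"RIGHT: {ex.good}")
        lines.append("")
    return "\n".join(lines).rstrip()


# The bad phrasings and the DSGVO citation are quoted from this article;
# the bracketed citation is a placeholder, not a real reference.
EXAMPLES = [
    CitationExample(
        bad="According to professional literature, ...",
        good="According to [document title], [section], ...",
    ),
    CitationExample(
        bad="According to legal guidelines, ...",
        good="Pursuant to Article 32(1)(a) DSGVO as interpreted by the EuGH in C-300/21, ...",
    ),
]

CITATION_RULE = build_citation_rule(EXAMPLES)
```

Appending `CITATION_RULE` to the system prompt keeps the wrong and right outputs side by side, which is the part the developer found effective.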

⚖️ Failure 2: Internal labels leaking into output. The model outputs “(Kategorie: High court decision)” as if that means something to the reader. It does not. This happened because the chunk metadata was embedded directly in the retrieved text, and the model treated it as part of the citable content. Fix: ban the pattern in your prompt and require the actual document title or court name instead. Better yet, strip metadata labels from retrieved chunks before they hit the prompt.
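
Stripping the labels can be a small preprocessing step. This is a minimal sketch that assumes the labels look like “(Kategorie: ...)” inline or “Kategorie: ...” on a line of their own; the real chunk format is not described in the original post, so the patterns and function names here are hypothetical.

```python
import re

# Assumed label formats; adjust to whatever the chunker actually emits.
LABEL_INLINE = re.compile(r"\s*\(Kategorie:[^)]*\)")
LABEL_LINE = re.compile(r"^[ \t]*Kategorie:.*$\n?", re.MULTILINE)


def strip_labels(chunk_text: str) -> str:
    """Remove internal category labels before the chunk reaches the prompt."""
    without_inline = LABEL_INLINE.sub("", chunk_text)
    return LABEL_LINE.sub("", without_inline).strip()


def leaks_label(answer: str) -> bool:
    """Post-generation check: did an internal label reach the output anyway?"""
    return re.search(r"Kategorie:", answer) is not None
```

If the category still matters for attribution, it can be passed in a separate, clearly delimited field instead of sitting inside the citable text.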

Failure 3: Wrong authority attribution. A finding from a high court gets credited to a lower one, or the reverse. In legal work, the authority level of the court changes the weight of the ruling entirely. A district court finding and a federal supreme court ruling on the same question are not interchangeable. Fix: require the model to check which category section the document appears in before attributing it, and include a worked example showing the correct logic. Structured prompting with explicit decision steps outperformed general instructions like “be accurate about court levels.”
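
One way to make “check which category section the document appears in” concrete is to group retrieved chunks under explicit authority headings and spell out the decision steps. The sketch below assumes each chunk already carries a title and an authority level; the `Chunk` type, the level names, and the wording of the steps (including the fallback in step 4) are illustrative and do not come from the original prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    title: str      # document title, or court name plus file number
    authority: str  # hypothetical level label, e.g. "high court"
    text: str


# Illustrative wording; the developer's actual steps were not published.
ATTRIBUTION_STEPS = """Before attributing a finding to a court:
1. Locate the document that contains the finding.
2. Read the section heading that document appears under.
3. Attribute the finding to that authority level and no other.
4. If the level is unclear, name the document without asserting a level."""


def build_context(chunks: list[Chunk], order: list[str]) -> str:
    """Group chunks under one heading per authority level, highest first."""
    sections = []
    for level in order:
        docs = [c for c in chunks if c.authority == level]
        if not docs:
            continue  # skip levels with no retrieved documents
        body = "\n\n".join(f"[{c.title}]\n{c.text}" for c in docs)
        sections.append(f"=== {level.upper()} ===\n{body}")
    return "\n\n".join(sections)
```

Because every document sits under exactly one heading, the model has a single place to look when it applies the steps.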

Failure 4: Flattening divergent positions. When two courts disagree on the same question, the model synthesizes them into a single position, usually whichever had clearer language rather than higher authority. This is one of the most dangerous failure modes because it looks coherent. The output reads well. But it has silently erased a legal tension that the lawyer needs to know exists. Fix: require both positions to be presented separately with source and authority level noted for each. Instruct the model that unresolved tension between sources is itself useful information, not a formatting problem to smooth over.
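
A prompt rule handles the instruction side; a cheap check on test cases can catch the flattening when it still happens. The sketch assumes the test case has been hand-labeled with the sources known to disagree, which is an assumption about the workflow and not something the original post describes. `DIVERGENCE_RULE` and `missing_positions` are hypothetical names.

```python
# Illustrative rule text, not the developer's actual prompt.
DIVERGENCE_RULE = """When sources disagree on the same question:
- Present each position in its own paragraph.
- Name the source and its authority level for each position.
- Do not merge the positions into one conclusion.
- State plainly that the question is unresolved between these sources."""


def missing_positions(answer: str, conflicting_sources: list[str]) -> list[str]:
    """Return the conflicting sources that the answer never names.

    A non-empty result suggests the answer merged or dropped a position.
    Matching is a plain case-insensitive substring test, so it only works
    when the answer cites sources by the same name used in the test case.
    """
    lowered = answer.lower()
    return [s for s in conflicting_sources if s.lower() not in lowered]
```

For example, an answer that names only one of two labeled sources returns the other, which is enough to flag the case for review.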

🧠 Failure 5: False absence claims. “The documents contain no information about X” stated confidently, while the information sat buried in dense legal text two paragraphs down. The model was not lying. It was pattern-matching on surface-level relevance and missing deeper structural connections that required reading the whole clause. Fix: instruct the model not to claim absence without thorough verification, and give it safer fallback phrasing like “the available excerpts may not contain the full details.” That hedge is honest and professional. Confident absence claims are neither.
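
Alongside the prompt instruction, confident absence claims are easy to flag after generation so the answer can be retried or reviewed. The patterns below are assumptions about how such claims tend to be phrased in English output, not a list from the original post; a German-language deployment would need its own.

```python
import re

# Assumed phrasings of confident absence claims; extend from real outputs.
ABSENCE_PATTERNS = [
    r"\bcontains? no information\b",
    r"\bno (?:mention|information|reference) (?:of|about|to)\b",
    r"\b(?:is|are) not (?:mentioned|addressed|covered)\b",
]
ABSENCE_RE = re.compile("|".join(ABSENCE_PATTERNS), re.IGNORECASE)

# Fallback phrasing quoted from the article.
FALLBACK = "The available excerpts may not contain the full details."


def find_absence_claims(answer: str) -> list[str]:
    """Return each matched absence phrase so the answer can be flagged."""
    return [m.group(0) for m in ABSENCE_RE.finditer(answer)]
```

A flagged answer can be regenerated with `FALLBACK` offered as the acceptable wording, or routed to a second pass that rereads the full clause.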

Failure 6: Overly emphatic language. “Without any doubt.” “Very clearly.” Lawyers find this unprofessional because legal analysis is almost never without doubt. The entire discipline exists to navigate ambiguity. When an AI assistant writes with more certainty than the underlying law actually provides, it signals to the reader that the system does not understand what it is talking about. Fix: add a tone instruction that requires factual, measured language and lets the cited sources carry the authority. This change also improved trust scores with the lawyers using the tool, even on answers that were technically correct before.
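
The tone rule and a matching output check can share one phrase list. In this sketch the first two phrases are quoted from the article and the other two are assumed additions; `TONE_RULE` and `emphatic_phrases_in` are hypothetical names.

```python
import re

EMPHATIC_PHRASES = [
    "without any doubt",
    "very clearly",
    "undoubtedly",      # assumption, not from the original post
    "beyond question",  # assumption, not from the original post
]

# Prompt text generated from the same list the checker uses.
TONE_RULE = (
    "Use factual, measured language. Do not use intensifiers such as "
    + ", ".join(f'"{p}"' for p in EMPHATIC_PHRASES)
    + ". Let the cited sources carry the authority."
)


def emphatic_phrases_in(answer: str) -> list[str]:
    """Return the listed phrases that occur in the answer, in list order."""
    lowered = answer.lower()
    return [
        p for p in EMPHATIC_PHRASES
        if re.search(rf"\b{re.escape(p)}\b", lowered)
    ]
```

Keeping the list in one place means a phrase a lawyer objects to is added once and then both banned in the prompt and caught in the output.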

What This Actually Teaches

None of these are retrieval problems. All six are prompting problems. The pipeline surfaced them, but in each case the fix was a targeted instruction, usually paired with a concrete example of what not to do.

The broader pattern here: each failure mode required a specific counter-instruction, not a general directive to “be accurate” or “cite sources properly.” Vague prompt rules produce vague compliance. If you want the model to avoid a specific bad behavior, you have to show it that exact behavior and tell it to stop.

If you are building anything where citation precision matters, whether legal research, medical literature, financial compliance, or academic work, these patterns are probably already in your system. You just have not hit the edge case that exposes them yet.
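
A failure mode list can live as a small set of named checks that run over every test answer. The sketch below is deliberately crude: each check is a substring test on a phrase quoted in this article, and the check names are hypothetical. Real checks would need to be broader.

```python
from typing import Callable

# A check returns True when the answer passes.
Check = Callable[[str], bool]

CHECKS: dict[str, Check] = {
    "no_category_as_source": lambda a: "professional literature" not in a.lower(),
    "no_label_leak": lambda a: "kategorie:" not in a.lower(),
    "no_absence_claim": lambda a: "contain no information" not in a.lower(),
    "no_emphatic_tone": lambda a: "without any doubt" not in a.lower(),
}


def failed_checks(answer: str) -> list[str]:
    """Return the names of all checks the answer fails, in registry order."""
    return [name for name, check in CHECKS.items() if not check(answer)]
```

Each new failure a user reports becomes one more entry in `CHECKS`, so the list grows with the system instead of living in someone's memory.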

Build your failure mode list before your end users build it for you.

I spent 40% of my development time preventing an LLM from citing sources wrong. here are the 7 failure modes I found
by u/Fabulous-Pea-5366 in PromptEngineering
