Why LLM-Generated Code Fails: Testing Best Practices

A single benchmark number is upending how developers should think about AI-generated code. According to a detailed technical post on Hacker News, a ground-up LLM-generated Rust rewrite of SQLite performs a basic primary key lookup in 1,815.43 milliseconds. SQLite does the same operation in 0.09 milliseconds. That’s a 20,171x performance gap on one of the most fundamental database operations in existence.

The code compiled. It passed its tests. It reads and writes the correct SQLite file format. On paper, it looked like a working database engine. It wasn’t.

Plausibility Is Not Correctness

This is the core tension the Hacker News analysis surfaces: LLMs optimize for plausibility, not correctness. The Rust reimplementation is not trivial code. It spans 576,000 lines across 625 files, with a parser, query planner, VDBE bytecode engine, B-tree, pager, and WAL. The architecture uses all the right names. The modules are structured correctly. And yet two compounding bugs make it catastrophically slow in practice.

Bug one: the query planner doesn’t recognize INTEGER PRIMARY KEY as an alias for SQLite’s internal rowid. In SQLite’s where.c, a single line converts named column references to rowid lookups, enabling O(log n) B-tree searches. The Rust reimplementation’s is_rowid_ref() function only checks for three literal strings: rowid, _rowid_, and oid. A column declared as id INTEGER PRIMARY KEY, even when flagged internally as is_ipk: true, never triggers the B-tree fast path. Every WHERE id = N query runs a full table scan. At 100 rows with 100 lookups, that’s 10,000 comparisons instead of roughly 700 B-tree steps.
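The arithmetic behind those two figures can be reproduced with a small cost model. This is a sketch, not a measurement: it assumes a full scan compares every row on every lookup and that an index lookup costs about ceil(log2(n)) steps, which simplifies how a real page-based B-tree behaves. The function names are illustrative and do not come from either codebase.

```python
import math


def scan_comparisons(rows: int, lookups: int) -> int:
    """Full table scan: each lookup compares against every row (no early exit)."""
    return rows * lookups


def btree_steps(rows: int, lookups: int) -> int:
    """Idealized balanced search: about ceil(log2(rows)) steps per lookup."""
    return lookups * math.ceil(math.log2(rows))


rows, lookups = 100, 100
print(scan_comparisons(rows, lookups))  # 10000
print(btree_steps(rows, lookups))       # 700
```

The gap is small at 100 rows, but the scan grows linearly with table size while the search grows logarithmically, so it widens quickly on larger tables.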

Bug two: every INSERT outside a transaction triggers a full fsync() call. One hundred inserts means 100 disk syncs. That’s why the INSERT benchmark comes in at 1,857x slower than SQLite’s batched mode.
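The cost of committing per statement can be observed against stock SQLite using Python’s built-in sqlite3 module. The sketch below is an illustration under assumptions: the table name, file names, and row count are made up, and the timings depend heavily on the disk and on SQLite’s journal and synchronous settings, so no specific ratio is claimed.

```python
import os
import sqlite3
import tempfile
import time


def insert_rows(path: str, n: int, batched: bool) -> tuple[float, int]:
    """Insert n rows; return (elapsed seconds, rows present afterwards)."""
    # isolation_level=None puts the sqlite3 module in autocommit mode,
    # so each INSERT commits on its own unless we open a transaction.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute("CREATE TABLE IF NOT EXISTS t (id INTEGER PRIMARY KEY, v TEXT)")
    start = time.perf_counter()
    if batched:
        conn.execute("BEGIN")  # one transaction, one commit at the end
    for i in range(n):
        conn.execute("INSERT INTO t (v) VALUES (?)", (f"row-{i}",))
    if batched:
        conn.execute("COMMIT")
    elapsed = time.perf_counter() - start
    count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    conn.close()
    return elapsed, count


with tempfile.TemporaryDirectory() as tmp:
    auto_s, _ = insert_rows(os.path.join(tmp, "auto.db"), 100, batched=False)
    batch_s, _ = insert_rows(os.path.join(tmp, "batch.db"), 100, batched=True)

print(f"autocommit: {auto_s * 1000:.2f} ms, one transaction: {batch_s * 1000:.2f} ms")
```

Both runs write the same 100 rows. The only difference is how many commits, and therefore how many potential disk syncs, the database has to perform.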

Neither bug is exotic. Both are detectable with targeted benchmarks.
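For the first bug, a check does not even need a timer. Stock SQLite reports its chosen strategy through EXPLAIN QUERY PLAN, and the same kind of check could be pointed at any engine that exposes a plan. The sketch below runs against Python’s built-in sqlite3 module; the table and helper names are illustrative, and the exact wording of the plan text varies between SQLite versions, so the check only looks for the SEARCH keyword and the INTEGER PRIMARY KEY marker.

```python
import sqlite3


def lookup_plan(sql: str) -> str:
    """Return SQLite's EXPLAIN QUERY PLAN detail text for a one-parameter query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT)")
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, (1,)).fetchall()
    conn.close()
    # The human-readable description is the fourth column of each plan row.
    return " | ".join(row[3] for row in rows)


def uses_rowid_search(sql: str) -> bool:
    """True when the plan is a primary-key search rather than a table scan."""
    plan = lookup_plan(sql)
    return plan.startswith("SEARCH") and "INTEGER PRIMARY KEY" in plan


for query in (
    "SELECT v FROM t WHERE rowid = ?",
    "SELECT v FROM t WHERE id = ?",
    "SELECT v FROM t WHERE v = ?",
):
    print(query, "->", lookup_plan(query))
```

In stock SQLite the first two queries are planned as primary-key searches and the third as a scan. An assertion like uses_rowid_search("SELECT v FROM t WHERE id = ?") is the kind of test that would flag the alias bug described above.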

Why This Pattern Matters Beyond One Project

The Hacker News author is careful to frame this as a tools problem, not a developer problem. The failure patterns are produced by how LLMs generate code, not by the skill of the person using them. That framing is important.

External research backs it up. METR’s randomized study and GitClear’s large-scale repository analysis both point to the same finding: these quality gaps are systemic when output isn’t heavily verified. This isn’t an anecdote. It’s a pattern.

What makes this particularly dangerous is the “looks correct” problem. A codebase that fails to compile fails visibly and immediately. A codebase that compiles, passes its test suite, and ships silently broken behavior is much harder to catch without deliberately adversarial testing.

What Practitioners Should Do Right Now

The author’s main takeaway is direct: define your acceptance criteria before the first line of code is generated. That means:

Write benchmarks before prompting. If performance matters, establish baseline numbers from the reference implementation first.
Test for edge cases the code can’t see. LLM-generated tests tend to test what the code already does, not what it should do.
Treat “compiles and passes tests” as a floor, not a ceiling. Especially for systems code, that bar is far too low.
Benchmark the common paths, not just the fast path. In this case, the fast path (WHERE rowid = ?) worked fine. The common path (WHERE id = ?) silently fell back to a full table scan.
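One way to turn these points into an executable acceptance criterion is to record a baseline from the reference implementation and fail the build when the candidate exceeds an agreed slowdown. The sketch below is a minimal illustration: the helper names and the 10x budget are assumptions chosen for the example, not values from the original analysis. Only the two latencies are taken from the benchmark quoted at the top of this article.

```python
import time
from typing import Callable


def median_ms(fn: Callable[[], object], repeats: int = 5) -> float:
    """Run fn several times and return the median wall-clock time in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return samples[len(samples) // 2]


def slowdown(baseline_ms: float, candidate_ms: float) -> float:
    """How many times slower the candidate is than the baseline."""
    return candidate_ms / baseline_ms


def meets_budget(baseline_ms: float, candidate_ms: float, max_slowdown: float) -> bool:
    """Acceptance check: the candidate must stay within max_slowdown of the baseline."""
    return slowdown(baseline_ms, candidate_ms) <= max_slowdown


# Latencies quoted above: 0.09 ms for SQLite, 1,815.43 ms for the rewrite.
print(round(slowdown(0.09, 1815.43)))                  # 20171
print(meets_budget(0.09, 1815.43, max_slowdown=10.0))  # False
```

In practice median_ms would wrap the same query against both engines, and the budget would be agreed before any code is generated, so the check cannot be tuned to fit the result.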

For businesses deploying AI coding assistants at scale, this is a process design problem. The tooling is moving fast, but the review discipline hasn’t kept up. Teams that define success metrics upfront, run adversarial benchmarks, and treat AI output as a first draft rather than a final product will catch these failures before they ship.

The Broader Signal

The AI coding assistant market is expanding rapidly, with Copilot, Cursor, and a growing list of agentic coding tools pushing more autonomous code generation. The productivity gains are real. The analysis on Hacker News doesn’t dispute that. But as these tools take on more complex, lower-level systems work, the gap between “plausible” and “correct” becomes harder to spot and more expensive to fix.

The lesson isn’t to stop using LLMs for code. It’s to stop treating their output as verified until you’ve verified it yourself.
