Phargo: AI Builds PHP Interpreter with Adversarial Tests

A developer who admits he doesn’t know Rust just watched his AI-written Rust interpreter render a full WordPress front page, and the engine behind it contains zero lines of PHP’s actual source code. According to Hacker News, where the project climbed to 167 points, the from-scratch interpreter is called Phargo, and it now passes 3,844 of PHP’s 22,037 official tests. That’s 17.4% of the entire upstream suite, built up from a starting score of zero.

What stands out here isn’t the pass rate. It’s the method.

The build loop that refuses to cheat

The creator describes his own contribution as “aiming.” The AI writes the code. He points it at a target, reads the output “like a medieval king reviewing naval charts,” and types the most powerful phrase in modern software: “looks good, continue.”

The trick that makes this more than another “it works!” demo is the grading system. Every AI-built project claims success, usually judged by the same AI that wrote it. Phargo doesn’t let the AI grade its own homework. Instead it runs against PHP’s real test suite: roughly 22,000 .phpt files the PHP internals team wrote over three decades. The developer didn’t write them. The AI didn’t write them. They encode every cursed corner of the language, from DateTime daylight-saving math to exactly what var_dump() prints for a float.

The loop is thin:

The AI runs a failure histogram to find the biggest cluster of fixable tests
It implements the fix
It runs the full 22,000-test scoreboard (about 7 minutes)
If the number went up: commit, push, repeat
If it went down: “hmm, that regressed, look again”
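
The loop above can be sketched in a few lines. This is an illustration only, written in Python rather than the project's Rust, and not the author's code: the function names, the failure records, and the "reason" labels are hypothetical, and the handling of an unchanged score is an assumption, since the write-up only describes what happens when the number goes up or down.

```python
from collections import Counter


def failure_histogram(failures):
    """Count failing tests per reason label, largest cluster first."""
    return Counter(reason for _test, reason in failures).most_common()


def decide(previous_passed, current_passed):
    """Choose the next action from the externally computed score alone."""
    if current_passed > previous_passed:
        return "commit"
    if current_passed < previous_passed:
        return "regressed"
    # Assumption: the source does not say what happens on a tie.
    return "no-change"


# Hypothetical failure records: (test file, coarse reason label).
failures = [
    ("date/dst_001.phpt", "DateTime"),
    ("date/dst_002.phpt", "DateTime"),
    ("standard/var_dump_float.phpt", "var_dump"),
]

histogram = failure_histogram(failures)
biggest_cluster = histogram[0]      # the cluster to aim the next fix at
action = decide(3843, 3844)         # score before and after the fix
print(biggest_cluster, action)
```

The point of keeping decide() this small is that nothing in it can be argued with: it sees two integers produced by a suite neither the developer nor the AI wrote.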

That number, as the author puts it, “cannot be flattered, negotiated with, or prompted into a better mood.”

Measure your measurement

The most useful lesson has nothing to do with Rust. Early on, the pass rate plateaued in a way that felt wrong. Whole categories of simple tests failed with diffs that looked identical to the expected output.

The culprit was invisible: carriage returns. The test corpus had been checked out on Windows with CRLF line endings, and the scoreboard compared output byte-for-byte. PHP’s own runner normalizes line endings first. This one didn’t. So the harness had been silently failing nearly every multi-line test for weeks. One line of normalization code flipped hundreds of tests to green instantly.
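
A minimal sketch of that kind of fix, in Python for illustration rather than the project's Rust. The function names are hypothetical, and it assumes that converting CRLF to LF on both sides before comparing is enough; the actual one-line fix in Phargo's harness is not shown in the write-up.

```python
def normalize_newlines(text):
    """Convert Windows CRLF line endings to LF."""
    return text.replace("\r\n", "\n")


def outputs_match(expected, actual):
    """Compare test output after normalizing line endings on both sides."""
    return normalize_newlines(expected) == normalize_newlines(actual)


# Expected output as checked out on Windows, actual output from the engine.
expected = "int(1)\r\nint(2)\r\n"
actual = "int(1)\nint(2)\n"

naive_result = expected == actual               # byte-for-byte comparison
fixed_result = outputs_match(expected, actual)  # normalized comparison
print(naive_result, fixed_result)
```

The two strings look identical in most diff viewers, which is why the failure stayed invisible: only the raw comparison sees the extra carriage returns.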

The takeaway, in his words: “measure your measurement. Your oracle is only as honest as the plumbing that connects you to it.”

Running hostile code without burning down the house

Some of those 22,000 files are bombs. Not malicious, just accidental: regression tests for ancient memory bugs, generators that expand into infinity, tests meant to run only inside PHP’s own fenced CI. The developer found this out when his machine hard-restarted. Not a crash, a full black-screen reboot, because a generator test ate every byte of RAM in the house.

The engine got paranoid, and it wears it well:

A capped allocator that physically cannot exceed 6 GiB
A step limit so infinite loops die with an error, not a space heater
Caps on string sizes, array nodes, and output length
A breadcrumb file naming the current test, so hangs are traceable
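
Three of those guards (the step limit, the output cap, and the breadcrumb file) can be sketched as follows. This is a Python illustration under stated assumptions, not Phargo's Rust implementation: the class and function names, the limit values, and the breadcrumb file name are all hypothetical, and the capped allocator is left out because it depends on the host language's allocator hooks.

```python
import os
import tempfile


class LimitExceeded(Exception):
    """Raised when a guarded resource budget is exhausted."""


class Guard:
    """Tracks executed steps and emitted output against fixed caps."""

    def __init__(self, max_steps, max_output_bytes):
        self.max_steps = max_steps
        self.max_output_bytes = max_output_bytes
        self.steps = 0
        self.output = bytearray()

    def tick(self):
        """Call once per interpreted step; fails once the budget is spent."""
        self.steps += 1
        if self.steps > self.max_steps:
            raise LimitExceeded("step limit exceeded")

    def write(self, data):
        """Append output, refusing to grow past the cap."""
        if len(self.output) + len(data) > self.max_output_bytes:
            raise LimitExceeded("output limit exceeded")
        self.output += data


def write_breadcrumb(path, test_name):
    """Record the current test on disk before running it."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(test_name + "\n")
        handle.flush()
        os.fsync(handle.fileno())  # make sure it survives a hard reboot


def spin_forever(guard):
    """Stand-in for a test that never terminates."""
    while True:
        guard.tick()


crumb_path = os.path.join(tempfile.mkdtemp(), "current_test.txt")
write_breadcrumb(crumb_path, "generators/endless.phpt")

guard = Guard(max_steps=1000, max_output_bytes=64)
try:
    spin_forever(guard)
    outcome = "finished"
except LimitExceeded as err:
    outcome = str(err)
print(outcome)
```

Writing the breadcrumb before the test starts, and forcing it to disk, is what makes a hang or reboot traceable: whatever name is in the file when the machine comes back is the test that was running.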

None of it is glamorous. All of it is the gap between “research project” and “thing that can safely chew through 22,000 hostile files unattended.”

Why this matters

The honest ceiling here is around 40 to 45%, since the rest of the suite tests C extensions like GD, curl, and MySQL drivers that are out of scope. Nobody’s replacing PHP with this.

The real signal is the workflow. An external, adversarial oracle you don’t control turns “AI-built” from a vibe into a number. If you’re shipping AI-generated code, the practical move is to borrow the pattern: find a test suite you didn’t write, wire it up so the score is generated automatically, and audit the harness itself before trusting the results. The author’s favorite bug genre proves the point: features that parse, run without error, and do nothing. In one case, the clone operator evaluated to NULL engine-wide, quietly breaking every immutable date operation until a test he didn’t write caught it.

The full write-up and live scoreboard are available at the original source.
