Natural Selection Wrote a Better System Prompt Than Its Author Did

Natural selection just got a Python wrapper.

Yesterday, u/QuantumSeeds dropped AutoPrompt: a ~300-line, zero-dependency tool that applies genetic-algorithm logic to any text artifact, whether that is a system prompt, code, a regex pattern, or a config file. You give it a seed file and a criteria file describing what “better” looks like. It evolves. There is no scaffolding to set up and nothing to install, and the entire logic fits in a single file you can read in ten minutes.

What makes this different from prompt chaining or iterative refinement is the fitness model. Most prompt improvement workflows are linear: you write a prompt, test it, tweak it, test again. AutoPrompt replaces that linear process with a branching, competitive one. Each generation produces several competing variants, scores them against your criteria, and advances only the winner. The losing variants' text is thrown away. The pressure to improve is structural, not manual.

The twist: the mutations aren’t random. Each generation sees which strategies worked and which flopped in previous rounds, so the LLM’s variations get smarter over time instead of just being different. Generation 2 knows what generation 1 tried. Generation 3 builds on what generation 2 discovered. The selection pressure compounds. By round four or five, the algorithm is no longer exploring broadly. It is converging on something.
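To make the "each generation sees its history" idea concrete, here is a minimal sketch of how a mutation prompt could carry earlier wins and losses forward. This is not AutoPrompt's actual code: the `Attempt` record, the `build_mutation_prompt` function, and the prompt wording are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """One strategy tried in an earlier generation (hypothetical record shape)."""
    generation: int
    strategy: str
    score: float
    won: bool


def build_mutation_prompt(seed: str, criteria: str, history: list[Attempt],
                          n_variants: int = 3) -> str:
    """Assemble the text sent to the LLM for one generation (illustrative wording)."""
    worked = [a for a in history if a.won]
    flopped = [a for a in history if not a.won]

    def fmt(attempts: list[Attempt]) -> str:
        if not attempts:
            return "  (none yet)"
        return "\n".join(
            f"  - gen {a.generation}: {a.strategy} (scored {a.score:.1f})"
            for a in attempts
        )

    return "\n".join([
        "Improve the ARTIFACT so it better satisfies the CRITERIA.",
        f"Propose {n_variants} variants, each using a different strategy.",
        "",
        "CRITERIA:",
        criteria.strip(),
        "",
        "STRATEGIES THAT WON EARLIER ROUNDS (build on these):",
        fmt(worked),
        "",
        "STRATEGIES THAT LOST EARLIER ROUNDS (do not repeat):",
        fmt(flopped),
        "",
        "ARTIFACT:",
        seed.strip(),
    ])
```

The point of the sketch is only that the history is plain text appended to the request. Nothing is trained or fine-tuned between rounds; the "learning" lives in the prompt.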

Here’s the result that got attention: a vague “you are a helpful assistant” system prompt started at 3.2/10. By generation 5, it had self-added structured output rules, tone constraints, and edge case handling. Final score: 9.2/10. None of that was suggested by the author. The algorithm inferred that the criteria implied those additions and kept them because they scored higher. The author never wrote the words “structured output.” The fitness pressure did.

That is the part worth sitting with. The author wrote three lines of criteria. The algorithm wrote the prompt. Format, tone, and edge-case handling all emerged from the selection process itself, because they consistently improved the score. It is a different model of authorship: you define what good looks like, and the system finds the path there.

How the Evolution Loop Works 🧬

  1. Drop your seed file (a prompt, code snippet, anything text-based). The seed does not need to be good. A rough draft, a placeholder, or a deliberately underspecified starting point all work. The algorithm treats your seed as generation zero, not a finished artifact.
  2. Write a criteria file in plain language describing what “better” looks like. This is where you put your actual requirements. Be specific about outcomes, not mechanics. “Handles ambiguous user input without asking clarifying questions” is a better fitness signal than “be clearer.”
  3. The LLM generates several variants, each following a different strategy, and scores each one from 0 to 10 against your criteria. Some will be conservative edits. Some will be aggressive rewrites. The scoring happens in the same step, so you are not running separate evaluation passes.
  4. The best variant survives and becomes the seed for the next round. The losing variants are not discarded entirely, though: their strategies are logged and passed to the next generation as examples of what not to repeat, which is where the learning comes from.
  5. Repeat, with each generation learning from what previous rounds tried. Five generations is usually enough. The author found diminishing returns past generation 6 for most inputs, so it is worth running a short trial before committing to a long evolution.
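The five steps above can be sketched as a short loop. This is a simplified illustration under stated assumptions, not the tool's real implementation: `evolve`, `Variant`, and `Proposer` are hypothetical names, and the LLM call is replaced by a pluggable `propose` function so the control flow can be read and run on its own.

```python
from typing import Callable

# A variant is (strategy description, candidate text, score from 0 to 10).
Variant = tuple[str, str, float]
# Stand-in for the LLM call: given the current text, the criteria, and the
# history log, it returns a list of already-scored variants.
Proposer = Callable[[str, str, list[str]], list[Variant]]


def evolve(seed: str, criteria: str, propose: Proposer, generations: int = 5):
    """Run the select-and-repeat loop; returns (final text, final score, history)."""
    current, best_score = seed, None
    history: list[str] = []
    for gen in range(1, generations + 1):
        variants = propose(current, criteria, history)
        if not variants:
            break  # nothing proposed, so keep what we have
        winner = max(variants, key=lambda v: v[2])
        # Log every strategy, winners and losers alike, for the next round.
        for v in variants:
            outcome = "won" if v is winner else "lost"
            history.append(f"gen {gen}: {v[0]} scored {v[2]:.1f} ({outcome})")
        # Only the winner's text becomes the next generation's seed.
        current, best_score = winner[1], winner[2]
    return current, best_score, history
```

In a real run, `propose` would send the seed, criteria, and history to the model and parse the scored variants out of its reply. Keeping it as a parameter also makes the loop easy to test with a fake proposer before spending any model calls.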

Pro Tips

  • Start with a vague or underspecified seed. The algorithm finds structure you would never think to add manually. If you start with something too polished, the mutations have less room to work with and the gains are smaller.
  • Write scoring criteria in plain language. Words like “clearer,” “more robust,” and “handles edge cases” work well as fitness signals. You can also describe failure modes you want to avoid: “should not ask the user to repeat themselves” is a valid criterion.
  • It is not just for prompts. The author fed it a bubble sort with speed and correctness criteria. It evolved into a hybrid quicksort that falls back to insertion sort for small partitions, reportedly about 50x faster than the seed. The same logic applies to regex patterns, config templates, and anything else where “better” can be described in a sentence or two.
  • Run the same seed with two different criteria files and compare what emerges. The divergence tells you something useful about what your criteria are actually measuring versus what you thought they were measuring. It is a fast way to pressure-test your own definitions of quality.
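For readers who have not seen the pattern mentioned in the sorting tip, here is what a hybrid quicksort with an insertion-sort fallback generally looks like. This is a textbook-style sketch written for this article, not the code AutoPrompt produced, and the `cutoff` of 16 is an arbitrary illustrative value.

```python
def insertion_sort(items, lo, hi):
    """Sort items[lo..hi] in place; cheap for short runs."""
    for i in range(lo + 1, hi + 1):
        key = items[i]
        j = i - 1
        while j >= lo and items[j] > key:
            items[j + 1] = items[j]
            j -= 1
        items[j + 1] = key


def hybrid_sort(items, cutoff=16):
    """Return a sorted copy: quicksort on large ranges, insertion sort on small ones."""
    out = list(items)
    stack = [(0, len(out) - 1)]  # explicit stack avoids deep recursion
    while stack:
        lo, hi = stack.pop()
        if hi - lo + 1 <= cutoff:
            insertion_sort(out, lo, hi)
            continue
        # Median-of-three pivot guards against already-sorted input.
        mid = (lo + hi) // 2
        pivot = sorted((out[lo], out[mid], out[hi]))[1]
        i, j = lo, hi
        while i <= j:
            while out[i] < pivot:
                i += 1
            while out[j] > pivot:
                j -= 1
            if i <= j:
                out[i], out[j] = out[j], out[i]
                i += 1
                j -= 1
        stack.append((lo, j))
        stack.append((i, hi))
    return out
```

The design choice is the same one the tip describes: quicksort does the heavy lifting on large ranges, and insertion sort finishes the small partitions where its low overhead wins.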

No API keys required. It runs through the Claude or Codex CLI using your existing subscription. The zero-dependency design means you can drop it into any project folder and run it immediately without touching your environment or package manager.
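Driving a command-line model client from a script usually comes down to one standard-library call. The sketch below shows that shape only. `run_llm_cli` is a hypothetical helper, the `command` list is deliberately left for the caller to supply, and whether a given CLI accepts its prompt on standard input is an assumption to check against that tool's own documentation.

```python
import subprocess


def run_llm_cli(prompt: str, command: list[str], timeout: float = 300.0) -> str:
    """Send `prompt` on stdin to an external CLI and return its stdout.

    `command` is whatever non-interactive invocation your CLI documents;
    it is not specified here because the exact flags are an assumption.
    """
    result = subprocess.run(
        command,
        input=prompt,          # prompt is piped to the child process
        capture_output=True,   # collect stdout and stderr
        text=True,             # work with str instead of bytes
        timeout=timeout,       # fail instead of hanging forever
        check=True,            # raise CalledProcessError on non-zero exit
    )
    return result.stdout.strip()
```

Because the only import is `subprocess`, this approach keeps the "nothing to install" property: the script needs no SDK and no API key of its own, just a CLI that is already logged in.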

The repo is open source and ready to use: github.com/ranausmanai/AutoPrompt

👉 Grab your worst system prompt, write three lines of fitness criteria, and watch generation 5 surprise you. What are you evolving first? 🚀

Source post: “I treated Prompt Engineering by Natural Selection, results are cool” by u/QuantumSeeds in r/PromptEngineering
