Prompt Optimization: 6 Algorithms for Better LLM Results

Yesterday a team shipped six prompt optimization algorithms in a single Python library. Clean, Apache 2.0, free to use. Step 2 of the loop below is the one most people will not expect. This is not another wrapper around a single approach. It is a full toolkit that treats prompt optimization the way ML engineers treat hyperparameter tuning: systematically, with multiple strategies running against real data, and a clear winner determined by measured performance rather than gut instinct.

There is no “best” algorithm.

That is the whole design decision. GEPA, PromptWizard, ProTeGi, Bayesian Search, Meta-Prompt, and Random Search all landed in the same library because different prompts behave differently. Some tasks need broader exploration where the algorithm tries many structurally different phrasings before zeroing in on what works. Others need a judge signal that is tight and task-specific, so the optimizer can make surgical improvements to wording that is already close. The optimizer that wins on extraction might fall flat on a support flow. A classification prompt lives and dies by precision. A generation prompt needs to be evaluated on fluency and relevance in ways that are harder to reduce to a single score. So they shipped all six and let the workflow decide. You pick the algorithm based on how well-defined your eval signal is, not based on a guess about which one sounds more sophisticated.

Before this, the standard prompt improvement loop was: tweak the wording, rerun your examples, hope something improved, ship it, find out three days later it broke something you were not testing. There was no repeatable system. You could not compare two versions of a prompt fairly because you were running them on different examples at different times with no consistent scoring. One person on the team would change a line, see a few outputs look better, and call it done. Someone else would revert it two weeks later because they saw different outputs on a different day. There was no source of truth. No version that definitively outperformed another on the cases that actually mattered. Now there is one.

How the loop works

1. 🧪 Start with a baseline prompt. This is whatever you are currently using in production or the first draft you wrote when building the feature. It does not have to be good. That is the point.
2. Run it against a real dataset. Not a handful of cherry-picked examples you invented. Pull from actual production logs, real user queries, the edge cases that already burned you once. The quality of your dataset determines the quality of the optimization.
3. Score the outputs with your own evals. This is where you define what “good” means for your specific task. Exact match, semantic similarity, a human rubric, or another model acting as a judge. The scoring function is your north star. Build it carefully before you run anything.
4. Let the optimizer generate candidate prompts. Depending on which algorithm you chose, it will explore the space differently. Some mutate your baseline in small increments. Some generate structurally new versions from scratch. Some use results from prior iterations to guide the next candidate. The optimizer is doing the legwork you used to do by hand.
5. Compare candidates side by side. You see the score for each version on each test case. Not an average impression across a few tries. Actual numbers, on actual inputs, with actual outputs next to them so you can see where each version wins and where it fails.
6. Keep the version that wins. Commit it. Note which algorithm found it and what your eval scores were at that point. Now you have a baseline with a score attached to it, not just a prompt file with a vague timestamp and no context.
7. 🔁 Repeat when your data or use case changes. Prompts drift. User behavior shifts. The product evolves. Run the loop again instead of guessing what changed.
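To make the loop above concrete, here is a minimal sketch in plain Python. It is not the library's API: `fake_model`, `evaluate`, `random_search`, the prompt fragments, and the three-row dataset are all hypothetical stand-ins, and the search is a naive random-edit loop rather than any of the six shipped algorithms. In practice the dataset would come from production logs and the model call would hit a real LLM.

```python
import json
import random

def fake_model(prompt: str, text: str) -> str:
    """Hypothetical stand-in for an LLM call; swap in a real client."""
    label = "positive" if ("love" in text or "good" in text) else "negative"
    # The stub only returns a bare label when the prompt asks for one word.
    return label if "one word" in prompt else f"The sentiment is {label}."

def exact_match(output: str, expected: str) -> float:
    """Eval: 1.0 if the output equals the expected label, else 0.0."""
    return float(output.strip().lower() == expected.lower())

def evaluate(prompt, dataset, model, score) -> float:
    """Mean score of one prompt over the whole dataset."""
    return sum(score(model(prompt, x), y) for x, y in dataset) / len(dataset)

def random_search(baseline, fragments, dataset, model, score, rounds=10, seed=0):
    """Try random edits of the baseline and keep the best-scoring prompt."""
    rng = random.Random(seed)
    best_prompt = baseline
    best_score = evaluate(baseline, dataset, model, score)
    for _ in range(rounds):
        candidate = f"{baseline} {rng.choice(fragments)}"
        candidate_score = evaluate(candidate, dataset, model, score)
        if candidate_score > best_score:
            best_prompt, best_score = candidate, candidate_score
    return best_prompt, best_score

# Toy placeholder data; a real run would use logged user queries.
dataset = [
    ("I love this", "positive"),
    ("this is good", "positive"),
    ("it broke twice", "negative"),
]
baseline = "Classify the sentiment."
fragments = ["Answer in one word.", "Be concise.", "Think carefully."]

baseline_score = evaluate(baseline, dataset, fake_model, exact_match)
best_prompt, best_score = random_search(
    baseline, fragments, dataset, fake_model, exact_match
)

# Record what won and how it scored, so the next run has a baseline to beat.
record = {
    "algorithm": "random_search (toy sketch)",
    "baseline_score": baseline_score,
    "best_score": best_score,
    "prompt": best_prompt,
}
print(json.dumps(record, indent=2))
```

Every candidate is scored on the same dataset with the same scoring function, which is what makes two prompt versions comparable at all.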

The mental shift is the real unlock. You stop asking “which wording feels better?” and start asking “which version actually performs better on the cases that matter?” That is a completely different question. One has a measurable answer. This is the same shift that happened in software when teams stopped eyeballing deploy metrics and started using automated performance benchmarks. Nobody argues about which deploy “felt faster.” They look at the numbers. Prompt optimization is finally getting the same treatment, and the teams that internalize this early will have a compounding advantage over everyone still iterating by feel.
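Asking which version performs better on the cases that matter means looking at scores per test case, not just an average. The sketch below shows one way to build that side-by-side table. Everything in it is an assumption made for illustration: `stub_model` is a hypothetical stand-in for a real LLM call, and the prompts and dataset are invented.

```python
def stub_model(prompt: str, text: str) -> str:
    """Hypothetical stand-in for an LLM call; it just collects digits."""
    digits = "".join(ch for ch in text if ch.isdigit())
    return digits if "digits only" in prompt else f"Order number: {digits}"

def exact_match(output: str, expected: str) -> float:
    return float(output.strip() == expected)

def per_case_scores(prompts, dataset, model, score):
    """One row per test case, one score column per prompt version."""
    table = []
    for text, expected in dataset:
        row = {"input": text}
        for name, prompt in prompts.items():
            row[name] = score(model(prompt, text), expected)
        table.append(row)
    return table

# Invented examples; the last one mixes a phone number with an order number.
dataset = [
    ("my order 4521 is late", "4521"),
    ("refund for order 77", "77"),
    ("call me at 555 about order 12", "12"),
]
prompts = {
    "v1": "Extract the order number.",
    "v2": "Extract the order number, digits only.",
}

table = per_case_scores(prompts, dataset, stub_model, exact_match)
for row in table:
    print(row)

# Cases where the new version still fails are the next thing to work on.
still_failing = [row["input"] for row in table if row["v2"] == 0.0]
print("v2 still fails on:", still_failing)
```

Here `v2` beats `v1` on average, but the per-case view also shows the one input it still gets wrong, which an average alone would hide.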

Pro tip: If you have a clear eval signal and a tight dataset, start with Bayesian Search. It uses results from previous iterations to guide where it looks next, so it converges faster when the signal is clean. If your task is messier and harder to score, ProTeGi explores more aggressively and surfaces where your baseline is actually failing, which is often the more valuable output early in the process. Do not guess the algorithm. Run two and compare the results. An hour of comparison now saves weeks of manual iteration later. And if you do not have evals yet, start there first. No optimizer can improve a prompt if there is no definition of “better” in place.
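If you are starting from zero on evals, two simple scoring functions cover a lot of ground: exact match for tasks with one right answer, and token-level F1 for answers that can be partially right. The sketch below is a generic illustration, not the library's eval interface; the function names and the whitespace-based `normalize` step are assumptions.

```python
from collections import Counter

def normalize(text: str) -> list[str]:
    """Lowercase and split on whitespace; real evals may need more cleanup."""
    return text.lower().split()

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized tokens are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall against the reference."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Refund  Approved", "refund approved"))
print(token_f1("the refund was approved", "refund approved"))
```

Exact match gives the optimizer a tight signal for classification and extraction. Token F1 is looser and gives partial credit, which can help when a strict match would score almost every candidate at zero.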

Works well for RAG pipelines, support flows, extraction systems, copilots. Anything where prompt quality is measurable and the output changes real outcomes. If you are building something where a better prompt means fewer escalations, higher conversion, or faster resolution times, you already have an eval signal waiting to be written. The gap between where you are now and a systematically optimized prompt is mostly just setup time.

🔍 Search GitHub for “future-agi” to find the repo and documentation. If you are still tuning prompts by gut feel, this is worth a few hours of your time!

We shipped 6 prompt-optimization algorithms (GEPA, PromptWizard, ProTeGi, Bayesian, Meta-Prompt, Random) in one Apache 2.0 Python library.
by u/Future_AGI in PromptEngineering
