Most prompt engineers spend their afternoons doing the same thing: write a prompt, run it over a batch of examples, squint at the failures, tweak a sentence, run it again.
Repeat until it feels good enough. Repeat again next week when the task changes.
The loop works. But there’s a person stuck inside it who doesn’t need to be there. And every hour spent inside it is an hour not spent on the higher-leverage work: defining what good actually looks like, curating better training examples, or building evaluation pipelines that catch regressions before they reach production.
The manual approach also has a subtler problem. When you tune by feel, the improvements are real but fragile. You fix the failures you happened to inspect. You improve the prompt for the examples you happened to test. Meanwhile, a whole class of edge cases you never thought to check is sitting quietly in your eval set, waiting to surface at the worst time.
The Old Way vs. The Right Way
Old way: you hunt for magic words. You inspect failures by hand, rewrite by feel, and iterate until your afternoon is gone. The prompt gets better because you kept pushing. But the iteration cost is all yours. You are the optimizer. Your intuition is the search algorithm. And human intuition, however sharp, is slow, inconsistent, and not reproducible.
New way: you define the goal, the metric, and the feedback criteria. Then the loop does the hunting.
A developer named Anastasios built exactly this. You hand the system the same things a human would work from: a starting prompt, labeled examples, a scoring metric, and notes on what went wrong. The optimizer runs the iteration itself. Try a variant, score it, read the misses, rewrite the instruction. Repeat. Hundreds of times, in minutes, without anyone watching.
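To make the "try, score, read the misses, rewrite" cycle concrete, here is a minimal sketch of that loop in plain Python. It is a greedy hill-climb, which is much simpler than what GEPA actually does, and it is not code from the repo. `toy_model` and `toy_propose` are hypothetical stand-ins for the LLM call and the LLM-driven rewrite, included only so the sketch runs on its own.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, bool]  # (clause text, True if the clause is unfair)


def accuracy(prompt: str, examples: List[Example],
             run_model: Callable[[str, str], bool]) -> float:
    """Fraction of examples where the model's answer matches the label."""
    hits = sum(run_model(prompt, text) == label for text, label in examples)
    return hits / len(examples)


def optimize(seed_prompt, examples, run_model, propose, rounds=20):
    """Greedy loop: keep a rewritten prompt only if it scores higher."""
    best, best_score = seed_prompt, accuracy(seed_prompt, examples, run_model)
    for _ in range(rounds):
        # Read the misses: every example the current best prompt gets wrong.
        misses = [(t, y) for t, y in examples if run_model(best, t) != y]
        if not misses:
            break  # nothing left to learn from on this example set
        candidate = propose(best, misses)
        cand_score = accuracy(candidate, examples, run_model)
        if cand_score > best_score:
            best, best_score = candidate, cand_score
    return best, best_score


# --- Hypothetical stand-ins so the sketch runs without an LLM ---
def toy_model(prompt: str, text: str) -> bool:
    """Flags a clause if any keyword listed after 'flag:' appears in it."""
    keywords = [k.strip() for k in prompt.split(":", 1)[1].split(",")]
    return any(k and k in text.lower() for k in keywords)


def toy_propose(prompt: str, misses: List[Example]) -> str:
    """Adds the first hint word that shows up in a missed unfair clause."""
    for hint in ("waive", "sole discretion"):
        if hint not in prompt and any(hint in t.lower() for t, y in misses if y):
            return f"{prompt}, {hint}"
    return prompt  # no new idea this round


EXAMPLES: List[Example] = [
    ("We may terminate your account at any time without notice", True),
    ("You waive your right to a jury trial", True),
    ("We will email you a receipt", False),
    ("You can cancel your subscription anytime", False),
]

best_prompt, best_score = optimize("flag: terminate", EXAMPLES,
                                   toy_model, toy_propose)
```

In a real run, `run_model` would call your LLM with the prompt and `propose` would ask a second LLM to rewrite the instruction after reading the misses. The control flow stays the same.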
The difference isn’t just speed. It’s coverage. An automated optimizer will explore prompt variants you would never think to try, combinations of phrasing and instruction order and context framing that feel wrong to a human but happen to unlock better model behavior. That’s the part intuition can’t compete with.
🔬 What the Numbers Look Like
The test case was spotting unfair clauses in Terms-of-Service contracts. It’s a task that’s easy to describe but genuinely hard to get right consistently. Contracts use varied language, buried phrasing, and legal hedging that makes clause detection a real challenge even for careful human reviewers.
Starting point: a bare one-line prompt catching 65% of violations. Functional. Not good enough for anything you’d ship.
After the loop ran? 86.5% average accuracy. Same cheap model. Nobody touched a word.
That’s a 21.5-point accuracy gain from engineering the system, not the prompt. The model didn’t change. The data didn’t change. The only thing that changed was the instruction the model received, arrived at through systematic iteration rather than manual guesswork.
For context, going from 65% to 86.5% on a clause detection task often represents the difference between a prototype and something production-ready. It’s the kind of gap that used to require a bigger model, a fine-tuning run, or weeks of prompt iteration. Here it came from a properly constructed optimization loop running against the same cheap model.
How to Set It Up
The stack is DSPy with its GEPA optimizer, both open source. Here’s the basic shape:
- ⚙️ Write a starting prompt (it doesn’t need to be good, just directional)
- 📊 Prepare labeled examples with expected outputs
- 🎯 Define your scoring metric (accuracy, F1, whatever the task needs)
- Feed everything to the optimizer and let it iterate
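The steps above might look roughly like this in DSPy. Treat it as a sketch based on my reading of the DSPy docs, not the code from the repo: the model names are placeholders, `FlagClause`, `score_with_feedback` and `run_optimization` are names made up for illustration, and the `dspy.GEPA` arguments may differ in your installed version.

```python
from typing import List, Tuple


def score_with_feedback(gold: str, predicted: str) -> Tuple[float, str]:
    """Exact-match score plus a short note the optimizer can read."""
    if gold.strip().lower() == predicted.strip().lower():
        return 1.0, "Correct."
    return 0.0, f"Expected '{gold}' but got '{predicted}'."


def run_optimization(train_rows: List[Tuple[str, str]],
                     val_rows: List[Tuple[str, str]],
                     task_model: str = "openai/gpt-4o-mini",   # placeholder
                     reflection_model: str = "openai/gpt-4o"):  # placeholder
    """Rows are (clause, label) pairs. Needs DSPy installed and an API key."""
    import dspy  # imported here so the helper above works without DSPy

    dspy.configure(lm=dspy.LM(task_model))

    # Step 1: the starting prompt is the docstring. It only has to be directional.
    class FlagClause(dspy.Signature):
        """Decide whether this Terms-of-Service clause is unfair."""
        clause: str = dspy.InputField()
        label: str = dspy.OutputField(desc="'unfair' or 'fair'")

    program = dspy.Predict(FlagClause)

    # Step 2: labeled examples with expected outputs.
    def to_examples(rows):
        return [dspy.Example(clause=c, label=y).with_inputs("clause")
                for c, y in rows]

    # Step 3: the metric returns a score and textual feedback on the miss.
    def metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
        score, feedback = score_with_feedback(gold.label, pred.label)
        return dspy.Prediction(score=score, feedback=feedback)

    # Step 4: hand everything to the optimizer and let it iterate.
    optimizer = dspy.GEPA(metric=metric, auto="light",
                          reflection_lm=dspy.LM(reflection_model))
    return optimizer.compile(program,
                             trainset=to_examples(train_rows),
                             valset=to_examples(val_rows))
```

The feedback string is where the "notes on what went wrong" go. A bare number tells the optimizer that a variant failed, while the text gives the reflection model something to reason about when it rewrites the instruction.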
The labeled examples are the most important input. You need enough coverage to expose different failure modes, not just a handful of easy cases. Thirty to fifty diverse examples will get you further than two hundred that all look the same. Think about the edge cases that matter in your specific task and make sure they’re represented.
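One cheap way to check coverage before you run anything is to tag each example with the failure mode it is meant to exercise and count the tags. A minimal sketch, assuming each example is a dict with a `category` key; the category names and the threshold of three are made up for illustration.

```python
from collections import Counter
from typing import Dict, List


def coverage_gaps(examples: List[dict], expected_categories: List[str],
                  min_per_category: int = 3) -> Dict[str, int]:
    """Return each expected category that has too few examples, with its count."""
    counts = Counter(ex["category"] for ex in examples)
    # Counter returns 0 for categories that never appear, so gaps show up too.
    return {cat: counts[cat] for cat in expected_categories
            if counts[cat] < min_per_category}


# Hypothetical categories for the Terms-of-Service task.
EXPECTED = ["arbitration", "unilateral_change", "liability_limit"]

SAMPLE = [
    {"text": "Disputes go to binding arbitration.", "category": "arbitration"},
    {"text": "You agree to arbitrate all claims.", "category": "arbitration"},
    {"text": "Arbitration is the sole remedy.", "category": "arbitration"},
    {"text": "We may change these terms at any time.", "category": "unilateral_change"},
]

gaps = coverage_gaps(SAMPLE, EXPECTED)
```

Here `gaps` reports that `unilateral_change` is thin and `liability_limit` is missing entirely, which is exactly the kind of blind spot that two hundred lookalike examples would hide.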
Your scoring metric is where you encode what “good” actually means. If you care about precision over recall, build that into the metric. If false negatives are more costly than false positives, weight accordingly. The optimizer will optimize for exactly what you define, so be precise. Vague metrics produce vague improvements.
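Here is one way to encode "false negatives cost more" in the metric itself: an F-beta score, where `beta` above 1 weights recall more heavily and `beta` below 1 favors precision. This is a standard formula rather than anything specific to the repo, and the two prediction lists are invented to show how the choice of `beta` flips which prompt wins.

```python
from typing import Sequence


def f_beta(gold: Sequence[bool], pred: Sequence[bool], beta: float = 1.0) -> float:
    """F-beta over boolean labels; beta > 1 penalizes missed positives harder."""
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


# Four unfair clauses followed by four fair ones.
GOLD = [True, True, True, True, False, False, False, False]

# A cautious prompt: misses one unfair clause, never raises a false alarm.
CAUTIOUS = [True, True, True, False, False, False, False, False]

# An eager prompt: catches every unfair clause but flags two fair ones.
EAGER = [True, True, True, True, True, True, False, False]

cautious_wins_on_f1 = f_beta(GOLD, CAUTIOUS, beta=1) > f_beta(GOLD, EAGER, beta=1)
eager_wins_on_f2 = f_beta(GOLD, EAGER, beta=2) > f_beta(GOLD, CAUTIOUS, beta=2)
```

Under plain F1 the cautious prompt scores higher. Under F2 the eager one does. The optimizer will chase whichever of those you hand it, so pick the one that matches the real cost of a miss.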
The repo is on GitHub (github.com/anastasiosyal/dspy-gepa-optimizer) and runs on your own data. Full writeup with methodology is on Medium. Setup is documented clearly and the configuration is lightweight enough that you can have a first run completed in an afternoon.
The Actual Shift Here
This isn’t about removing humans from AI workflows. It’s about moving them to the right spot: outside the loop, setting objectives and metrics, not inside it, hunting for phrasing.
The human work that matters is upstream. Deciding which task to optimize. Curating examples that represent the real distribution. Choosing the metric that captures what actually matters to the end user. Reviewing the final optimized prompt to make sure it isn’t exploiting some quirk of the eval set that won’t generalize. That’s all high-judgment work. The phrasing iteration in the middle? That’s not.
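A simple guard against an optimized prompt that only exploits quirks of the eval set is to hold some examples back, never show them to the optimizer, and compare scores afterwards. A minimal sketch; the 30% holdout and the 0.05 gap threshold are arbitrary assumptions, not values from the writeup.

```python
import random
from typing import List, Sequence, Tuple


def split_holdout(examples: Sequence, holdout_fraction: float = 0.3,
                  seed: int = 0) -> Tuple[List, List]:
    """Shuffle deterministically, then reserve a slice the optimizer never sees."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    # Returns (tuning set, held-out set).
    return shuffled[n_holdout:], shuffled[:n_holdout]


def looks_overfit(tuned_score: float, holdout_score: float,
                  max_gap: float = 0.05) -> bool:
    """True if the prompt does much better on tuning data than on held-out data."""
    return (tuned_score - holdout_score) > max_gap


tuning_set, held_out = split_holdout(range(40))
```

Score the final prompt on both sets with the same metric. If `looks_overfit` comes back true, read the prompt closely before shipping it.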
If you’re still hand-tuning prompts by feel, you’re doing the job the system should be doing for you. Your time has a higher use than being a slow, tired optimizer running in a human body.
Build the loop once. Let it tune for you.
Stop tuning prompts by hand. Engineer the loop that tunes them
by u/Anastasiosy in PromptEngineering