Your Prompt Failures Are Data. VizPy Mines Them Automatically.

A new optimizer dropped this week. It doesn’t ask you to write better prompts. It reads your failures and rewrites the prompts for you.

VizPy is an automatic prompt optimizer for LLM pipelines. No manual tweaking, no prompt archaeology. Feed it your program and a training set, and it learns from what broke. Most teams burn hours staring at bad outputs, tweaking wording, running another eval, still missing the pattern. VizPy treats that whole cycle as a data problem, not a creativity problem. Your failure log is already a signal. The optimizer just knows how to read it.

What’s New

Two methods. Each built for a different job.

  • ContraPrompt mines failure-to-success pairs and extracts reasoning rules. Built for multi-hop QA, classification, and compliance. Results: +29% on HotPotQA, +18% on GDPR-Bench vs GEPA. The core idea is contrastive learning applied to prompt space: it looks at cases where your pipeline got the answer right and wrong on similar inputs, then identifies what changed in the reasoning chain. From those patterns, it synthesizes explicit rules and injects them back into your prompt. Think of it as automated failure archaeology, except it surfaces actionable logic, not just observations.
  • PromptGrad takes a gradient-inspired approach to failure analysis. It is the better fit for generation tasks and math, where retries tend to diverge instead of converge. Instead of looking for contrastive pairs, it treats the prompt like a parameter under gradient descent, in a loose sense: it nudges the wording iteratively based on where outputs degrade. If your pipeline handles open-ended summarization or multi-step reasoning that doesn’t reduce to a clean right-or-wrong label, PromptGrad is the one to reach for.
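
To make the contrastive idea concrete, here is a minimal sketch of how failure-to-success pairs might be mined from an eval log. This is not VizPy's implementation, which the announcement does not spell out. The `Record` type, the `mine_pairs` helper, and the word-overlap similarity are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Record:
    """One eval-log entry (hypothetical shape, not a VizPy type)."""
    question: str
    reasoning: str
    correct: bool


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two strings, from 0.0 to 1.0."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def mine_pairs(log: list[Record], min_sim: float = 0.5) -> list[tuple[Record, Record]]:
    """Pair each failure with every success whose input is similar enough."""
    failures = [r for r in log if not r.correct]
    successes = [r for r in log if r.correct]
    return [
        (f, s)
        for f, s in product(failures, successes)
        if jaccard(f.question, s.question) >= min_sim
    ]


# Toy eval log: two near-identical multi-hop questions, one unrelated question.
log = [
    Record("Who directed the film that won Best Picture in 1998?",
           "Looked up the film, then its director.", True),
    Record("Who directed the film that won Best Picture in 2003?",
           "Guessed the director without finding the film.", False),
    Record("What is the capital of France?", "Recalled directly.", True),
]

pairs = mine_pairs(log)
for failure, success in pairs:
    print("FAILED:", failure.reasoning)
    print("WORKED:", success.reasoning)
```

Each pair puts a failed reasoning trace next to a successful one on a similar input. That side-by-side is the raw material from which a rule such as "find the intermediate entity before answering" could be written.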

The Twist

Both are drop-in compatible with DSPy programs. You’re not rebuilding anything. The integration is two lines. That’s the part that changes the math on adoption. Most optimization tooling requires you to restructure your pipeline or swap out your abstractions entirely. VizPy slots in on top of what you already have. If you’re running DSPy in production, there’s no migration cost, no rewrite, no rethinking your module structure. You add the optimizer, point it at your existing program, and let it run. The output is a better-prompted version of the same program you already trust.

How to Run It in 3 Steps

  1. 🔧 Install VizPy and point it at your existing DSPy program. No restructuring required. Your current modules, signatures, and chains stay intact.
  2. 📊 Define your metric and pass in a training set that includes failure cases. This is where the quality of your input data matters. The more failure examples you include, the sharper the contrastive signal ContraPrompt has to work with. Even a small set of 20 to 30 labeled examples with a mix of successes and failures is enough to start seeing gains.
  3. 🚀 Run optimizer.compile() and let it extract the reasoning rules automatically. The compiled output is a new version of your program with updated prompts. Swap it in, run your existing eval suite, and compare scores before shipping to production.

    import vizpy  # assumed import name, matching the vizpy.* call below

    # program, my_metric, and trainset are your existing DSPy program, metric, and examples
    optimizer = vizpy.ContraPromptOptimizer(metric=my_metric)
    compiled = optimizer.compile(program, trainset=trainset)
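
Steps 2 and 3 can be sketched as follows. The `(example, pred, trace=None)` metric signature follows the DSPy convention. Everything else here (the `exact_match` metric, the `score` helper, and the two stub programs standing in for your original and optimized pipelines) is a hypothetical placeholder so the snippet runs without DSPy or VizPy installed.

```python
from types import SimpleNamespace


def exact_match(example, pred, trace=None) -> bool:
    """Categorical metric: True when the predicted answer matches the label."""
    return example.answer.strip().lower() == pred.answer.strip().lower()


def score(program, dataset, metric) -> float:
    """Fraction of examples on which the program satisfies the metric."""
    hits = sum(bool(metric(ex, program(question=ex.question))) for ex in dataset)
    return hits / len(dataset)


# Toy held-out set; in practice this is your existing eval suite.
devset = [
    SimpleNamespace(question="2 + 2", answer="4"),
    SimpleNamespace(question="3 + 3", answer="6"),
]


def baseline_stub(question):
    """Stand-in for the original program: always answers '4'."""
    return SimpleNamespace(answer="4")


def compiled_stub(question):
    """Stand-in for the optimized program: answers both toy questions."""
    return SimpleNamespace(answer={"2 + 2": "4", "3 + 3": "6"}[question])


before = score(baseline_stub, devset, exact_match)
after = score(compiled_stub, devset, exact_match)
ship = after > before  # only swap in the new program if it scores higher
print(f"before={before:.2f} after={after:.2f} ship={ship}")
```

Scoring both versions on the same held-out set, rather than on the training examples, keeps the comparison honest.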

Pro Tips

  • Match the method to the task. ContraPrompt for structured reasoning and classification. PromptGrad for open-ended generation and math. Wrong method on the wrong task and you’ll leave most of the gains on the table. A quick heuristic: if your metric is categorical (correct/incorrect, pass/fail, label match), start with ContraPrompt. If your metric involves a rubric or a judge model scoring output quality, PromptGrad is the better fit.
  • Your worst-performing pipeline is the best test case. That’s where failure pairs are densest and where ContraPrompt will show the biggest lift. Don’t start with your best-performing module. Start with the one that’s dragging down your overall pipeline score. The bigger the gap between good and bad examples, the more signal the optimizer has to work with.
  • Version your compiled outputs. After each optimization run, save the compiled program separately. If a new training batch introduces noisy examples, you want to be able to roll back to the last clean version without rerunning the full optimization cycle from scratch.
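
One way to handle the versioning tip is a small helper that saves each compiled program under a timestamped name and logs its eval score. This is a sketch under assumptions: DSPy modules expose a `save(path)` method, but whether VizPy's compiled output does is not confirmed by the announcement, and `save_version`, `best_version`, and the manifest layout are hypothetical.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def save_version(program, root: Path, score: float) -> Path:
    """Save a compiled program under a timestamped name and record its score."""
    root.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
    path = root / f"compiled_{stamp}.json"
    program.save(str(path))  # assumes a DSPy-style save(path) method
    # Append one JSON line per run so the history is never overwritten.
    with (root / "manifest.jsonl").open("a", encoding="utf-8") as fh:
        fh.write(json.dumps({"file": path.name, "score": score}) + "\n")
    return path


def best_version(root: Path) -> Path:
    """Return the saved file with the highest recorded score, for rollback."""
    lines = (root / "manifest.jsonl").read_text(encoding="utf-8").splitlines()
    entries = [json.loads(line) for line in lines if line]
    return root / max(entries, key=lambda e: e["score"])["file"]
```

If a later run on a noisy batch scores lower, `best_version(root)` points at the file to reload instead of rerunning the optimizer.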

Tool of the Day

🛠️ VizPy, automatic prompt optimizer, DSPy-compatible, two methods for two job types. vizpy.vizops.ai

If your LLM pipeline is underperforming, you probably have the training data to fix it already. The failures are sitting in your eval logs right now. VizPy just makes the extraction automatic. Instead of spending a sprint on manual prompt iteration, you run the optimizer, review the synthesized rules, and ship a measurably better pipeline. Worth a spin on your next optimization cycle. 👉 Start with your worst module and let the failure pairs do the work.

VizPy: automatic prompt optimizer for LLM pipelines – learns from failures, DSPy-compatible (ContraPrompt +29% HotPotQA vs GEPA)
by u/se4u in PromptEngineering