What if your worst LLM outputs could quietly teach your system to write better prompts? A new open-source tool called VizPy does exactly that. Built by u/se4u, VizPy is an automatic prompt optimizer that learns from your pipeline failures without any manual tweaking on your end. No prompt engineering PhD required, no spreadsheet of handwritten rewrites.
What Makes VizPy Different
Most prompt optimization approaches require you to handcraft examples or rewrite instructions through trial and error. You stare at a bad output, guess at what the model misunderstood, edit the prompt, rerun, and hope for a different result. It works, eventually, but it’s slow and relies entirely on your intuition about why the model failed. VizPy flips the script: it mines your actual failure-to-success pairs and extracts reasoning rules automatically. The tool ships with two distinct optimization methods, each designed for different types of tasks.
ContraPrompt is the headliner. It analyzes cases where your LLM failed, compares them against successes, and distills the patterns into concrete reasoning rules. The author reports a +29% improvement on HotPotQA and +18% on GDPR-Bench compared to GEPA, which is a strong baseline in automatic prompt engineering. If you work with multi-hop question answering, classification, or compliance tasks, this is the method to watch. Compliance tasks in particular benefit because the failure patterns tend to be systematic: the model consistently misses a specific clause type or misreads jurisdiction scoping, and ContraPrompt can surface that pattern from just a handful of examples.
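To make the failure-versus-success comparison concrete, here is a minimal sketch of the pairing step in plain Python. It is not VizPy's implementation, which the post does not describe in detail. The function name, the record keys (`input`, `output`, `score`), and the pass threshold are all assumptions made for illustration.

```python
from collections import defaultdict


def mine_contrastive_pairs(runs, threshold=1.0):
    """Pair each failed attempt with a successful attempt on the same input.

    `runs` is a list of dicts with hypothetical keys "input", "output",
    and "score" (higher is better). A run passes if score >= threshold.
    """
    by_input = defaultdict(lambda: {"fail": [], "ok": []})
    for run in runs:
        bucket = "ok" if run["score"] >= threshold else "fail"
        by_input[run["input"]][bucket].append(run["output"])

    pairs = []
    for text, group in by_input.items():
        # Only inputs that have both a failure and a success yield a pair.
        for bad in group["fail"]:
            for good in group["ok"]:
                pairs.append({"input": text, "failure": bad, "success": good})
    return pairs


runs = [
    {"input": "q1", "output": "answered from the first passage only", "score": 0.0},
    {"input": "q1", "output": "combined both passages", "score": 1.0},
    {"input": "q2", "output": "correct on the first try", "score": 1.0},
]
pairs = mine_contrastive_pairs(runs)
```

In this toy data only `q1` has both a failing and a passing attempt, so it is the only input that produces a pair. A real optimizer would then hand such pairs to an LLM to summarize what the successful attempt did differently.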
PromptGrad takes a gradient-inspired approach to failure analysis. Instead of mining pairs, it treats prompt improvement almost like a loss function optimization, nudging the prompt iteratively toward outputs that score higher on your metric. The creator recommends this one for generation tasks and math problems where simple retries don’t converge toward a correct answer. Think creative writing consistency, step-by-step arithmetic, or structured output generation where the failure mode is subtle drift rather than a hard wrong answer.
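The "nudge the prompt toward a higher score" idea can be sketched as a simple greedy search. This is a generic hill-climbing loop, not an actual gradient and not PromptGrad's algorithm. The names `refine_prompt`, `propose_edits`, and `score` are hypothetical, and the toy scoring function stands in for running a real pipeline against your metric.

```python
def refine_prompt(prompt, propose_edits, score, steps=5):
    """Greedy iterative refinement: keep an edit only if the score improves."""
    best, best_score = prompt, score(prompt)
    for _ in range(steps):
        improved = False
        for candidate in propose_edits(best):
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best, best_score = candidate, candidate_score
                improved = True
        if not improved:
            break  # no candidate beat the current prompt, so stop early
    return best, best_score


# Toy stand-ins: a real setup would ask an LLM for edits and score them
# by running the pipeline on a dev set.
HINTS = ["Show each step.", "State the final answer on its own line."]


def propose_edits(prompt):
    return [f"{prompt} {hint}" for hint in HINTS if hint not in prompt]


def score(prompt):
    return sum(hint in prompt for hint in HINTS) / len(HINTS)


best, best_score = refine_prompt("Solve the problem.", propose_edits, score)
```

The loop stops as soon as no proposed edit improves the score, which is the behavior you want when plain retries have stopped converging.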
The Twist: It Plugs Straight Into DSPy
Here’s what caught my attention. VizPy isn’t asking you to rebuild your pipeline. Both optimizers are drop-in compatible with DSPy programs, meaning if you already use DSPy for structured LLM workflows, you can swap in VizPy’s optimizer with two lines of code:
optimizer = vizpy.ContraPromptOptimizer(metric=my_metric)
compiled = optimizer.compile(program, trainset=trainset)
That’s it. Your existing metric function, your existing training set, your existing program structure. VizPy handles the optimization loop behind the scenes. For teams already invested in DSPy’s module system, this is a zero-friction upgrade path rather than a competing framework demanding a rewrite.
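For readers who have not written a metric yet, here is a minimal sketch of one. DSPy metrics conventionally take `(example, pred, trace=None)` and return a bool or a number. The `answer` field name is an assumption, and `SimpleNamespace` objects stand in for real DSPy examples and predictions so the snippet runs without any dependencies.

```python
from types import SimpleNamespace


def exact_match_metric(example, pred, trace=None):
    """Return True when the predicted answer matches the gold answer.

    Uses the (example, pred, trace=None) shape that DSPy metrics follow.
    The `answer` attribute is an assumed field name.
    """
    gold = example.answer.strip().lower()
    guess = pred.answer.strip().lower()
    return gold == guess


# Stand-ins for a DSPy example and two predictions.
example = SimpleNamespace(question="What is the capital of France?", answer="Paris")
good_pred = SimpleNamespace(answer=" paris ")
bad_pred = SimpleNamespace(answer="Lyon")
```

A function with this shape is what you would pass as `metric=my_metric` in the two-line snippet above.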
🔧 How to Get Started
- Install VizPy from the official site
- Define your evaluation metric (accuracy, F1, whatever fits your task). If you already have one for DSPy, it works as-is
- Prepare a small trainset of input-output examples, ideally including cases you know the model currently gets wrong
- Pick your optimizer: ContraPrompt for QA and classification, PromptGrad for generation and math
- Run optimizer.compile() and let VizPy learn from the failures in your dataset
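The trainset step is the one most likely to need custom code, so here is a rough sketch of assembling a small set that deliberately includes known failures. The function name, the `passed` key, and the default size and failure share are all assumptions, not anything VizPy prescribes.

```python
import random


def build_trainset(labeled, size=50, failure_share=0.5, seed=0):
    """Sample a small trainset that keeps known failures well represented.

    `labeled` is a list of dicts with a hypothetical boolean key "passed"
    recording whether the current prompt handled that example correctly.
    """
    rng = random.Random(seed)
    failures = [ex for ex in labeled if not ex["passed"]]
    successes = [ex for ex in labeled if ex["passed"]]
    rng.shuffle(failures)
    rng.shuffle(successes)

    # Take failures first, then fill the remaining slots with successes.
    n_fail = min(len(failures), int(size * failure_share))
    n_ok = min(len(successes), size - n_fail)
    return failures[:n_fail] + successes[:n_ok]


labeled = [{"id": i, "passed": i >= 20} for i in range(220)]
trainset = build_trainset(labeled, size=50)
```

With 20 labeled failures and 200 successes, this keeps all 20 failures and fills the other 30 slots with successes.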
💡 Pro Tips
- Start with ContraPrompt if you’re unsure which method fits. Its failure-pair mining approach targets structured tasks such as QA and classification, and it is the method behind the author-reported gains on HotPotQA and GDPR-Bench
- Quality of your trainset matters more than size. A focused set of 50-100 examples with clear success/failure signals will outperform a noisy dataset of thousands. If you’re building that trainset from scratch, run your current prompt on a representative sample, manually label the failures, and use those labels as your ground truth
- Combine with existing DSPy optimizers. Nothing stops you from running VizPy’s optimizer first, then layering a BootstrapFewShot pass on top for additional gains. Think of it as two complementary passes: VizPy fixes the reasoning rules, BootstrapFewShot adds demonstration-level examples to reinforce them
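The layering idea in the last tip amounts to feeding one optimizer's compiled program into the next. Below is a minimal sketch that assumes every optimizer exposes `compile(program, trainset=...)`, which matches the VizPy call shown earlier and the way DSPy's BootstrapFewShot is typically invoked. The helper name and the stub optimizer class are hypothetical and exist only so the snippet runs on its own.

```python
def compile_in_sequence(program, optimizers, trainset):
    """Run optimizers one after another, passing each result to the next."""
    for optimizer in optimizers:
        program = optimizer.compile(program, trainset=trainset)
    return program


class StubOptimizer:
    """Stand-in optimizer that records which pass ran (illustration only)."""

    def __init__(self, tag):
        self.tag = tag

    def compile(self, program, trainset):
        return program + [self.tag]


# The "program" here is just a list so the order of passes is visible.
compiled = compile_in_sequence(
    [], [StubOptimizer("rules"), StubOptimizer("demos")], trainset=[]
)
```

In a real pipeline the list would hold the VizPy optimizer first and a BootstrapFewShot instance second. Re-check your metric after each pass, since a second pass is not guaranteed to help.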
Where It Fits in the Landscape
Automatic prompt optimization is heating up. Tools like DSPy’s built-in MIPRO and GEPA have set strong baselines, but VizPy’s contrastive approach (learning specifically from what went wrong) adds a dimension that pure example-based optimization misses. The reported +29% HotPotQA improvement over GEPA is notable because HotPotQA requires multi-step reasoning, exactly the kind of task where generic prompt templates fall apart. The model needs to chain evidence across sources, and a prompt tuned only on successes never learns which chain-of-reasoning patterns lead to dead ends.
If your LLM pipeline has a “good enough” accuracy ceiling you can’t seem to break through, VizPy’s failure-mining approach might be the unlock you need.
The tool is still early (the Reddit post is fresh and community feedback is just starting), so expect rough edges and evolving documentation. But the core idea is sound: your failures contain more signal than your successes, and VizPy knows how to extract it. The teams that move on this now will have a tuned pipeline while everyone else is still iterating by hand.
📌 Check out VizPy on vizpy.vizops.ai and give the creator feedback on Product Hunt. If you’re running DSPy pipelines, this could save you hours of manual prompt iteration 🚀
Source: “VizPy: automatic prompt optimizer for LLM pipelines – learns from failures, DSPy-compatible (ContraPrompt +29% HotPotQA vs GEPA)”, posted by u/se4u in r/PromptEngineering