Your LLM Failures Just Became Your Best Training Data

Fresh off ProductHunt, a new open-source tool just flipped the script on prompt engineering. Instead of guessing why your LLM pipeline keeps failing, VizPy actually learns from those failures and rewrites your prompts for you.

The tool comes from u/se4u on Reddit, and the core idea is deceptively simple: take the cases where your pipeline failed, compare them with the cases where it succeeded, and extract reasoning rules automatically. No manual tweaking. No staring at outputs wondering what went wrong. No three-hour debugging sessions that end with you changing a single adjective and hoping for the best.

What VizPy Actually Does

VizPy is an automatic prompt optimizer for LLM pipelines. It ships with two distinct methods, each designed for different failure modes:

  • ContraPrompt mines failure-to-success pairs to extract reasoning rules. Think multi-hop QA, classification, compliance checks. The reported benchmarks are striking: +29% on HotPotQA and +18% on GDPR-Bench compared to GEPA. As a rough illustration, if a gain of that size were relative and carried over to your task, a classifier getting 7 out of 10 answers right would land near 9 (0.70 × 1.29 ≈ 0.90), without you writing a single new instruction manually.
  • PromptGrad takes a gradient-inspired approach to failure analysis. This one shines on generation tasks and math problems where simple retries just don’t converge. If your pipeline keeps producing outputs that are almost right but miss a key step in a chain of reasoning, PromptGrad is where to start.
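
To make the idea of failure-to-success pairs concrete, here is a minimal sketch of how such pairs could be mined from logged pipeline runs. This is not VizPy’s implementation: the Attempt record, the mine_contrast_pairs helper, and the 0.5 pass threshold are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    """One logged pipeline run (hypothetical record, not a VizPy type)."""
    input_id: str
    output: str
    score: float  # metric value in [0, 1]


def mine_contrast_pairs(attempts, threshold=0.5):
    """Pair each failed attempt with the best successful one on the same input.

    Illustrative only: an attempt "fails" when its score is below `threshold`.
    Returns a list of (failed, succeeded) tuples.
    """
    by_input = defaultdict(list)
    for attempt in attempts:
        by_input[attempt.input_id].append(attempt)

    pairs = []
    for runs in by_input.values():
        failures = [a for a in runs if a.score < threshold]
        successes = [a for a in runs if a.score >= threshold]
        if not successes:
            continue  # no contrast available for this input
        best = max(successes, key=lambda a: a.score)
        pairs.extend((failed, best) for failed in failures)
    return pairs


attempts = [
    Attempt("q1", "Paris, Texas", 0.0),
    Attempt("q1", "Paris", 1.0),
    Attempt("q2", "42", 1.0),       # never failed, so no pair
    Attempt("q3", "unknown", 0.0),  # never succeeded, so no pair
]
pairs = mine_contrast_pairs(attempts)
for failed, succeeded in pairs:
    print(f"{failed.input_id}: {failed.output!r} -> {succeeded.output!r}")
```

Only inputs that have both a failure and a success yield a pair, which is why a mix of hard and solvable examples matters for this style of analysis.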

The Twist Nobody Expected

Here’s what makes VizPy genuinely different from yet another prompt optimization library. Most optimizers treat your prompt as a black box and try variations until something sticks. VizPy does the opposite. It performs a structured analysis of why specific inputs failed, then builds explicit reasoning rules from those patterns.

That means you don’t just get a better prompt. You get a prompt that specifically addresses the failure modes in your data. The optimizer isn’t guessing. It’s diagnosing. Think of it like the difference between a doctor who prescribes the same pill to everyone versus one who orders a blood panel first.

And the whole thing is drop-in compatible with DSPy, so if you’re already running DSPy programs, you don’t need to rewrite anything.

🔧 How to Get Started

  1. Install VizPy from the official site at vizpy.vizops.ai. Check the docs for your Python version requirements.
  2. Pick your optimizer. If your task involves reasoning, classification, or compliance, go with ContraPrompt. For generation or math, pick PromptGrad.
  3. Define your metric function so the optimizer knows what “success” looks like for your pipeline. This is the most important step. A metric like exact_match works for QA tasks; for classification you might use F1 score. Be precise here because VizPy’s entire analysis revolves around this signal.
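  4. Compile your program with three lines of code (my_metric, program, and trainset are your own metric, program, and training examples):
    import vizpy
    optimizer = vizpy.ContraPromptOptimizer(metric=my_metric)
    compiled = optimizer.compile(program, trainset=trainset)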
  5. Run your compiled program and compare results against your baseline. If the optimization worked, the failure-mined rules should show up as better accuracy on your weakest inputs.
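
The metric from step 3 can be sketched as follows. The sketch assumes the DSPy convention of metric(example, prediction, trace=None); whether VizPy expects exactly this signature is an assumption, so check its docs. The answer field and the SimpleNamespace stand-ins are hypothetical placeholders for real examples and predictions. Note that F1 is computed over a whole set of predictions rather than per example.

```python
from types import SimpleNamespace


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(text.lower().split())


def exact_match(example, prediction, trace=None):
    """Per-example metric: True when predicted and gold answers agree."""
    return normalize(example.answer) == normalize(prediction.answer)


def f1_for_label(gold_labels, predicted_labels, positive):
    """Aggregate F1 for one class over parallel lists of labels."""
    pairs = list(zip(gold_labels, predicted_labels))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Stand-ins for a gold example and a model prediction (hypothetical data).
gold = SimpleNamespace(answer="Paris")
pred = SimpleNamespace(answer="  paris ")
match = exact_match(gold, pred)

spam_f1 = f1_for_label(
    ["spam", "ham", "spam", "ham"],
    ["spam", "spam", "ham", "ham"],
    positive="spam",
)
print(match, spam_f1)
```

Normalizing before comparison is a design choice: it keeps the optimizer from treating casing or stray whitespace as a reasoning failure.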

📌 Pro Tips

  • Start with your worst-performing subset. ContraPrompt extracts the most value when it has clear failure-to-success contrasts. Feed it the inputs where your current prompt consistently chokes. A curated set of 50 to 100 hard examples will often do more for you than a random sample of 500 average ones.
  • Don’t mix task types in one optimizer run. If you have classification AND generation in the same pipeline, optimize them separately. ContraPrompt for the classifier, PromptGrad for the generator.
  • Benchmark against your manual prompts first. The +29% HotPotQA improvement is compared to GEPA (another automatic optimizer). Your hand-tuned prompts might already be better or worse, so establish your own baseline before celebrating.
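
The first and last tips can be combined into a small evaluation harness: score your baseline, pull out the examples it handles worst, and compare any candidate against the same data. The sketch below uses only plain Python. The accuracy and hardest_examples helpers are hypothetical names, and the toy lookup tables stand in for your hand-tuned and optimized programs.

```python
def accuracy(program, dataset, metric):
    """Mean metric score of `program` over (input, gold) pairs."""
    scores = [float(metric(gold, program(x))) for x, gold in dataset]
    return sum(scores) / len(scores)


def hardest_examples(program, dataset, metric, k):
    """Return the k examples on which `program` scores lowest."""
    scored = [
        (float(metric(gold, program(x))), i)
        for i, (x, gold) in enumerate(dataset)
    ]
    scored.sort()  # lowest score first; ties keep dataset order
    return [dataset[i] for _, i in scored[:k]]


def same_answer(gold, predicted):
    """Toy metric: exact string equality."""
    return gold == predicted


dataset = [("2+2", "4"), ("3+3", "6"), ("5+5", "10"), ("7+7", "14")]

# Toy stand-ins: in real use these would be your programs, not lookup tables.
baseline_answers = {"2+2": "4", "3+3": "7", "5+5": "10", "7+7": "15"}
candidate_answers = {"2+2": "4", "3+3": "6", "5+5": "10", "7+7": "15"}
baseline = baseline_answers.get
candidate = candidate_answers.get

baseline_acc = accuracy(baseline, dataset, same_answer)
candidate_acc = accuracy(candidate, dataset, same_answer)
hard = hardest_examples(baseline, dataset, same_answer, k=2)
print(f"baseline={baseline_acc:.2f} candidate={candidate_acc:.2f} hard={hard}")
```

Keeping the evaluation set fixed across both runs is what makes the comparison meaningful; evaluating on held-out data rather than the training subset guards against overfitting to the hard examples.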

What to Watch Out For

The project is fresh, so community adoption is still early (2 upvotes at time of writing, no comments yet). That means limited real-world battle testing outside the team’s own benchmarks. The DSPy compatibility is a strong signal of serious engineering, but you’ll want to validate those benchmark numbers on your own datasets before going all-in.

Also worth noting: automatic prompt optimization works best when you have a clear, measurable metric. If your task is subjective (“make this sound more professional”), you’ll need to invest time defining what your metric function actually evaluates. Vague metrics produce vague improvements. The more precise your success criteria, the more targeted the rules VizPy extracts from your failures.
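
For a subjective goal like “make this sound more professional”, one way to get a measurable signal is to break the goal into explicit checks. The following is a rough sketch under that assumption: the word list, the checks, and the professional_tone_score name are all illustrative, not anything VizPy provides, and a real rubric would need tuning to your domain.

```python
import re

# Illustrative list only; extend it for your own domain.
CASUAL_WORDS = {"gonna", "wanna", "kinda", "hey", "lol"}


def professional_tone_score(text):
    """Fraction of hand-picked checks that `text` passes, in [0, 1]."""
    words = re.findall(r"[a-z']+", text.lower())
    checks = [
        not any(w in CASUAL_WORDS for w in words),  # no casual vocabulary
        "!" not in text,                            # no exclamation marks
        not re.search(r"\b[A-Z]{4,}\b", text),      # no all-caps shouting
        10 <= len(words) <= 120,                    # reasonable length
    ]
    return sum(checks) / len(checks)


casual = "hey!!! gonna send the REPORT later lol"
formal = (
    "Please find the quarterly report attached. "
    "I am happy to answer any questions you may have."
)
casual_score = professional_tone_score(casual)
formal_score = professional_tone_score(formal)
print(casual_score, formal_score)
```

Each check is crude on its own, but a score built from named checks tells you which criterion an output failed, which is the kind of precise signal a failure-driven optimizer can work with.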

Why This Matters

Prompt engineering is still mostly vibes. You tweak a word here, add an instruction there, and hope the output improves. VizPy represents a shift toward treating prompt optimization like actual engineering, with feedback loops, failure analysis, and measurable improvements.

The fact that it works on top of DSPy means it plugs into an ecosystem that’s already gaining serious traction in production LLM pipelines. As more teams move from one-off prompts to structured programs, tools like this become the difference between shipping reliable AI features and perpetually firefighting edge cases.

🚀 Curious about the technical details or want to share your own prompt optimization war stories? Head over to the original discussion on r/PromptEngineering and join the conversation.

