Revolutionize Multi-Agent Prompt Optimization with CANTANTE

One researcher shipped a multi-agent prompt optimizer this week. The interesting bit is how it handles credit assignment: not with a one-off hack, but with a proper system that treats the whole pipeline as a trainable object rather than a collection of vibes you iterate on manually.

Tuning prompts across a pipeline is genuinely painful. You fix Agent A, it breaks Agent B, and tracing why takes forever. Most teams just accept fragile pipelines and call it a demo problem. The standard workaround is to manually test combinations, read reasoning traces, form a theory about what went wrong, update the prompt, and repeat the cycle until something sticks or you run out of patience. That process does not scale past three agents. Past five, it becomes guesswork dressed up as engineering. The pain compounds because failures in multi-agent systems are often emergent: no single agent did anything obviously wrong, but the combination produced a bad outcome. You cannot fix what you cannot locate, and locating it manually is the bottleneck that slows every serious deployment.

CANTANTE treats prompts as learnable parameters, not strings you hand-write. You feed it a task reward signal, it figures out which agents deserve credit or blame, and updates their prompts accordingly. Think of it less like a prompt editor and more like a gradient descent loop where the parameters happen to be natural language instructions instead of floating-point weights. The system keeps the logic of your pipeline intact while surgically improving the instructions each agent is working from.

Credit assignment is the hard part. Your reward signal lives at the END of the pipeline. But the prompts you need to fix are buried INSIDE individual agents, which might be separated by several hops of reasoning, tool calls, or intermediate outputs. Most naive approaches just broadcast a single global signal to every agent equally, which means well-performing agents get noisy updates alongside the ones that actually need fixing. CANTANTE decomposes that global reward into per-agent feedback, so each agent gets its own update signal. It does this using contrastive attribution: comparing runs where the outcome was good against runs where it was bad, then scoring each agent’s contribution to the difference. The agents that swung the outcome get stronger signals. The agents that were consistent get left alone.
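To make the contrastive idea concrete, here is a toy sketch in Python. It is not CANTANTE's actual attributer: it assumes each run records which prompt variant every agent used plus a final reward, and it scores a variant by the mean reward of runs that used it minus the mean reward of runs that did not. The function name, agent names, and variant ids are all made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def contrastive_scores(runs):
    """Toy contrastive attribution (illustrative, not the paper's method).

    runs: list of (config, reward), where config maps agent name -> variant id.
    Returns {agent: {variant: mean reward with it - mean reward without it}}.
    """
    agents = runs[0][0].keys()
    scores = {}
    for agent in agents:
        # Group final rewards by the variant this agent used in each run.
        by_variant = defaultdict(list)
        for config, reward in runs:
            by_variant[config[agent]].append(reward)
        scores[agent] = {}
        for variant, rewards in by_variant.items():
            others = [r for v, rs in by_variant.items() if v != variant for r in rs]
            scores[agent][variant] = mean(rewards) - mean(others) if others else 0.0
    return scores


# Hypothetical two-agent pipeline: only the coder's variant changes the outcome.
runs = [
    ({"planner": "p0", "coder": "c0"}, 0.0),
    ({"planner": "p0", "coder": "c1"}, 1.0),
    ({"planner": "p1", "coder": "c0"}, 0.0),
    ({"planner": "p1", "coder": "c1"}, 1.0),
]
scores = contrastive_scores(runs)
```

In this toy data the planner's variants score zero because the outcome never depends on them, while the coder's variants get a strong positive and a strong negative score. That is the "consistent agents get left alone" behavior in miniature.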

How the loop works:

🔄 Propose: local optimizers suggest prompt variations for each agent, drawing on the agent’s reasoning history and the contrastive scores from previous runs
⚙️ Execute: the system runs configs on identical queries and logs full reasoning traces, preserving enough context for the attributer to do its job accurately
🎯 Attribute: a contrastive attributer scores each agent’s contribution to the outcome, comparing successful and failed runs at the step level rather than the pipeline level
📈 Update: per-agent signals feed into CAPO (AutoML 2025) to rewrite instructions algorithmically, producing new prompt candidates that get tested in the next iteration

Each cycle tightens the weakest link rather than randomly perturbing the whole system. Over successive rounds, the prompts move toward versions that more reliably produce the target outcome, without you having to manually read a reasoning trace and guess what the agent was confused about.
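The four steps can be sketched as one round of a loop. This is a deliberately simplified stand-in, not the CANTANTE implementation: attribution here is just "swap one agent's prompt and measure the reward change", and the update is a greedy keep-if-better rule rather than CAPO. `run_round`, `toy_propose`, and `toy_evaluate` are hypothetical names.

```python
def run_round(prompts, propose, evaluate):
    """One Propose -> Execute -> Attribute -> Update cycle (simplified sketch)."""
    baseline = evaluate(prompts)  # Execute: score the current prompts
    candidates, signals = {}, {}
    for agent, prompt in prompts.items():
        candidates[agent] = propose(agent, prompt)     # Propose a variation
        trial = {**prompts, agent: candidates[agent]}  # swap this agent only
        signals[agent] = evaluate(trial) - baseline    # Attribute: reward change
    # Update: keep a candidate only if it improved the end-of-pipeline reward.
    updated = {
        agent: candidates[agent] if signals[agent] > 0 else prompt
        for agent, prompt in prompts.items()
    }
    return updated, signals


# Hypothetical toy task: reward is 1.0 only when the coder is told to emit code only.
def toy_propose(agent, prompt):
    suffix = {"planner": " Be brief.", "coder": " Return only code."}[agent]
    return prompt + suffix


def toy_evaluate(prompts):
    return 1.0 if "only code" in prompts["coder"].lower() else 0.0


start = {"planner": "Plan the solution.", "coder": "Write the function."}
updated, signals = run_round(start, toy_propose, toy_evaluate)
```

Only the coder's prompt changes in this round, because it is the only agent whose candidate moved the reward. A real optimizer would evaluate on many queries and use a far richer attribution signal, but the control flow is the same shape.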

The numbers hold up. Programming tasks (MBPP): +18.9 points over DSPy’s best baseline. Math reasoning (GSM8K): +12.5 points. Inference cost: flat. No ensemble bloat. These are not marginal improvements you would need a microscope to see. A jump of nearly 19 points on MBPP means a meaningful share of the tasks that were failing before are now succeeding. And because the optimized pipeline does not add extra model calls or parallel sampling at inference time, you are not paying a latency or per-query cost penalty for the gains. The system finds efficiency through better instructions, not through brute force.

Pro tip: Use this after your pipeline is already functional. Run it to tighten the agents where failures are subtle and hard to trace manually. It is not a replacement for knowing what your agents should do, but it is a real shortcut for making them do it better. A good starting point is to run CANTANTE on the test cases where the pipeline currently fails (even if that is only 10% of your suite), let it attribute those failures, and look at which agents are getting the strongest update signals. That alone tells you where your weakest links are, even if you decide to fix them by hand. The attribution output is useful information whether or not you use the automated rewriting.
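If you only want the triage, something like the following would do. It assumes you have exported attribution results as plain records with a pass/fail flag and per-agent signal strengths. That record shape is an assumption for illustration, not CANTANTE's output format, and `rank_weak_links` is a hypothetical helper.

```python
from collections import defaultdict


def rank_weak_links(records):
    """Rank agents by mean absolute update signal on failing cases only.

    records: list of dicts with 'passed' (bool) and 'signals' ({agent: float}).
    The record layout is assumed for this sketch.
    """
    strengths = defaultdict(list)
    for record in records:
        if record["passed"]:
            continue  # triage looks at failures only
        for agent, signal in record["signals"].items():
            strengths[agent].append(abs(signal))
    averaged = [(agent, sum(vals) / len(vals)) for agent, vals in strengths.items()]
    return sorted(averaged, key=lambda pair: pair[1], reverse=True)


# Hypothetical attribution export for a retriever -> writer pipeline.
records = [
    {"passed": False, "signals": {"retriever": 0.8, "writer": 0.1}},
    {"passed": False, "signals": {"retriever": 0.6, "writer": 0.3}},
    {"passed": True, "signals": {"retriever": 0.0, "writer": 0.9}},
]
ranked = rank_weak_links(records)
```

The agent at the top of `ranked` is the first place to look, whether you then rewrite its prompt by hand or let the optimizer do it.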

Repo: github.com/finitearth/cantante | Paper: arxiv.org/abs/2605.13295

🔧 If you are still hand-tuning multi-agent prompts, give this a real task and see what it rewrites.

Frequently Asked Questions

Q: How does CANTANTE handle silent failures where upstream agents change their output format?

CANTANTE’s attribution analyzes reasoning traces to surface format or summarization drift from upstream agents. However, it optimizes prompts, not schemas, so you’d still want explicit runtime output validation (Pydantic models, schema checks, etc.) separate from the prompt loop to catch inter-agent contract violations early.
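A minimal sketch of that kind of boundary check, using only the standard library so it runs anywhere. In practice a Pydantic model would replace most of the manual checks. The `Summary` contract and its field names are hypothetical, invented for this example.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Summary:
    """Hypothetical contract between an upstream summarizer and its consumer."""
    title: str
    bullet_points: tuple


def parse_summary(raw: str) -> Summary:
    """Validate an upstream agent's JSON output before passing it downstream."""
    payload = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object")
    missing = {"title", "bullet_points"} - payload.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if not isinstance(payload["title"], str):
        raise ValueError("title must be a string")
    points = payload["bullet_points"]
    if not isinstance(points, list) or not all(isinstance(p, str) for p in points):
        raise ValueError("bullet_points must be a list of strings")
    return Summary(title=payload["title"], bullet_points=tuple(points))


ok = parse_summary('{"title": "Q3 report", "bullet_points": ["revenue up"]}')
```

Running this check between agents means a format drift fails loudly at the boundary instead of showing up as a mysteriously lower reward at the end of the pipeline.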

Q: Does CANTANTE learn output schemas, or just prompt strings?

Just prompts, for now. You can bake schema constraints into prompts and CANTANTE will optimize around them, but it doesn’t learn the schemas themselves. If your agents communicate across brittle boundaries, keeping output validation separate from prompt optimization is probably the better call.

Q: How expensive is the optimization? How many evaluations before it beats hand-tuned prompts?

Dozens of iterations typically, and each can run hundreds of task completions. Inference stays cheap (same as baseline), but the optimization itself costs real compute. Pilot it on an isolated workflow first to validate the gains justify the upfront cost.
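For a rough budget before you pilot, a back-of-the-envelope estimate helps. The figures below (30 iterations, 200 completions each, 4 agents per pipeline) are made-up placeholders, not numbers from the paper, and the estimate ignores any extra calls made by the optimizer or attributer themselves.

```python
def optimization_calls(iterations, completions_per_iteration, agents_per_pipeline):
    """Rough count of agent-level LLM calls spent during optimization.

    Assumes every task completion invokes each agent exactly once.
    """
    return iterations * completions_per_iteration * agents_per_pipeline


# Placeholder figures for illustration only.
estimate = optimization_calls(
    iterations=30, completions_per_iteration=200, agents_per_pipeline=4
)
```

With those placeholder figures the estimate comes to 24,000 agent calls. Multiply by your per-call price to see whether the one-time cost is worth the flat inference bill afterward.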

Stop tuning multi-agent prompts by hand: Learning prompts via system-level credit assignment (CANTANTE)
by u/finitearth in PromptEngineering
