Smart Prompt Routing: Why 60% of LLM Calls Waste Money

Sending every prompt to a frontier model for optimization actually makes some outputs worse.

Not occasionally. Consistently. The LLM adds complexity where none is needed, burns latency on a call that shouldn’t have happened, and you’re paying frontier prices for work a pattern-matching rule could’ve done in under 10 milliseconds. The model might rephrase a structured data prompt into something more “natural,” which is the last thing you want when the downstream system expects strict field names in a specific order.

One builder in r/PromptEngineering shared their fix: a 3-tier routing system that figures out exactly how much optimization firepower a prompt actually needs before touching an LLM at all.

The Default Approach vs. What This Does

The standard move: grab any prompt, hit it with “improve this” as a system message, return the result. Simple, fast to ship, and wasteful at scale.

According to the builder’s own data, 40-60% of prompts don’t need LLM optimization. A Terraform prompt benefits more from IaC-specific structure rules than from a model rewriting it creatively. A JSON conversion prompt needs exact field preservation, not an LLM improvising. Same goes for boilerplate SQL generation, API parameter formatting, or any prompt where the output needs to match a rigid schema. These prompts aren’t complex. They’re just phrased badly, and a deterministic rule can fix that in milliseconds.

The router scores each prompt first, then routes it to the cheapest method that actually handles it well. That shift alone changes the economics of running a prompt pipeline at volume.

How the Three Tiers Work 🔀

Tier 1: Rules-based (under 10ms, zero LLM calls)
Pattern-matching applies known transformations based on detected context. Fast, deterministic, free. Routes here when the composite score is 0.40 or below. Think structured output prompts, templated requests, or anything where the intent is unambiguous and the fix is a known substitution. A prompt asking to “convert this CSV to JSON” has one right answer. A rule handles it better than a model that might decide to add helpful commentary.
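A minimal sketch of what a Tier 1 rule could look like, assuming a small table of regex patterns paired with fixed instructions to append. The patterns, the instruction wording, and the `apply_rules` name are hypothetical illustrations, not the builder’s actual rule set.

```python
import re

# Hypothetical rule table: (pattern that detects the context, instruction to append).
RULES = [
    (re.compile(r"\bcsv\b.*\bjson\b", re.IGNORECASE | re.DOTALL),
     "Preserve every column name exactly as a JSON key. Output only JSON, no commentary."),
    (re.compile(r"\bterraform\b", re.IGNORECASE),
     "Output a single HCL block. Pin provider versions. No prose outside code."),
]


def apply_rules(prompt: str) -> tuple[str, bool]:
    """Return (rewritten prompt, whether any rule matched). No LLM call is made."""
    for pattern, instruction in RULES:
        if pattern.search(prompt):
            return f"{prompt.strip()}\n\n{instruction}", True
    return prompt, False


rewritten, matched = apply_rules("convert this CSV to JSON")
```

The first matching rule wins and unmatched prompts pass through unchanged, so the function is deterministic and costs only a few regex searches.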

Tier 2: Hybrid (rules first, then one targeted LLM call)
Rules do the heavy lifting; the model handles the ambiguous parts. Routes here when the score falls above 0.40 and below 0.85. This is where most of the interesting work happens: a code generation prompt that’s mostly clear but has one underspecified constraint, or a prompt that’s well-structured but needs the tone adjusted for a specific audience. One focused LLM call is cheaper and often more accurate than a full rewrite.
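A sketch of the hybrid flow, assuming the model client is passed in as a plain callable so the example runs without any provider SDK. `normalize`, `hybrid_optimize`, the instruction wording, and the stubbed model reply are all hypothetical. How the ambiguous fragment gets identified is not described in the source, so here the caller supplies it.

```python
from typing import Callable


def normalize(prompt: str) -> str:
    """Deterministic cleanup (stand-in for the rules pass): collapse whitespace."""
    return " ".join(prompt.split())


def hybrid_optimize(prompt: str, ambiguous_part: str,
                    llm: Callable[[str], str]) -> str:
    """Rules first, then at most one LLM call scoped to the ambiguous fragment."""
    cleaned = normalize(prompt)
    fragment = normalize(ambiguous_part)
    if fragment not in cleaned:
        return cleaned  # nothing to clarify, so skip the LLM entirely
    clarified = llm(f"Rewrite only this constraint so it is unambiguous: {fragment}")
    return cleaned.replace(fragment, clarified, 1)


# Usage with a stub in place of a real model client.
def fake_llm(request: str) -> str:
    return "that responds in under 200 ms at p95"


result = hybrid_optimize("Write a  handler that is fast", "that is fast", fake_llm)
```

Only the underspecified fragment is sent to the model, which keeps the call small and leaves the rest of the prompt exactly as the rules pass produced it.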

Tier 3: Full LLM (complete rewrite, highest cost)
Reserved for complex, high-stakes prompts where a full model call is genuinely justified. Routes here at 0.85 or above. Multi-step reasoning prompts, open-ended creative briefs, or anything where the gap between the raw prompt and what the model needs to understand is wide enough that rules can’t close it.

The Scoring Formula

composite = (context_weight × 0.5) + (sophistication × 0.3) + (load_factor × 0.2)
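The formula translates directly into a small function. This sketch assumes each input is already normalized to the range 0 to 1 and that higher values push toward the full-LLM tier, which means `load_factor` would have to shrink as system load rises; the source does not spell out either point, so treat both as assumptions.

```python
def composite_score(context_weight: float, sophistication: float,
                    load_factor: float) -> float:
    """Weighted sum from the formula above; inputs assumed to lie in [0, 1]."""
    for name, value in (("context_weight", context_weight),
                        ("sophistication", sophistication),
                        ("load_factor", load_factor)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return context_weight * 0.5 + sophistication * 0.3 + load_factor * 0.2


# 0.9*0.5 + 0.8*0.3 + 0.5*0.2 = 0.45 + 0.24 + 0.10 = 0.79 (up to float rounding)
score = composite_score(0.9, 0.8, 0.5)
```

Because the weights sum to 1.0, the composite stays in the same 0 to 1 range as its inputs, which is what lets the tier thresholds be fixed numbers.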

🎯 Context weight (50%): How confident is the detector about what type of prompt this is? High-confidence image generation goes toward LLM. High-confidence structured output goes toward rules. The detector runs on fast heuristics: keyword patterns, token ratios, schema markers. When it’s confident, that confidence carries the most weight in the final score.
Sophistication (30%): Prompt complexity. “Generate hello world” scores low. “Design a multi-region failover with RPO constraints” scores high. The system looks at token count, nested instructions, conditional logic, and domain-specific terminology to estimate how much cognitive load the prompt is actually asking for.
Load factor (20%): Under heavy system load, the router pushes toward rules and hybrid even for prompts that might otherwise qualify for full LLM. This keeps the system responsive during traffic spikes without degrading output quality for the prompts that matter most. It’s a small weight, but it prevents the 25% of prompts that need Tier 3 from creating a bottleneck that slows down the 75% that don’t.
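One way the sophistication signal could be estimated from the cues listed above. The sub-weights, the 200-word cap, and the keyword lists are assumptions picked for illustration; the source does not publish them, and this sketch leaves out nested-instruction detection entirely.

```python
import re

CONDITIONALS = re.compile(r"\b(if|unless|when|otherwise|except)\b", re.IGNORECASE)
# Hypothetical domain vocabulary; a real detector would use a much larger list.
DOMAIN_TERMS = {"failover", "rpo", "rto", "multi-region", "idempotent", "sharding"}


def sophistication(prompt: str) -> float:
    """Crude 0..1 complexity estimate from length, conditionals, and jargon."""
    words = prompt.lower().split()
    length_part = min(len(words) / 200, 1.0)          # whitespace tokens, capped
    cond_part = min(len(CONDITIONALS.findall(prompt)) / 3, 1.0)
    domain_hits = sum(1 for w in words if w.strip(".,;:()") in DOMAIN_TERMS)
    domain_part = min(domain_hits / 3, 1.0)
    return round(0.3 * length_part + 0.3 * cond_part + 0.4 * domain_part, 3)


low = sophistication("Generate hello world")
high = sophistication("Design a multi-region failover with RPO constraints")
```

With these made-up weights the hello-world prompt scores near zero and the failover prompt lands noticeably higher, matching the ordering the article describes.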

One more guardrail: if context detection confidence drops below 0.6, the system defaults to Tier 1 regardless of other scores. Don’t apply sophisticated optimization to a prompt you can’t confidently categorize.
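Putting the thresholds and the guardrail together, the routing decision might look like the sketch below. The `Tier` enum, the constant names, and the `route` function are hypothetical; the numeric cutoffs are the ones quoted in this article.

```python
from enum import Enum


class Tier(Enum):
    RULES = 1
    HYBRID = 2
    FULL_LLM = 3


RULES_MAX = 0.40       # composite at or below this goes to Tier 1
FULL_LLM_MIN = 0.85    # composite at or above this goes to Tier 3
MIN_CONFIDENCE = 0.6   # below this, the guardrail forces Tier 1


def route(composite: float, detection_confidence: float) -> Tier:
    """Pick a tier from the composite score, with the low-confidence guardrail."""
    if detection_confidence < MIN_CONFIDENCE:
        return Tier.RULES          # can't categorize it, so don't over-optimize
    if composite <= RULES_MAX:
        return Tier.RULES
    if composite >= FULL_LLM_MIN:
        return Tier.FULL_LLM
    return Tier.HYBRID


guarded = route(composite=0.95, detection_confidence=0.5)
```

Checking confidence first means a high composite score can never override the guardrail: the 0.95 prompt above still goes to the rules tier.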

What the Numbers Look Like 📊

For a typical workload, the distribution shakes out like this:

40% of prompts hit Tier 1: fast, free, done
35% hit Tier 2: one targeted LLM call
25% hit Tier 3: full optimization

Compared to calling GPT-4 on everything: roughly 75% fewer full LLM calls, under 10ms for the rules tier versus 1-3 seconds for a model call, and cost savings that compound fast at scale. At 10,000 prompts per day, that’s 7,500 full rewrites you’re not paying frontier prices for, though the 3,500 Tier 2 prompts among them still make one targeted call each. At 100,000, the math becomes hard to ignore. The latency gains matter too, especially in user-facing applications where every added second affects retention.
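The arithmetic behind those figures, as a quick sketch. It uses the 40/35/25 split quoted above; `daily_calls` and the dictionary keys are hypothetical names, and no per-call prices are assumed because the source gives none.

```python
def daily_calls(prompts_per_day: int, tier_share: dict[str, float]) -> dict[str, int]:
    """Split a day's prompts across tiers using the quoted distribution."""
    return {tier: round(prompts_per_day * share) for tier, share in tier_share.items()}


SHARE = {"tier1": 0.40, "tier2": 0.35, "tier3": 0.25}

calls = daily_calls(10_000, SHARE)
full_rewrites_avoided = calls["tier1"] + calls["tier2"]   # 4000 + 3500 = 7500
zero_llm_prompts = calls["tier1"]                          # 4000 never touch a model
```

The split matters when estimating savings: Tier 2 prompts avoid the full rewrite but still make one targeted call, so only the Tier 1 share drops to zero LLM cost.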

The tradeoff is real. You have to build the detector and routing logic upfront. That’s real engineering time, and you’ll need enough prompt data to validate that your tier thresholds are calibrated correctly for your specific workload. But once it’s in place, the system is model-agnostic. Swap in Claude, GPT-4.1, Gemini 2.5 Flash, DeepSeek, whatever you’re using. The routing decision stays cheap; the LLM calls stay rare and justified.

Try It

The tool is called Prompt Optimizer. It’s MCP-native with a free tier available. Connect it to your existing prompt pipeline, let the router classify a few thousand prompts, and look at where your Tier 3 traffic is actually coming from. Most teams find a small handful of prompt types driving the majority of their LLM spend. Fix those first. If you’re running any kind of prompt pipeline at volume, it’s worth seeing how much of your current LLM spend is actually pulling its weight.

Frequently Asked Questions

Q: Is this routing system scalable for production use?

Absolutely. The rules-based tier runs in under 10ms with zero LLM overhead, so it handles high traffic easily. The load factor automatically shifts prompts toward cheaper tiers when your system is busy, which keeps the full-LLM tier from becoming a bottleneck.

Q: When should I use this instead of just calling GPT-5 every time?

If you’re dealing with mixed-complexity prompts, this system saves real money and latency. The post highlights that 40-60% of prompts don’t actually need the frontier model; Tier 1 handles them in milliseconds. For purely simple prompts or when cost is no object, just calling GPT-5 is simpler.

Q: How do I adjust the routing for my specific use case?

The composite score weights context (50%), sophistication (30%), and load (20%). Tier 1 takes scores at or below 0.40 and Tier 3 takes scores at or above 0.85, so raising those thresholds favors the rules-based and hybrid tiers: do that if you’re optimizing for speed and cost. If quality is the priority, lower the thresholds and let more prompts hit the full LLM optimization. Test with your actual prompts to find the right balance.
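Before changing thresholds in production, it may help to replay logged composite scores against candidate cutoffs and compare the resulting tier mix. The sketch below does that with a made-up sample; `tier_mix` and the score list are hypothetical, and real scores would come from your own router logs.

```python
from collections import Counter


def tier_mix(scores: list[float], rules_max: float, full_llm_min: float) -> Counter:
    """Count how many scores land in each tier for a candidate threshold pair."""
    mix = Counter({"tier1": 0, "tier2": 0, "tier3": 0})
    for s in scores:
        if s <= rules_max:
            mix["tier1"] += 1
        elif s >= full_llm_min:
            mix["tier3"] += 1
        else:
            mix["tier2"] += 1
    return mix


# Made-up composite scores standing in for a logged sample of real prompts.
sample = [0.12, 0.35, 0.42, 0.48, 0.55, 0.71, 0.80, 0.86, 0.90, 0.95]

default_mix = tier_mix(sample, rules_max=0.40, full_llm_min=0.85)
shifted_mix = tier_mix(sample, rules_max=0.50, full_llm_min=0.90)
```

On this sample, moving the cutoffs from 0.40/0.85 to 0.50/0.90 changes the Tier 1/2/3 counts from 2/5/3 to 4/4/2, which shows how far a threshold change moves traffic between tiers.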

Source: “Building a 3-tier routing system for prompt optimization instead of just calling GPT-5 every time” by u/Parking-Kangaroo-63 in r/PromptEngineering