Yesterday a quiet build shipped to PyPI. The routing heuristic it uses doesn’t look at your token count once.
The usual approach is something like “short prompt = cheap model.” It fails constantly. A 200-token request for a code patch with a stack trace is a completely different shape from a 200-token one-liner rename. Same length, totally different cognitive demand. A blanket length rule sends both to the same model, so at least one of them is misrouted. Teams figure this out after a few weeks of API bills that don’t match their expectations, then patch it with arbitrary length thresholds that still get it wrong half the time. The real problem is that length is a proxy for complexity, and a bad one. You’re measuring the envelope, not what’s inside.
Think about what actually varies between prompts. A rename request needs pattern recognition. A code patch needs causal reasoning about what broke and why. A summarization task needs compression and selection. A JSON reformat needs zero inference at all. These are fundamentally different cognitive shapes, and they map to model capability in predictable ways once you stop squinting at token counts and start looking at what the prompt is actually asking the model to do.
What actually works: route on intent + evidence type. Classify the shape of the request first, then pick the model tier. Reasoning-heavy tasks go to frontier. Structured transforms go to Haiku or Mini. And for genuinely trivial shapes? A local 7B handles it. The prompt never leaves your machine. 🔒 This isn’t a new idea in systems design. Load balancers have been routing on request shape for decades. We just forgot to apply the same logic to inference.
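Here is a minimal sketch of “classify the shape first, then pick the tier.” It is not promptrouter’s API. The rule list, the category names, the tier labels, and the functions `classify_intent` and `pick_tier` are all made up for illustration, and a real classifier would be much less naive than keyword matching. The point is only that token count never appears anywhere.

```python
import re

# Hypothetical mapping from intent category to model tier (labels are illustrative).
TIER_BY_INTENT = {
    "code_patch": "frontier",
    "summarize": "mid-tier",
    "reformat": "local",
    "rename": "local",
}

# Toy evidence rules, checked in order. These keywords are assumptions,
# not a tested classifier.
RULES = [
    ("code_patch", re.compile(r"traceback|stack trace|exception|\bfix\b|\bpatch\b", re.I)),
    ("summarize", re.compile(r"\bsummar(y|ize|ise)\b|\btl;dr\b", re.I)),
    ("reformat", re.compile(r"\breformat\b|\bformat this\b|\bconvert\b", re.I)),
    ("rename", re.compile(r"\brename\b", re.I)),
]


def classify_intent(prompt: str) -> str:
    """Return the first matching intent category, or 'unknown'."""
    for intent, pattern in RULES:
        if pattern.search(prompt):
            return intent
    return "unknown"


def pick_tier(prompt: str) -> str:
    """Map a prompt to a tier by intent only; prompt length is never consulted."""
    # Unknown shapes fall back to the most capable tier rather than the cheapest.
    return TIER_BY_INTENT.get(classify_intent(prompt), "frontier")


if __name__ == "__main__":
    for p in ["Format this JSON: {\"a\": 1}", "Fix this crash, stack trace below"]:
        print(p, "->", pick_tier(p))
```

Sending unknown shapes to frontier is a deliberate choice in this sketch: a misclassified hard request costs more in quality than a misclassified easy one costs in dollars.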
A dev built this into a tool called promptrouter after watching their Claude bill climb on queries that were basically “format this JSON.” They weren’t doing anything wrong architecturally. They just had a single model endpoint handling everything, which is the default path of least resistance. Most teams ship it that way. The bill grows, someone notices, and then they start guessing at fixes instead of measuring the actual distribution of request shapes hitting their stack.
How to Try It
- 🔧 Install: pip install promptrouter
- Define intent categories for your stack (code patch, summarize, reformat, rename…). Start narrow. Four to six categories is enough to cover most of the variance in a typical product. You can refine later once you see how the classifier performs against your real traffic.
- Map each category to a model tier: frontier / mid-tier / local. Be honest about where frontier actually earns its cost. “Ambiguous edge case that requires multi-step reasoning” is frontier territory. “Convert this date string to ISO format” is not.
- 📋 Run a 100-sample eval across Haiku, Sonnet, and Opus to validate your tier assignments before going live. Pull from your real request logs, not synthetic examples. Synthetic evals flatter your routing setup because they’re cleaner than production traffic.
- 🎯 Dispatch only after the classifier clears the shape of the request. The classification step is cheap. Running it on every request before routing adds single-digit milliseconds and pays for itself on the first misrouted frontier call you prevent.
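The eval step above can be sketched roughly like this. Again, this is not promptrouter’s code. `assign_tiers`, the `passes` callback, and the 0.95 threshold are assumptions for illustration. In practice `passes` would call the actual model for that tier and grade the output, and the samples would come from your real request logs.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Sequence, Tuple


def assign_tiers(
    samples: Iterable[Tuple[str, str]],          # (category, prompt) pairs from logs
    passes: Callable[[str, str, str], bool],     # hypothetical grader: (tier, category, prompt) -> ok?
    tiers: Sequence[str] = ("local", "mid-tier", "frontier"),  # cheapest first
    threshold: float = 0.95,                     # assumed pass-rate bar, tune for your product
) -> Dict[str, str]:
    """Pick the cheapest tier per category whose pass rate clears the threshold."""
    totals: Dict[str, int] = defaultdict(int)
    wins: Dict[Tuple[str, str], int] = defaultdict(int)

    for category, prompt in samples:
        totals[category] += 1
        for tier in tiers:
            if passes(tier, category, prompt):
                wins[(category, tier)] += 1

    assignment: Dict[str, str] = {}
    for category, n in totals.items():
        # Default to the most capable tier if nothing cheaper clears the bar.
        assignment[category] = tiers[-1]
        for tier in tiers:
            if wins[(category, tier)] / n >= threshold:
                assignment[category] = tier
                break
    return assignment
```

You could pass `tiers=("haiku", "sonnet", "opus")` to mirror the eval described above. The only requirement is that the sequence is ordered cheapest first.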
Pro Tips
- Intent routing cut API latency in half, not just cost, for teams who tested it. Mid-tier models return faster. If your product is latency-sensitive, that matters as much as the bill.
- Local models for trivial shapes means zero data leaving your infra. Cost win and privacy win at once. If you’re in a regulated industry or handling anything sensitive, the local path also simplifies your compliance story significantly.
- Haiku and Mini crush more “hard” tasks than you’d expect once you actually run the eval numbers. The instinct is to over-provision. Run the eval before you trust that instinct. A lot of what feels like a frontier task is actually a mid-tier task that just sounds important.
- Log your classifier confidence scores from the start. When the classifier is uncertain about a shape, that’s signal. A cluster of low-confidence requests usually means you’re missing a category that your real traffic has already invented for you.
- Revisit tier assignments every 30 days. Model capabilities shift with updates, and a task that needed Sonnet six months ago might run cleanly on Haiku today.
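For the confidence-logging tip, something like the following is enough to get started. It is a sketch under assumptions: `ConfidenceLog` is a hypothetical helper, the 0.6 cutoff is arbitrary, and it assumes your classifier returns a confidence score between 0 and 1 alongside the intent.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ConfidenceLog:
    """In-memory log of (prompt, intent, confidence) records. Illustrative only."""
    low_cutoff: float = 0.6  # assumed cutoff for "uncertain", tune against your traffic
    records: List[Tuple[str, str, float]] = field(default_factory=list)

    def record(self, prompt: str, intent: str, confidence: float) -> None:
        self.records.append((prompt, intent, confidence))

    def low_confidence(self) -> List[Tuple[str, str, float]]:
        """All records the classifier was unsure about."""
        return [r for r in self.records if r[2] < self.low_cutoff]

    def low_confidence_by_intent(self) -> Counter:
        """Count uncertain requests per predicted intent."""
        return Counter(intent for _, intent, _ in self.low_confidence())


if __name__ == "__main__":
    log = ConfidenceLog()
    log.record("Format this JSON", "reformat", 0.97)
    log.record("Turn these notes into a changelog", "summarize", 0.41)
    log.record("Draft release notes from these commits", "summarize", 0.38)
    print(log.low_confidence_by_intent())
```

If one predicted intent keeps piling up low-confidence requests, read those prompts by hand. That pile is a likely candidate for a missing category.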
If you’re paying Sonnet prices to reformat JSON, that’s not an AI problem. It’s a routing problem. The frontier budget is finite. Spend it where reasoning actually changes the output quality: on the code patches, ambiguous edge cases, and multi-step plans that smaller models genuinely fumble. Everything else is overhead you chose to pay. 🚀
Start here: pip install promptrouter and poke at the heuristics.
Most prompts don’t need a frontier model — the hard part is deciding which do
by u/Prestigious-Cat2730 in PromptEngineering