Yesterday a developer called Sleeplesshan published token-router on r/PromptEngineering, a hybrid architecture that cuts cloud input tokens by 99% when working with massive files in Claude Code or Codex. Benchmarks show a 2,000-line infrastructure log going from 41,711 tokens to 131. Latency from 71 seconds to 5. The repo surfaced quietly but the numbers hit hard enough that it climbed to the top of the thread within hours, with engineers sharing their own token horror stories in the comments.
The obvious move when context gets too big is to summarize with a lightweight model. Let Gemma 4 2B chew through the file and hand a digest to Claude. Cheaper, faster, done. And honestly, on paper it sounds reasonable; small models have gotten surprisingly good at general summarization, and for prose-heavy content like meeting transcripts or product docs, that instinct usually holds up.
That approach is a trap the moment you point it at code. A 2B model summarizing a codebase will hallucinate stack traces, drop critical indentation, miss the one variable that breaks everything. You save tokens. You lose the answer. The community already bumped into this wall; one commenter noted hitting the exact same issue with Gemma 2B hallucinating stack frames last month. Another described spending two hours debugging a “fix” that turned out to be based on a function signature the small model had quietly invented. The failure mode is insidious because the output looks confident and plausible right up until it doesn’t compile.
The actual solution: don’t let the small model read the code at all. Use it as a coordinate router only. The semantic understanding stays with the cloud model. The local model just points.
How the routing works 🔧
- Feed the full file + your query to local Gemma 4 2B via Ollama. The prompt enforces a rigid JSON schema and zero conversational output. Gemma’s only job: return line numbers, as in {"targets": [{"start_line": 1536, "end_line": 1550}]}, nothing else. The schema constraint is what makes this reliable. Strip the model’s ability to narrate and it stops hallucinating. It’s pattern-matching against token positions, not reasoning about logic.
- A Python slicer extracts those exact raw lines from disk. Deterministic. No model touch. The text is pristine. This is the architectural insight that makes everything else work; the extraction step has no intelligence in it, which means it has no failure modes beyond a bad line number. You can audit it, test it, trust it.
- Claude Code or Codex receives the raw slices plus a structural map (function and class outlines for broader context). It sees real code, just the part that matters. You’re not asking Claude to work with a summary or an abstraction. You’re handing it the actual bytes from the file, surgically selected. The cloud model gets to be the smart one. It just gets there faster and cheaper because someone already did the navigation.
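The slicing step is simple enough to sketch in a few lines. This is a minimal illustration, not the repo's actual slicer: the function names `parse_targets` and `slice_lines` and the header format are assumptions, and only the `targets` / `start_line` / `end_line` keys come from the JSON shape described above.

```python
import json
from pathlib import Path


def parse_targets(router_output: str) -> list[tuple[int, int]]:
    """Parse the router's JSON and return validated (start, end) line pairs."""
    data = json.loads(router_output)
    ranges = []
    for target in data["targets"]:
        start, end = int(target["start_line"]), int(target["end_line"])
        if start < 1 or end < start:
            raise ValueError(f"invalid range: {start}-{end}")
        ranges.append((start, end))
    return ranges


def slice_lines(path: str, ranges: list[tuple[int, int]]) -> str:
    """Return the raw lines for each range, unmodified, under a small header."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    chunks = []
    for start, end in ranges:
        if start > len(lines):
            raise ValueError(f"range starts past end of file: {start}")
        end = min(end, len(lines))  # clamp ranges that run past the last line
        body = "\n".join(lines[start - 1:end])  # 1-indexed, inclusive
        chunks.append(f"# {path} lines {start}-{end}\n{body}")
    return "\n\n".join(chunks)
```

The only thing that can go wrong here is a bad line number, and the validation makes that failure loud instead of silent.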
Two pro tips worth stealing
Set OLLAMA_KEEP_ALIVE=0s. Gemma unloads from VRAM the instant the JSON coordinates land. Zero background footprint while your IDE is running alongside it. On a developer machine with 16GB of unified memory this matters; you do not want a 2B model sitting resident while Claude Code is holding its own context in a separate window. The routing call takes under a second; there’s no reason to keep the model warm between queries.
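If you would rather not set the variable globally, Ollama's HTTP API also accepts a per-request `keep_alive`, alongside a `format` field that can carry a JSON schema. Here is a rough sketch of a routing request built that way. The schema below is my own guess at the `targets` shape, the helper names are hypothetical, and the model tag is left to the caller because the exact Ollama tag is not given in the post.

```python
import json
import urllib.request

# Assumed schema mirroring the {"targets": [...]} shape; not taken from the repo.
TARGETS_SCHEMA = {
    "type": "object",
    "properties": {
        "targets": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "start_line": {"type": "integer"},
                    "end_line": {"type": "integer"},
                },
                "required": ["start_line", "end_line"],
            },
        }
    },
    "required": ["targets"],
}


def build_router_request(model: str, prompt: str) -> dict:
    """Build an /api/generate payload that unloads the model after the call."""
    return {
        "model": model,            # caller supplies the local model tag
        "prompt": prompt,
        "stream": False,           # one JSON response instead of a token stream
        "format": TARGETS_SCHEMA,  # constrain output to the coordinate schema
        "keep_alive": 0,           # free VRAM as soon as the response is returned
        "options": {"temperature": 0},
    }


def call_router(payload: dict, host: str = "http://localhost:11434") -> str:
    """POST the payload to a local Ollama server and return the raw model text."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

The per-request setting is handy when other local tools share the same Ollama server and you only want the router model evicted.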
Add a reverse context expansion guardrail to your cloud prompt. Explicitly tell Claude: if a dependency or variable declaration is missing from the slice, request a wider line range via the router before generating a solution. This closes the one real gap in the architecture: the case where the slice lands mid-function and the relevant import lives 200 lines up. Without this guardrail, Claude will sometimes make a reasonable guess about what the missing piece is, and a reasonable guess in a production codebase is exactly the kind of subtle bug that survives code review. Build the escape hatch in from the start.
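One way to wire this up is to give the cloud model a fixed phrase to emit when it needs more context, then detect that phrase before accepting an answer. Everything in this sketch is hypothetical: the guardrail wording, the `NEED_LINES` marker, and the `requested_range` helper are illustrations, not the repo's prompt or protocol.

```python
import re
from typing import Optional

# Hypothetical guardrail wording; adapt it to your own cloud prompt.
GUARDRAIL = (
    "You are seeing selected line ranges, not the whole file. "
    "If a dependency, import, or variable declaration you need is not in the "
    "slices, do not guess. Reply with exactly one line of the form "
    "NEED_LINES <start>-<end> and nothing else."
)

# Matches the made-up marker on a line of its own.
_NEED = re.compile(r"^NEED_LINES (\d+)-(\d+)\s*$", re.MULTILINE)


def requested_range(reply: str) -> Optional[tuple[int, int]]:
    """Return the (start, end) range the model asked for, or None if it answered."""
    match = _NEED.search(reply)
    if match is None:
        return None
    start, end = int(match.group(1)), int(match.group(2))
    return (start, end) if 1 <= start <= end else None
```

When `requested_range` returns a range, re-slice the file with it and resend; when it returns None, treat the reply as the actual answer.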
One more thing worth noting: the JSON schema prompt for Gemma is strict by design but you can tune the target window size. The default slice is 15 lines. For sprawling class definitions or deeply nested conditionals, bumping it to 30-40 lines costs almost nothing on the routing side and gives Claude enough breathing room to understand structure instead of just syntax.
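If you would rather enforce a minimum window in the slicer than in the prompt, a small padding helper does the job. This is a sketch under my own assumptions: `widen` and its parameters are hypothetical names, and the repo may handle window size differently.

```python
def widen(start: int, end: int, min_window: int, total_lines: int) -> tuple[int, int]:
    """Grow (start, end) to at least min_window lines, clamped to the file bounds."""
    size = end - start + 1
    if size >= min_window:
        return max(1, start), min(end, total_lines)
    extra = min_window - size
    new_start = max(1, start - extra // 2)  # pad roughly evenly on both sides
    new_end = min(total_lines, new_start + min_window - 1)
    # If the window hit the end of the file, pull the start back to keep its size.
    new_start = max(1, new_end - min_window + 1)
    return new_start, new_end
```

For a 2,000-line file, `widen(1536, 1550, 30, 2000)` turns the 15-line target into lines 1529 to 1558.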
Worth trying 🚀
The repo includes a full regression test harness (run_router_tests.py) to verify routing stability as prompts evolve. If you’re running Claude Code sessions on legacy codebases and watching the token bill climb, this is a clean, low-risk pattern to wire in. The setup is a few hours of work. The payoff compounds on every session after that. Legacy infrastructure logs, monolithic service files, sprawling config trees: anything that currently forces you to split context manually or summarize aggressively becomes manageable without sacrificing the quality of the answer on the other end.
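I have not reproduced the contents of run_router_tests.py here, but the idea of a routing regression check is easy to illustrate. In this hypothetical sketch, each case names a line the router must cover for a given query, and `route` is whatever function you use to call the local model.

```python
from typing import Callable

Range = tuple[int, int]
Case = tuple[str, str, int]  # (file path, query, line that must be covered)


def covers(targets: list[Range], must_include: int) -> bool:
    """True if any returned range contains the line we know is relevant."""
    return any(start <= must_include <= end for start, end in targets)


def run_cases(route: Callable[[str, str], list[Range]], cases: list[Case]) -> list[Case]:
    """Run every case through the router and return the ones that missed."""
    failures = []
    for path, query, must_include in cases:
        if not covers(route(path, query), must_include):
            failures.append((path, query, must_include))
    return failures
```

Rerun the cases whenever the router prompt changes; an empty list means the coordinates still land where they should.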
GitHub: github.com/sleeplesshan/token-router
Lossless Context Snipping: A Hybrid Prompt Routing Pattern for Claude Code & Codex that Cuts Input Tokens by 99% using Local Gemma 4 2B
by u/Sleeplesshan in PromptEngineering