Optimize AI Agents: Cut Tools & Boost LLM Performance

Six months of building agent prompts. Five assumptions that turned out to be completely wrong. One pattern sitting underneath all of them.

Most people treat MCP tool count like a feature: the more integrations, the more capable the agent. So they connect everything, wire up every server they might ever need, and wonder why the agent keeps botching simple tasks. We’re talking embarrassing failures: the agent calling a database migration tool when the user asked for a calendar invite, or surfacing a Slack search result when the task was clearly about writing a document. Not edge cases. Repeated, reproducible mistakes on requests that should take three seconds.

Here’s what’s actually happening. Every tool you connect pushes its full description into context on every single turn, whether that tool gets used or not. The model reads all of it before doing anything. You’re not upgrading your agent. You’re burying it under a menu so long it can’t find the kitchen.
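
To put a rough number on that, here is a back-of-the-envelope sketch. The 4-characters-per-token ratio is only a rule of thumb (a real tokenizer will count differently), and the 80-tool catalog with 400-character descriptions is a made-up example, not a measurement.

```python
# Rough estimate of the context overhead added by tool descriptions.
# Assumption: ~4 characters per token, a rule of thumb for English text.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Approximate a token count from character length."""
    return max(1, len(text) // CHARS_PER_TOKEN)


def overhead_per_conversation(descriptions: list[str], turns: int) -> int:
    """Tokens spent re-sending every tool description on every turn."""
    per_turn = sum(estimate_tokens(d) for d in descriptions)
    return per_turn * turns


# Hypothetical catalog: 80 tools, each with a 400-character description.
catalog = ["x" * 400] * 80

# 80 tools * 100 tokens each * 10 turns = 80000 tokens of pure menu.
print(overhead_per_conversation(catalog, turns=10))
```

Under those assumptions, a ten-turn conversation spends about 80,000 tokens on tool descriptions before any of them is called.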

The old model vs. what’s actually true

Old mental model: connect everything, let the model sort it out. Bigger context window means more tools fit. More explicit prompts fix bad tool selection. Semantic embeddings are obviously right for routing. Any gateway layer means another service to run and maintain.

Reality: every single one of those is wrong.

A small local model went from basically unusable to genuinely functional on a 100-tool catalog. Same model. Same weights. The only change was ranking the catalog down to the relevant tools before the model sees them. The model was never the bottleneck. The menu was too long. Think about how a human expert works: give a surgeon a room full of every possible medical device and they slow down and second-guess. Give them a tray with five relevant tools and they move with precision. The model behaves the same way. Constraint clarifies.

The context window size argument is especially seductive because it sounds like progress. But fitting more tools into context isn’t the same as the model using them correctly. A 128k context window with 80 tool descriptions still forces the model to do selection across a massive, noisy decision space before it even starts on your actual task. You’re front-loading cognitive load onto every single request.

Five fixes worth making now 🔧

⚡ Cut connected tools aggressively. Every MCP server you connect sits in context every turn. If a tool won’t get used in this conversation, it shouldn’t be visible. Disconnect it. A reasonable starting target: no more than 10 to 15 visible tools at once. If your use case genuinely needs more, that’s a signal to build a routing layer, not to expand the visible set.
✍️ Rewrite tool descriptions like prompts, not docs. You pay tokens for every word, every turn. One verb-led sentence per tool. One engineer had a single tool description longer than his entire system prompt, and most of it was marketing copy the author shipped with the package. Things like “enterprise-grade integration layer with robust error handling and full observability.” Cutting it down to “searches the customer database by email or account ID” was the highest-leverage hour he spent all quarter. Names the action, names the input, done.
📊 Try BM25 before reaching for embeddings. For tool routing specifically, keyword ranking can beat semantic search. Tool names and descriptions are short structured strings, not paragraphs. BM25 needs no embedding API, runs completely offline, and outperformed embeddings in every test this engineer ran. It’s the opposite of the document-RAG default, where semantic search shines. Short strings with specific nouns reward exact matching. Use the right tool for the shape of the data.
Don’t write longer prompts to fix selection errors. When the model picks the wrong tool, the instinct is to add more explicit instructions. “Always use the calendar tool for scheduling requests. Do not use the email tool unless explicitly asked.” That’s almost always the wrong fix. Reduce the number of visible tools first. Selection accuracy jumps without touching the prompt at all, because you’ve removed the competing options that were confusing the model in the first place.
A routing layer doesn’t need its own service. You can run catalog search in-process. One tool to search the catalog, one tool to invoke the result. The model sees two tools instead of two hundred. No extra container, no extra port, nothing paging you at 2am. Two functions, one file, ships in an afternoon.
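
The last two fixes can be sketched together: an in-process catalog ranked with BM25, exposed to the model as just two tools. This is a minimal illustration, not the original author's code. The catalog entries, handler lambdas, and the names `search_catalog` and `invoke_tool` are hypothetical, and the scoring uses one common BM25 variant with the usual default parameters (k1 = 1.5, b = 0.75).

```python
import math
import re
from collections import Counter
from typing import Callable

# Hypothetical catalog: tool name -> (one-line description, handler).
CATALOG: dict[str, tuple[str, Callable[..., str]]] = {
    "create_calendar_event": (
        "creates a calendar event from a title and start time",
        lambda **kw: f"event created: {kw}",
    ),
    "search_customers": (
        "searches the customer database by email or account ID",
        lambda **kw: f"customers matching: {kw}",
    ),
    "run_db_migration": (
        "runs a pending database schema migration",
        lambda **kw: "migration complete",
    ),
}


def tokenize(text: str) -> list[str]:
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


# Index each tool as the tokens of its name plus its description.
DOCS = {name: tokenize(f"{name} {desc}") for name, (desc, _) in CATALOG.items()}
AVG_LEN = sum(len(doc) for doc in DOCS.values()) / len(DOCS)


def bm25_score(terms: list[str], doc: list[str], k1: float = 1.5, b: float = 0.75) -> float:
    """Score one tokenized tool entry against the query terms."""
    tf = Counter(doc)
    score = 0.0
    for term in terms:
        if term not in tf:
            continue
        n = sum(1 for d in DOCS.values() if term in d)  # entries containing term
        idf = math.log((len(DOCS) - n + 0.5) / (n + 0.5) + 1)
        norm = tf[term] + k1 * (1 - b + b * len(doc) / AVG_LEN)
        score += idf * tf[term] * (k1 + 1) / norm
    return score


def search_catalog(query: str, top_k: int = 3) -> list[dict]:
    """Tool 1: return the best-matching tools, dropping zero-score entries."""
    terms = tokenize(query)
    scored = [(bm25_score(terms, doc), name) for name, doc in DOCS.items()]
    ranked = sorted((pair for pair in scored if pair[0] > 0), reverse=True)[:top_k]
    return [{"name": name, "description": CATALOG[name][0]} for _, name in ranked]


def invoke_tool(name: str, **arguments) -> str:
    """Tool 2: run a tool the model picked from the search results."""
    if name not in CATALOG:
        raise KeyError(f"unknown tool: {name}")
    return CATALOG[name][1](**arguments)


print(search_catalog("schedule a calendar invite"))
print(invoke_tool("run_db_migration"))
```

Only `search_catalog` and `invoke_tool` would be registered with the model; everything in `CATALOG` stays out of context until a search surfaces it.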

The thing all five have in common

Every fix on that list is subtraction, not addition. Fewer visible tools. Shorter descriptions. A smaller menu. The agent doesn’t need a smarter model or a longer prompt. It needs less noise in the way. This runs counter to every instinct you’ve built from years of adding features to software. More functionality usually means more code. Here it means less context, fewer options, tighter scope.

If you’re building with MCP servers, audit your connected tools before writing another line of prompt engineering. Disconnect anything that won’t get used this session. Rewrite every description down to one sentence. Then see what breaks.
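
That audit can be partly automated. The sketch below assumes you can dump your connected tools into a plain name-to-description mapping; the 20-word budget, the sentence-splitting heuristic, and the sample tools are all assumptions for illustration. The 15-tool ceiling is the upper end of the 10 to 15 target above.

```python
import re

MAX_WORDS = 20          # assumed budget for a one-sentence description
MAX_VISIBLE_TOOLS = 15  # upper end of the 10-15 starting target


def audit_tools(tools: dict[str, str]) -> list[str]:
    """Return warnings for a mapping of tool name -> description."""
    warnings = []
    if len(tools) > MAX_VISIBLE_TOOLS:
        warnings.append(
            f"{len(tools)} tools visible; target is {MAX_VISIBLE_TOOLS} or fewer"
        )
    for name, desc in tools.items():
        words = desc.split()
        # Crude sentence split on ., !, ? (will miscount abbreviations like "e.g.").
        sentences = [s for s in re.split(r"[.!?]+\s*", desc.strip()) if s]
        if len(sentences) > 1:
            warnings.append(f"{name}: {len(sentences)} sentences, aim for one")
        if len(words) > MAX_WORDS:
            warnings.append(f"{name}: {len(words)} words, budget is {MAX_WORDS}")
    return warnings


# Hypothetical tools: one tight description, one padded with marketing copy.
tools = {
    "search_customers": "Searches the customer database by email or account ID.",
    "crm_sync": (
        "Enterprise-grade integration layer with robust error handling "
        "and full observability. Syncs CRM records."
    ),
}

for warning in audit_tools(tools):
    print(warning)
```

Run it before and after a cleanup pass; an empty warning list is a reasonable sign the menu is short enough to start testing.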

My bet: most of it gets better.

Frequently Asked Questions

Q: Should I turn off MCP servers I’m not actively using?

Yes. Every connected server’s tool descriptions sit in context on every turn, even if unused. Only connect servers relevant to your current task to avoid paying token costs for tools you won’t use.

Q: What’s the “subtraction move” for tool selection?

Instead of giving the model access to a massive tool catalog at once, use a two-step approach: first, call a lightweight “catalog search” tool that finds relevant tools by name or category; then invoke the specific tool it returns. This keeps context lean while maintaining flexibility.

Q: How should I write tool descriptions to save tokens?

Treat them like function signatures: one sentence, verb-first, no marketing copy. Rewriting tool descriptions down to one-liners was one of the highest-leverage optimizations discussed and directly improved tool selection accuracy.

Q: Is semantic embedding the best way to rank tools?

Surprisingly, no. BM25 (keyword-based ranking) often outperforms embeddings for short, structured tool names. Rather than assuming embeddings are the obvious choice, test both approaches for your use case.
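
One way to run that comparison is a small hit-rate check over labeled requests. The harness below is a sketch: the test cases and the stand-in `keyword_overlap_ranker` are hypothetical, and in practice you would pass in your own BM25 and embedding rankers with the same query-in, ranked-names-out shape.

```python
from typing import Callable

# A ranker takes a user request and returns tool names, best match first.
Ranker = Callable[[str], list[str]]


def hit_rate_at_k(ranker: Ranker, cases: list[tuple[str, str]], k: int = 3) -> float:
    """Fraction of requests whose expected tool appears in the top k results."""
    hits = sum(1 for query, expected in cases if expected in ranker(query)[:k])
    return hits / len(cases)


# Hypothetical labeled cases: (user request, tool that should be selected).
CASES = [
    ("schedule a meeting tomorrow", "create_calendar_event"),
    ("find the customer with this email", "search_customers"),
]


def keyword_overlap_ranker(query: str) -> list[str]:
    """Stand-in ranker for illustration only: counts shared words."""
    tools = {
        "create_calendar_event": "creates a calendar event meeting",
        "search_customers": "searches customers by email",
    }
    words = set(query.lower().split())
    return sorted(tools, key=lambda name: -len(words & set(tools[name].split())))


print(hit_rate_at_k(keyword_overlap_ranker, CASES, k=1))
```

Score each candidate ranker on the same cases and the same `k`, and let the numbers decide.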

5 things I believed about MCP and tool use that turned out to be completely wrong
by u/AbjectBug5885 in PromptEngineering
