LLM Tool Use Flaw: Newer Claude Models Struggle with Custom Tools

A newer, more capable model should be better at everything, right? Not quite. According to Simon Willison, who flagged a sharp observation from developer Armin Ronacher, Anthropic’s latest models are actually getting worse at one specific job: using custom edit tools built into third-party coding harnesses.

Armin ran into this while hacking on Pi, his own coding tool. Newer Claude models sometimes call Pi’s edit tool with extra, invented fields stuffed into the nested edits[] array. The edit itself is usually correct. But the model makes up keys that don’t match the schema, so Pi rejects the call and asks it to try again. Annoying, and slow.
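To make the failure concrete, here is a minimal sketch of a strict validator that rejects an otherwise correct edit because of one invented key. Pi's real schema is not shown in the source, so the field names `old_text` and `new_text`, the `reason` key, and the function `validate_edit_call` are all hypothetical.

```python
# Hypothetical field names; Pi's actual edit schema is not given in the source.
ALLOWED_EDIT_KEYS = {"old_text", "new_text"}


def validate_edit_call(call: dict) -> list[str]:
    """Return a list of schema errors; an empty list means the call is accepted."""
    edits = call.get("edits")
    if not isinstance(edits, list):
        return ["'edits' must be a list"]

    errors = []
    for i, edit in enumerate(edits):
        if not isinstance(edit, dict):
            errors.append(f"edits[{i}] must be an object")
            continue
        missing = ALLOWED_EDIT_KEYS - set(edit)
        extra = set(edit) - ALLOWED_EDIT_KEYS
        if missing:
            errors.append(f"edits[{i}]: missing keys {sorted(missing)}")
        if extra:
            # Strict mode: any key outside the schema fails the whole call.
            errors.append(f"edits[{i}]: unexpected keys {sorted(extra)}")
    return errors


# The edit itself is fine, but the invented "reason" key triggers a rejection.
call = {
    "path": "app.py",
    "edits": [{"old_text": "x = 1", "new_text": "x = 2", "reason": "bump value"}],
}
print(validate_edit_call(call))
```

Under this kind of strict check, the harness has to send the errors back and wait for the model to retry, which is where the extra latency comes from.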

Here’s the twist that makes this worth your attention: it’s not the small models fumbling. It’s the flagships.

The myths worth killing

Myth 1: Newer models are strictly better at tool use. Wrong. As Willison reports, both Opus 4.8 and Sonnet 5 trip over Pi’s edit schema, while older Anthropic models handle it fine. The state-of-the-art models in the family are worse at this exact task than their predecessors. Capability isn’t a single dial that only turns up.

Myth 2: Malformed tool calls only come from weak models. Also wrong. Small models do emit garbage tool calls, sure. But this is Opus 4.8, the top of the lineup, inventing fields that were never in the schema. Size and smarts don’t guarantee clean formatting when the tool is unfamiliar.

Myth 3: If it works in Claude Code, it works everywhere. This is the real trap. Armin’s theory: recent Anthropic models have been trained, likely through reinforcement learning, to nail the edit tools baked into Claude Code. That training makes Claude Code hum. It also creates a bias. When a model meets a different edit tool with a different schema, it reaches for the patterns it was drilled on and pollutes the call.

What stands out here is how this mirrors OpenAI. Claude’s edit tool uses search and replace. OpenAI’s Codex uses an apply_patch mechanism, and OpenAI has openly talked about training its models to use that tool well. Both labs are optimizing their models for their own harnesses. Good for their first-party experience. Quietly rough on everyone building outside the walls.

Why it matters now

The coding-agent market is heating up, and a lot of that action lives in third-party harnesses: Pi, Cursor, Cline, Aider, and the rest. These tools let you swap models freely. That flexibility is the whole pitch.

But if each model is silently tuned to its maker’s native tool format, model-swapping stops being free. The same prompt and schema can produce clean calls on one model and broken ones on another. Tool design is quietly becoming model-specific, and that fractures the neutral-harness promise a lot of developers are betting on.

This points to a wider pattern. As labs pour RL into their own agent products, they optimize for a narrow, blessed path. Everything off that path can regress, even as benchmark scores climb. Better on paper. Worse in your stack.

What to do about it

For developers building or choosing coding harnesses:

Match the tool to the model. Willison floats the obvious fix: implement multiple edit tools, then route to the one that performs best for whichever model the user picked. Search-and-replace style for Claude, apply_patch style for OpenAI models.
Test tool calls per model, not once. Don’t assume a schema that works today survives the next model release. Validate against each model you support, and re-check on every upgrade.
Build tolerant parsers. If models like to invent extra fields, consider ignoring unknown keys instead of hard-rejecting the whole call. A stricter schema isn’t always the safer one.
Watch your retry rates. A spike in rejected tool calls after a model bump is a signal, not noise.
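The first, third, and fourth suggestions above could be combined roughly as follows. This is a sketch under stated assumptions, not how Pi or any other harness actually does it: the model-name prefixes, the tool names `search_replace` and `apply_patch` as routing targets, the field names `old_text` and `new_text`, and the helpers `pick_edit_tool` and `normalize_edits` are all hypothetical.

```python
from collections import Counter

# Hypothetical schema fields for a search-and-replace style edit.
ALLOWED_EDIT_KEYS = {"old_text", "new_text"}

# Hypothetical routing table: model-name prefix -> edit tool to expose.
EDIT_TOOL_BY_PREFIX = {
    "claude": "search_replace",
    "gpt": "apply_patch",
}
DEFAULT_EDIT_TOOL = "search_replace"

# Counts invented keys per (model, key) so a spike after an upgrade is visible.
dropped_keys: Counter = Counter()


def pick_edit_tool(model_name: str) -> str:
    """Choose the edit tool style for the model the user selected."""
    lowered = model_name.lower()
    for prefix, tool in EDIT_TOOL_BY_PREFIX.items():
        if lowered.startswith(prefix):
            return tool
    return DEFAULT_EDIT_TOOL


def normalize_edits(model_name: str, edits: list[dict]) -> list[dict]:
    """Drop unknown keys instead of rejecting the call; still fail on missing ones."""
    cleaned = []
    for i, edit in enumerate(edits):
        missing = ALLOWED_EDIT_KEYS - set(edit)
        if missing:
            # A missing required field cannot be repaired, so this stays an error.
            raise ValueError(f"edits[{i}]: missing keys {sorted(missing)}")
        for key in set(edit) - ALLOWED_EDIT_KEYS:
            dropped_keys[(model_name, key)] += 1
        cleaned.append({key: edit[key] for key in sorted(ALLOWED_EDIT_KEYS)})
    return cleaned
```

Logging the dropped keys rather than discarding them silently keeps the tolerance from hiding a regression: the call succeeds, but the counter still shows which model started inventing which fields.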

For businesses picking an AI coding stack: don’t treat model upgrades as automatic wins. Benchmark them inside your actual harness before you roll them out.

The headline lesson is uncomfortable but useful. A model that scores higher can still perform worse in your specific setup, because it was trained for someone else’s. Willison’s writeup is a small bug report with a big implication, and it’s worth reading in full at the original source.
