Rococo: Why Doubling AI Tokens May Improve Agent Reasoning

Someone shipped a coding skill this week with one stated goal: make AI agents use more tokens, not fewer. It’s called Rococo. And the fact that it’s not entirely a joke is the interesting part.

Rococo is the mirror image of caveman, a tool that strips AI outputs down to bare-minimum terse responses. Where caveman goes lean, Rococo goes ornate. Indirect. Overfurnished. Ceremonially unnecessary. The underlying technical content stays correct, but the prose wrapping it gets thoroughly decorated.

Here’s the twist: the author benchmarked it, and something unexpected showed up.

Output tokens jumped from 266 to 556 on a standard coding benchmark, a 2.09x increase. In multi-mode testing, average visible completion tokens rose from 124 in plain mode to 393 at the most excessive setting, roughly 3.2x. But the detail that actually matters is a quiet observation in the benchmark notes: “It was supposed to decorate the prose. It may now be redecorating the hallway that leads to it.” In other words, the verbosity might be shifting how the model reasons, not just how it talks.
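To make the arithmetic concrete, here is a minimal sketch that recomputes the multipliers from the token counts quoted above. The function name `token_ratio` is just an illustration, not something from the Rococo repo.

```python
def token_ratio(baseline: int, verbose: int) -> float:
    """Return how many times larger the verbose count is than the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return verbose / baseline


# Token counts quoted in the post.
coding_ratio = token_ratio(266, 556)      # standard coding benchmark
multi_mode_ratio = token_ratio(124, 393)  # plain mode vs. most excessive setting

print(f"coding benchmark: {coding_ratio:.2f}x")   # 2.09x
print(f"multi-mode test:  {multi_mode_ratio:.2f}x")  # 3.17x
```

So the headline "doubling" applies to the coding benchmark, while the most excessive mode lands closer to a tripling.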

Here’s how the skill works in practice:

⚙️ Install Rococo as a skill in your coding agent environment
📝 Pick your verbosity level via config-based activation (multiple modes, from slightly ornate to maximally excessive)
✅ Guardrails prevent it from decorating JSON outputs, so structured data stays valid
🔍 Run your own benchmarks and watch whether reasoning patterns shift alongside the prose
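If you want to try the last two steps yourself, here is one rough way to do it. It's a sketch under assumptions, not Rococo's own tooling: `is_valid_json`, `rough_token_count`, and `compare_runs` are hypothetical helpers, the sample strings are invented, and the whitespace split is only a crude stand-in for a real tokenizer.

```python
import json


def is_valid_json(text: str) -> bool:
    """True if the whole string parses as JSON (no decorative prose around it)."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


def rough_token_count(text: str) -> int:
    """Crude proxy: count whitespace-separated chunks, not real model tokens."""
    return len(text.split())


def compare_runs(plain: str, ornate: str) -> dict:
    """Compare two completions of the same prompt by approximate length."""
    plain_tokens = rough_token_count(plain)
    ornate_tokens = rough_token_count(ornate)
    ratio = ornate_tokens / plain_tokens if plain_tokens else None
    return {
        "plain_tokens": plain_tokens,
        "ornate_tokens": ornate_tokens,
        "ratio": ratio,
    }


# Invented sample outputs, for illustration only.
plain_answer = "Use a dict for O(1) lookups."
ornate_answer = "One might, with some ceremony, reach for a dict to obtain O(1) lookups."

print(compare_runs(plain_answer, ornate_answer))
print(is_valid_json('{"status": "ok"}'))          # structured output left alone
print(is_valid_json('Behold: {"status": "ok"}'))  # decorated output fails to parse
```

For real measurements, swap the whitespace proxy for the token counts your model provider reports.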

Pro tip: This isn’t “use this in production.” It’s “use this to understand what actually happens when your agent thinks out loud at full volume.” If you’re exploring how verbosity affects reasoning quality in agentic workflows, Rococo gives you a controlled way to test that hypothesis with real numbers attached.

The community reaction nailed the meta-joke: companies tracking AI token usage with a straight face are creating weird incentives. Someone building a tool that games those incentives in reverse, as a genuine thought experiment with benchmarks, is exactly the kind of creative pressure-testing this field needs right now.

The repo is at github.com/Yifeeeeei/rococo. Worth ten minutes if you’re curious about what lives at the intersection of verbosity and reasoning depth.

What’s the most counterintuitive prompt behavior you’ve tested? Drop it in the comments 👇

Frequently Asked Questions

Q: Does Rococo break JSON or other structured outputs?

Nope. The creator specifically built in guardrails to keep structured formats like JSON intact. Testing confirmed JSON stayed valid even at the highest verbosity settings, so you don’t have to worry about your structured outputs getting decorated.

Q: How much will my token usage actually increase?

Quite a bit. Benchmarks show about a 2x token increase on the Tiny Codex benchmark (266 to 556 tokens). In multi-mode testing, average visible completion tokens went from 124 in plain mode to 393 at the most excessive setting, roughly 3.2x. Since Rococo has multiple verbosity levels with config-based activation, you can dial in how many extra tokens you want to spend.

Q: Has correctness been tested beyond just counting tokens?

Good question. The creator hasn’t published formal eval results like unit test pass rates across different settings yet. Structured outputs stayed valid in testing, but seeing correctness benchmarks at each verbosity level would be really valuable data.
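If you want to collect that data yourself, a minimal sketch is to tally unit test pass rates per verbosity level. Everything here is an assumption for illustration: the level names `"plain"` and `"max"` are placeholders rather than Rococo's actual mode names, and the sample results are made up, not measured.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def pass_rates(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Given (verbosity_level, test_passed) pairs, return the pass rate per level."""
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for level, passed in results:
        totals[level] += 1
        if passed:
            passes[level] += 1
    return {level: passes[level] / totals[level] for level in totals}


# Placeholder outcomes only; replace with results from your own test runs.
sample_results = [
    ("plain", True),
    ("plain", False),
    ("max", True),
    ("max", True),
]

print(pass_rates(sample_results))
```

Running the same task set at each level and comparing the rates would show whether the extra tokens change correctness or only the prose.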

Q: Can I control how verbose it gets, or is it all-or-nothing?

Full control. Rococo includes multiple verbosity levels that you can toggle via config, so you can pick exactly how ornate your agents should sound.

Source: “Inspired by caveman, I built a skill to do the same things with more tokens” by u/Parking_Bite_6416 in r/PromptEngineering
