Model Handoff: Fix AI Agent Memory & Reduce Development Costs

A developer got tired of AI agents forgetting the architecture mid-project and overwriting working code. He fixed it with a single Markdown file. The Model Handoff feature is the part worth stealing.

What it is

BEMYAGENT is one file, BEMYAGENT.md, dropped into your project root. Tell your AI “Execute BEMYAGENT.md bootstrap” and it generates a clean memory structure automatically.

Two folders. Two purposes.

docs/ is immutable truth: architecture overview, code map, feature specs. Think of it as the constitution your AI agent cannot rewrite. The moment you allow an agent to edit its own source-of-truth, you lose the ability to trust what it knows. So docs/ stays read-only, and the protocol enforces that explicitly through instructions baked into the file itself.

The AI is instructed never to read feature specs unless the current task strictly requires it. That is Lazy Loading baked into the protocol itself. Why does this matter? Because agents that read everything upfront burn tokens on context that does not influence the current task. If you are fixing a bug in the payment module, the agent does not need to load the entire onboarding spec. Lazy Loading keeps the context window focused and the cost per task low.

work/ is volatile memory. Every task runs through a Think-Task-Execute cycle. The agent writes out its reasoning before touching any code. If a task is too large to execute directly, the AI must decompose it into sub-tasks first. No blind code generation. This sounds obvious until you have watched an agent confidently rewrite three files and break two of them because it skipped the thinking step. The cycle is what prevents that. It forces the model to surface its plan before it acts, which gives you a natural checkpoint to catch bad assumptions early.

The twist: Model Handoff

This is the clever part.

Switch the agent to INTERACTIVE mode. A heavy model (Claude Sonnet, o1) writes the strategy document and then stops. You swap to a fast, cheap model (Haiku, Flash) and tell it to execute the code.

Heavy model thinks once. Cheap model codes. Token costs drop without losing quality on the planning pass.

The reason this works is that planning and execution require different things from a model. Planning needs deep reasoning: understanding tradeoffs, anticipating edge cases, structuring a sequence of steps that will not conflict with each other. That is where the expensive model earns its price. Execution, once the plan is locked in, is mostly pattern matching. The cheap model reads a clear spec and writes the code to match it. It does not need to reason about architecture because that work is already done.

In practice, a complex feature that might cost you 40 cents with Sonnet end-to-end might cost 8 cents with Handoff. The planning pass runs maybe 15 cents. The execution pass on Haiku costs a few more. The plan does not change between models. You literally copy the strategy document from the heavy model’s output and hand it to the fast model as context.

How to set it up

📁 Clone the repo at github.com/vitotafuni/bemyagent
⚙️ Drop BEMYAGENT.md into your project root. No install, no config file, no environment variables. It is just a Markdown file.
🧠 Tell your AI: “Execute BEMYAGENT.md bootstrap.” The agent reads the file, generates the folder structure, and writes the initial docs/architecture.md based on whatever it can infer from your project.
🔀 Let it scaffold the docs/ and work/ folders. Review docs/architecture.md before moving on. Correct anything wrong. That file is the foundation everything else builds on, so getting it accurate upfront saves you real pain later.
Enable INTERACTIVE mode when working on anything complex. In practice, this means telling the model to stop after producing the strategy document and wait for your confirmation before writing any code.

Works with any AI that reads Markdown: Claude, Cursor, Aider, Cline, Copilot, Gemini.

Pro tip

Only use Model Handoff on tasks touching architecture or multiple modules. For quick bug fixes, skip the planning pass. That is where the savings actually stack up over time.

Here is a simple mental filter: if the task touches more than one file or requires understanding how two parts of the system interact, use Handoff. If it is a single function fix or a copy tweak, go straight to the fast model. You will build an intuition for the line within a week of using the protocol.

One more thing worth knowing: the strategy document the heavy model writes is reusable. If you pause a feature and come back three days later, you do not need to re-explain the architecture to the cheap model. Feed it the strategy document from the first session and it picks up exactly where you left off. That is the sleeper value here. Persistent, transferable plans that outlast the context window.

Have you tried enforcing Lazy Loading in your own agent setups? Drop what worked for you in the comments. 🚀

Frequently Asked Questions

Q: What’s the key benefit of Lazy Loading in BEMYAGENT?

By having the AI only read files when necessary rather than loading everything upfront, you keep the context window lean and focused. Users report this is a “game changer” for reducing hallucinations and preventing accidental code rewrites.

Q: Why is the separation between immutable docs and volatile work important?

The immutable `docs/` folder serves as your locked source of truth for architecture and specs. The volatile `work/` folder is where the AI experiments and iterates. This separation prevents the agent from accidentally modifying your core codebase while exploring tasks.

Q: Should I add a preflight checklist when using BEMYAGENT?

Definitely. Have the AI start by reading the repo tree, reviewing `docs/01-overview`, then only opening files mentioned in the code-map. This reduces accidental edits and helps the AI stay focused on the parts of your project that matter for the current task.

Q: How does the model handoff feature reduce token costs?

Use a capable model (Sonnet or o1) to write the thinking strategy in `01_think.md`, then pause and switch to a fast/cheap model (Haiku or Flash) for execution. You get high-quality planning once at high cost, then low-cost execution, resulting in significant overall savings.

I got tired of AI agents destroying my codebase and eating tokens, so I built a self-bootstrapping Markdown protocol to fix their memory.
by u/vitotafuni in PromptEngineering

What it is

The twist: Model Handoff

How to set it up

Pro tip

Frequently Asked Questions

Related: