Everyone’s Debating Models. The Builders Getting 10x Results Are Doing Something Else.

Most people are fighting the wrong battle.

Claude vs GPT. Benchmark comparisons. Twitter wars about which model is smarter. A developer on r/PromptEngineering spent a full day watching every major AI agent tutorial of 2026 and came back with a finding that cuts through all of it.

The models are good enough. The gap between Opus 4.6 and GPT 5.4 is nearly irrelevant.

What actually separates people getting 10x results is the architecture around the model.

The Old Way: Just Prompt It Better

Most people treat AI agents like a slightly smarter chatbot. Better prompt in, better output out. So they A/B test system prompts, chase the newest model, and wonder why results are inconsistent.

You see it everywhere. Someone posts a 3,000-word system prompt on Reddit, gets decent results for a day, then the model updates and everything breaks. They spend the next weekend reverse-engineering what changed. Someone else buys a course on “advanced prompting techniques” that’s already six months out of date. The whole cycle repeats.

That’s not the lever. Not anymore.

The problem isn’t the quality of your prompt. The problem is that a prompt disappears the moment the conversation ends. Nothing carries forward. The model starts from zero every single time. You’re not building anything. You’re just typing better.

🏗️ The New Way: Build the Context Layer

The builders winning in 2026 aren’t better prompters. They’re building infrastructure around the model:

  • 📂 Context files that persist across sessions
  • 🧠 Memory files that store what the agent learned over time
  • 🔗 MCP connections giving the agent real tools to work with
  • Reusable skills instead of bloated instruction files

Here’s the number that should stop you cold: reusable skills cost around 53 tokens per turn. Equivalent agents.md entries cost 944+ tokens per turn. That’s roughly an 18x difference. On long sessions, that gap doesn’t just hurt performance. It compounds until the whole thing falls apart.

Think about what that means at scale. Run a 20-turn session with a monolithic instruction file and you’ve burned through roughly 19,000 tokens (20 × 944) on instruction overhead alone, none of it spent on the actual task. Run the same session with modular skills and that overhead drops to about 1,000 (20 × 53). The model has more room to think. Outputs get sharper. Errors get rarer. The architecture is doing work the prompt never could.
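If you want to check that arithmetic yourself, here is a minimal sketch. It assumes the per-turn cost stays flat across the session, which is a simplification; the function name is just for illustration.

```python
# Per-turn overhead figures quoted above.
SKILL_TOKENS_PER_TURN = 53
AGENTS_MD_TOKENS_PER_TURN = 944


def context_overhead(tokens_per_turn: int, turns: int) -> int:
    """Total instruction tokens over a session, assuming a flat per-turn cost."""
    return tokens_per_turn * turns


turns = 20
monolithic = context_overhead(AGENTS_MD_TOKENS_PER_TURN, turns)
modular = context_overhead(SKILL_TOKENS_PER_TURN, turns)
ratio = AGENTS_MD_TOKENS_PER_TURN / SKILL_TOKENS_PER_TURN

print(f"agents.md overhead: {monolithic:,} tokens")  # 18,880
print(f"skills overhead:    {modular:,} tokens")     # 1,060
print(f"ratio:              {ratio:.1f}x")           # 17.8x
```

Swap in your own turn count to see how quickly the gap grows.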

The Karpathy Method (Three Steps, Seriously That’s It)

Andrej Karpathy’s approach to reliable AI output is almost insultingly simple:

  1. Write a spec before you start. Not a vibe. An actual written spec the model can reference. This doesn’t mean a novel. It means a clear document that defines the goal, the constraints, the expected output format, and any non-obvious rules the model should follow. One page is often enough. The key is that the spec lives outside the conversation, so the model can refer back to it without relying on context that might get compressed or lost.
  2. Maintain a scratchpad as you work. The model tracks its own reasoning and state as it goes. This is different from just thinking out loud. A real scratchpad captures decisions made, dead ends hit, and assumptions the model is operating under. When something breaks halfway through a long task, the scratchpad tells you exactly where the reasoning went sideways instead of forcing you to reconstruct it from scratch.
  3. Feed every failure back into the system permanently. Failures become rules. Rules compound over time. The agent gets something wrong? That failure doesn’t disappear into chat history. It gets documented, categorized, and turned into a constraint the model carries into every future session. The system gets smarter at exactly the points where it was previously dumb.

This approach dropped documented mistake rates from 41% to 11%. Not because the model got smarter. Because the system around it did.
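Here is one way the three steps could fit together in code. This is a rough sketch, not Karpathy’s own implementation: the `AgentSession` class and its method names are hypothetical, and you would wire `build_context()` into whatever agent framework you actually use.

```python
from dataclasses import dataclass, field


@dataclass
class AgentSession:
    """Hypothetical container for the three pieces: spec, scratchpad, rules."""

    spec: str
    rules: list[str] = field(default_factory=list)
    scratchpad: list[str] = field(default_factory=list)

    def note(self, kind: str, text: str) -> None:
        """Step 2: record a decision, dead end, or assumption while working."""
        self.scratchpad.append(f"[{kind}] {text}")

    def record_failure(self, what_went_wrong: str, rule: str) -> None:
        """Step 3: log the failure and promote it to a persistent rule."""
        self.note("failure", what_went_wrong)
        if rule not in self.rules:  # avoid piling up duplicate rules
            self.rules.append(rule)

    def build_context(self) -> str:
        """Assemble the spec (step 1) plus learned rules for the next session."""
        rule_lines = "\n".join(f"- {r}" for r in self.rules) or "- (none yet)"
        return f"# Spec\n{self.spec}\n\n# Rules learned from failures\n{rule_lines}"


session = AgentSession(spec="Goal: summarize weekly sales. Output: five bullet points.")
session.note("assumption", "Figures are in USD.")
session.record_failure("Summary ran to nine bullets.", "Never exceed five bullet points.")
context = session.build_context()
print(context)
```

The point of the design is that the rule outlives the conversation: the scratchpad explains what happened, and the rules list is what gets carried forward.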

What to Actually Build Right Now

If you’re working with AI agents, the question to ask isn’t “which model should I use?” It’s “what does my model know, remember, and have access to?”

Start with three things:

  1. A memory file. Anything the agent learns that should persist goes here. Client preferences, past failures, recurring patterns. Make it easy to update. The simpler the format, the more likely you’ll actually maintain it. Plain markdown works. The goal is that the agent walking into session 50 knows everything the agent knew after session 1, plus everything learned in between.
  2. Skills over agents.md. Reusable, token-efficient skill definitions beat monolithic instruction files on every long session. Break your instructions into named skills the model can load on demand. A content generation workflow, a research workflow, a formatting workflow. Each one small. Each one focused. The total context stays manageable even as the system grows more capable.
  3. A failure log. Every time the agent gets something wrong, document it and feed it back. This is the Karpathy loop in practice. It takes about 30 seconds per failure. Over a month of daily use, you end up with a system hardened against every specific failure mode you’ve actually encountered, not hypothetical ones someone else wrote a blog post about.
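To make the memory file and on-demand skills concrete, here is a minimal sketch using plain markdown files. The file layout (`memory.md`, a `skills/` folder with one file per skill) and the function names are assumptions for illustration, not a standard. A failure log could reuse `remember()` with a "past failure" category.

```python
import tempfile
from pathlib import Path


def remember(memory_file: Path, category: str, fact: str) -> None:
    """Append one bullet to a plain-markdown memory file."""
    with memory_file.open("a", encoding="utf-8") as f:
        f.write(f"- **{category}**: {fact}\n")


def load_skill(skills_dir: Path, name: str) -> str:
    """Read a single named skill; every other skill stays out of context."""
    return (skills_dir / f"{name}.md").read_text(encoding="utf-8")


def build_context(memory_file: Path, skills_dir: Path, skill_name: str) -> str:
    """Combine persistent memory with only the skill this task needs."""
    memory = memory_file.read_text(encoding="utf-8") if memory_file.exists() else ""
    skill = load_skill(skills_dir, skill_name)
    return f"# Memory\n{memory}\n# Skill: {skill_name}\n{skill}"


# Demo in a throwaway directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    skills = root / "skills"
    skills.mkdir()
    (skills / "research.md").write_text("Cite a source for every claim.\n", encoding="utf-8")
    (skills / "formatting.md").write_text("Use sentence-case headings.\n", encoding="utf-8")

    memory = root / "memory.md"
    remember(memory, "client preference", "Prefers short summaries.")
    remember(memory, "past failure", "Missed the word limit on the March report.")

    context = build_context(memory, skills, "research")
    print(context)
```

Only the research skill ends up in the context here. The formatting skill exists on disk but costs nothing until a task asks for it.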

The Bottom Line

The model debate is noise. The architecture is the signal.

People spending hours comparing models are optimizing the wrong variable. The ones building context layers, memory systems, and reusable skills are the ones actually compounding their results. They get better every week because their systems get better. The model comparers are still stuck asking the same questions and getting the same inconsistent answers.

The gap between a raw model and a well-architected agent isn’t small. It’s the whole game!

Frequently Asked Questions

Q: Why does my agent start acting weird partway through longer sessions?

Token bloat from unscored config. If you’re using agents.md (944+ tokens per turn) or adding lines to your setup without testing them against real failures, you’ll hit token bloat that looks like the model getting dumber. The fix: switch to skills (53 tokens/turn), score your config against actual mistakes, and trim what doesn’t pull its weight. Most people don’t realize their sessions break because of token load, not because the model lacks intelligence.
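What "scoring your config" looks like will vary by setup. As one possible sketch: tag each rule with the number of real failures it addresses, drop the rules tied to none, and rank the rest by failures prevented per token. The scoring formula and the word-count token estimate below are assumptions for illustration, not an established method.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    text: str
    failures_prevented: int  # counted by hand from your own failure log

    @property
    def approx_tokens(self) -> int:
        # Crude proxy: one token per whitespace-separated word.
        # Real tokenizers will give different counts.
        return max(1, len(self.text.split()))


def trim(rules: list[Rule]) -> list[Rule]:
    """Keep only rules tied to at least one real failure, best value first."""
    kept = [r for r in rules if r.failures_prevented > 0]
    return sorted(
        kept,
        key=lambda r: r.failures_prevented / r.approx_tokens,
        reverse=True,
    )


rules = [
    Rule("Never exceed five bullet points.", failures_prevented=3),
    Rule("Always be helpful, thorough, friendly, and professional.", failures_prevented=0),
    Rule("Ask before deleting files.", failures_prevented=2),
]
kept = trim(rules)
for rule in kept:
    print(rule.failures_prevented, rule.approx_tokens, rule.text)
```

The generic "be helpful" rule is the one that gets cut: it costs tokens on every turn and maps to no failure you have actually seen.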

Q: My agent hallucinates even though I have detailed instructions everywhere. How do I actually fix this?

Instructions alone don’t fix hallucinations if your architecture is inefficient. Use Karpathy’s method: write a spec first (most people skip this step), keep a scratchpad that captures every failure, then feed those failures back into your system permanently. Architecture beats instructions, and feedback loops beat both. Bonus: cut your token usage by preferring skills over agents.md. That efficiency gain alone helps.

Q: What’s the simplest way to implement Karpathy’s method?

Three steps: (1) write your spec before you touch the code, not after (most people get this backwards), (2) maintain a scratchpad where failures go, (3) feed every failure back into your config. The part most people miss: most setups implement only about 4 out of 12 config best practices. The discipline is scoring your setup against real mistakes instead of just adding rules until something works.

Q: Should I use MCPs or call CLI tools like Playwright directly?

MCPs are convenient, but their tool definitions are passed on every call, which eats tokens on long sessions. Models are now capable enough to drive CLI tools directly (Playwright, GitHub, etc.). Test both: MCPs win on integration convenience, CLI wins when token efficiency matters. On long sessions, most people end up switching to CLI.

I spent a full day watching every major AI agent tutorial in 2026 – here’s what actually matters
by u/Akhil_vallala in PromptEngineering
