Harness Engineering: Scale AI Agents & Fix Context Issues

Three files deep into a refactor, the AI deleted a function it had written forty minutes earlier. No warning. Just gone.

If you’ve used Cursor, Claude Code, or GitHub Copilot Workspaces on anything beyond a toy project, you’ve met this version of the AI. Brilliant for the first 80%. Then the codebase grows, context fragments, and the agent starts undoing its own work. It’s not malicious. It’s just that the AI genuinely doesn’t remember what it did before. Every session, it’s meeting your project for the first time.

A developer going by u/Exact_Pen_8973 got tired of it and started researching how companies like Stripe handle AI at real scale. Stripe reportedly runs AI that merges 1,300 pull requests a week. Payment systems stay up. The secret isn’t a smarter model. It’s a discipline called harness engineering.

🔍 Why this matters

The 80% wall is real and it has a predictable cause.

AI coding agents are stateless by default. Each session, they reconstruct context from whatever files happen to be open. When a project is small, that works fine. When it grows, the agent loses the thread, makes contradictory changes, and breaks things it built ten minutes ago.

Think of it like hiring a contractor who shows up each morning with no memory of yesterday. Talented, fast, eager to help. But if you don’t leave them detailed notes, they’ll start framing a wall where you just finished drywalling. The notes aren’t optional. They’re the whole job.

Harness engineering fixes this by building the context management, constraints, and verification loops yourself. The AI doesn’t get smarter. It gets a better environment to work in. That turns out to be enough.

🛠️ The exact folder structure to set up today

Here’s what goes where:

CLAUDE.md (or .cursorrules)
Root instruction file. Project overview, tech stack, and rules the AI cannot violate. Write it once, update it when the rules change. Think of it as the onboarding doc for a new teammate who forgets everything between shifts. Be specific: “We use Postgres, not SQLite” or “never remove error handling without a replacement” are the kinds of lines that prevent real bugs.

.claude/rules/
Governance layer. Security policies, coding conventions, anything where “the AI used its judgment” is not a good answer. Put it here and it applies every session. This is also where you capture lessons. When the agent does something unexpected, the rule you write after that incident is worth more than ten you try to write in advance.

.claude/skills/
Repeatable task patterns. If you always run database migrations a specific way, document it once here. Every session follows the same playbook without you having to re-explain it. This is especially useful for things that seem obvious to you but have non-obvious steps the agent will improvise around if you don’t spell them out.

docs/progress.md
This is the one that actually changes things. Have the agent read it at the start of every session and update it at the end. It acts as a handoff note between sessions so context doesn’t evaporate overnight. The agent knows what was completed, what’s in flight, and what decisions were made along the way. Without it, every session starts cold.

The mechanic is simple: when the AI makes a mistake, don’t just fix the code. Update a rule in the harness. The system gets more reliable every session. Compound it over a week and the difference is noticeable.

💡 Tips that make it stick

Start with CLAUDE.md today. Even a rough version outperforms nothing.
Keep docs/progress.md to five bullet points max. What’s done, what’s next, what the blockers are. Anything longer gets skimmed past. If it takes more than 30 seconds to read, it’s too long.
Add rules to .claude/rules/ reactively. Something broke? Write the rule. Don’t try to anticipate everything upfront. You’ll end up with 40 rules that cover things that never happen and miss the one thing that always does.
Version control the harness. If the AI follows a bad rule, you want to roll it back. Treat the harness like production config: changes go through git, not vibes.
Test it in a fresh session. Start clean, see if the agent picks up context correctly. Surface breaks early before they become a debugging session at midnight.
Review the harness every few weeks. Projects evolve, and a rule that made sense in week one might be creating friction by week six. The harness should grow with the project, not become dead weight.

Community experience confirmed it. One commenter noted: “The secret sauce isn’t a bigger model but a harness that matches your repo.” Another added that a docs/progress.md handoff plus hard rules is “the part that survives when the agent decides to remix your repo at 2am.”

That tracks.

🚀 Start here

Create CLAUDE.md in your current side project today. Add the project name, the tech stack, and three rules the AI must follow. That’s it. Ten minutes of setup now saves hours of cleanup later.

The harness gets more powerful as it grows. But it needs a starting point.

Full breakdown with case studies from Anthropic and OpenAI at the original source.

Frequently Asked Questions

Q: Do I need to customize the harness for my project, or can I copy the recommended structure as-is?

Generic folder structures are a solid starting point, but adapt them to your actual tech stack and team practices. Every project is different, treat CLAUDE.md, .claude/rules/, and docs/progress.md as templates, not gospel. As you hit issues, refine the rules to prevent them next time.

Q: Will upgrading to a better AI model make the harness less important?

Nope. The harness (rules, constraints, progress tracking) matters way more than the model itself. A solid harness keeps the AI from losing context between sessions and breaking existing fixes, better models help, but they don’t replace good structure.

Q: Why is docs/progress.md better than just writing a really detailed initial prompt?

The AI reads docs/progress.md at the start of each session and knows what was already done. Without it, the AI wakes up with amnesia and might redo work or break fixes, progress tracking is the difference between one-off fixes and a self-improving system.

Q: Does this harness approach work with Cursor or Copilot, or just Claude?

Works with any AI coding agent, Cursor, Copilot, Claude Code. Just swap CLAUDE.md for .cursorrules if that’s your tool; the underlying principles stay universal.

I was tired of AI coding agents breaking my projects. Here is the “Harness” framework that fixed it.
by u/Exact_Pen_8973 in PromptEngineering

🔍 Why this matters

🛠️ The exact folder structure to set up today

💡 Tips that make it stick

🚀 Start here

Frequently Asked Questions

Related: