Harness Engineering: Fix Broken AI Agents in Production

Quick Start: You’ll learn why production AI agents keep breaking and how to build the deterministic “harness” layer that actually fixes it. No special tools required, just 4 backend primitives and about 20 lines of code.

Most developers debugging a broken AI agent reach for the same tool: the system prompt. They add more instructions, more examples, more constraints. The agent still breaks.

According to 2026 enterprise data, 88% of AI agent projects fail to reach production for exactly this reason: teams keep tuning prompts when the failure has nothing to do with the prompts.

The fix is a concept called Harness Engineering, recently framed by developer Mitchell Hashimoto. Here’s what it actually means for anyone building production AI apps.

The Old Way vs. The New Way

The old way: describe what you want in the system prompt and let the AI handle orchestration, retries, and state tracking.

The problem? Task routing, failure handling, and state management are classical computer science problems. They need to be deterministic. Leaving them to an LLM is a prayer, not a plan. Think about what happens when your network drops mid-task: the LLM doesn’t know if it succeeded. It guesses. Sometimes it guesses wrong twice.

The new way treats every AI agent as two separate layers:

The Brain (LLM layer): Decides what task to tackle next. Evaluates output quality. Provides feedback for revisions.
The Body (Harness layer): Handles everything else. Routing, retries, state tracking, failure recovery. All deterministic.

The counterintuitive part: as models get smarter, the harness matters more. A 100x more capable model is just 100x more capable of making complex mistakes with confidence. GPT-3 would fail obviously. A frontier model will fail in ways that pass your eyeball test and break your users three weeks later.

⚙️ The 4 Primitives You Can’t Skip

If your agent does more than one thing autonomously, your harness needs all four of these:

State Machine: Every task lives in a known state: pending, in_progress, done, failed. Without this, your agent picks up in-progress tasks and executes them twice on every restart. Real example: a code generation agent that restarts mid-run and creates duplicate pull requests because it has no idea it already started the job.
Idempotency Guards: Every operation gets an idempotency key. Network timeout triggers a retry? The harness ensures the user’s card doesn’t get charged twice. This is a 20-year-old pattern from payment systems. AI agents need it just as badly.
DAG (Directed Acyclic Graph): A simple dependency map. Task B doesn’t run until Task A completes. Prevents your agent from writing to a table before the migration runs. Even a basic two-level dependency list drawn on paper beats letting the LLM figure out sequencing on its own.
Priority and Dead Letter Queues: The harness decides what gets worked on first, not the model. When a task fails 3 times, it goes to a dead letter queue so you can debug it instead of wondering where it disappeared. Without this, failed tasks silently vanish and you find out when a customer emails you.
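
To make the first and third primitives concrete, here is a minimal sketch of a state machine and a dependency check. The transition table is an assumption (the four state names come from the list above, but which moves are legal is a design choice), and the names Status, ALLOWED, transition, and ready_tasks are hypothetical.

```python
from enum import Enum


class Status(str, Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    DONE = "done"
    FAILED = "failed"


# Assumed transition table: in_progress may return to pending for a retry,
# while done and failed are terminal.
ALLOWED = {
    Status.PENDING: {Status.IN_PROGRESS},
    Status.IN_PROGRESS: {Status.DONE, Status.FAILED, Status.PENDING},
    Status.DONE: set(),
    Status.FAILED: set(),
}


def transition(current: Status, new: Status) -> Status:
    """Return the new status, or raise if the move is not allowed."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {new.value}")
    return new


def ready_tasks(deps: dict[str, set[str]], status: dict[str, Status]) -> list[str]:
    """Pending tasks whose dependencies are all done, in sorted order."""
    return sorted(
        task
        for task, needs in deps.items()
        if status[task] is Status.PENDING
        and all(status[d] is Status.DONE for d in needs)
    )
```

With deps = {"migrate": set(), "write_rows": {"migrate"}}, the write task only shows up in ready_tasks once the migration has been moved to done. The harness makes that call, not the model.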

🛠️ The Minimum Viable Harness

You don’t need Temporal or Prefect to start. Here’s the smallest setup that works:

One database table: id, type, status, payload, attempts, error. That’s your entire state machine. SQLite works fine for low-volume apps. Postgres if you need concurrent workers.
A task dispatcher: 20 lines of code that queries the DB for the highest-priority pending task and hands it to the agent. The agent doesn’t choose its own work. This single constraint eliminates an entire category of runaway behavior.
Hard-coded retry policy: Max 3 attempts, exponential backoff. The agent cannot override this. Do not let the model decide whether to retry. It will always say yes.
Deterministic quality gates: Before anything leaves the system, does it compile? Do tests pass? This logic runs outside the LLM. Fail? The harness sends it back. The LLM gets a second shot at fixing the output, not at deciding whether the output is acceptable.
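
The four pieces above fit in one small file. What follows is a sketch under stated assumptions, not a reference implementation: it adds a priority column that the table description does not list, it assumes a single worker on SQLite, and agent and gate are hypothetical callables you supply (the LLM call and the compile/test check). The backoff helper only computes the delay; sleeping or scheduling is left to the caller.

```python
import sqlite3

MAX_ATTEMPTS = 3          # hard-coded; the model has no way to change it
BASE_DELAY_SECONDS = 2.0  # assumed base for exponential backoff

SCHEMA = """
CREATE TABLE IF NOT EXISTS tasks (
    id       INTEGER PRIMARY KEY,
    type     TEXT NOT NULL,
    status   TEXT NOT NULL DEFAULT 'pending',
    payload  TEXT NOT NULL,
    attempts INTEGER NOT NULL DEFAULT 0,
    error    TEXT,
    priority INTEGER NOT NULL DEFAULT 0   -- assumed extra column for ordering
)
"""


def backoff_delay(attempts: int) -> float:
    """Seconds to wait before the next attempt: 2, 4, 8, ..."""
    return BASE_DELAY_SECONDS * 2 ** (attempts - 1)


def claim_next(conn):
    """Mark the highest-priority pending task in_progress and return it."""
    with conn:
        row = conn.execute(
            "SELECT id, type, payload, attempts FROM tasks "
            "WHERE status = 'pending' ORDER BY priority DESC, id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute(
            "UPDATE tasks SET status = 'in_progress', attempts = attempts + 1 "
            "WHERE id = ?",
            (row[0],),
        )
    return {"id": row[0], "type": row[1], "payload": row[2], "attempts": row[3] + 1}


def run_once(conn, agent, gate):
    """Dispatch one task and return its resulting status, or None if idle."""
    task = claim_next(conn)
    if task is None:
        return None
    error = None
    try:
        output = agent(task)
        if not gate(output):  # the gate runs outside the LLM
            error = "quality gate failed"
    except Exception as exc:
        error = str(exc)
    if error is None:
        status = "done"
    elif task["attempts"] >= MAX_ATTEMPTS:
        status = "failed"   # stays in the table with its error: the dead letter queue
    else:
        status = "pending"  # eligible for another attempt
    with conn:
        conn.execute(
            "UPDATE tasks SET status = ?, error = ? WHERE id = ?",
            (status, error, task["id"]),
        )
    return status
```

The agent never sees the tasks table. It receives one task and returns one output, and everything about whether that output counts is decided in run_once.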

The Architecture-Aware Prompt Structure

Once the harness exists, use this 4-block structure for every prompt:

Role and Constraints: Tell the AI it’s a “harness-aware engineer.” No refactoring untouched files. No installing new dependencies without asking. Scope the role tightly so the model doesn’t expand it on its own.
Harness Rules: Inject your deterministic rules directly into context. RETRY_POLICY: max 3 attempts. TASK_STATES: pending → in_progress → done or failed. When the model knows the rules exist, it stops trying to invent its own.
Task Format: Specific task ID, exact target state, files in scope, explicit out-of-scope list. Vague task descriptions are where most prompt failures actually originate. The harness forces you to be specific before the LLM ever sees the input.
Response Shape: Force the AI to output a [PLAN] first, then [CHANGES], then a [VERIFICATION] step with exact commands to run against your quality gates. Structured output makes harness parsing trivial and catches hallucinations before they hit your codebase.
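
One way to wire the four blocks together is sketched below. The exact wording of each block and the function names build_prompt and parse_response are assumptions for illustration. The parser expects each tag on its own line and rejects any reply that skips or reorders a section, which is what lets the harness treat a malformed reply as a failed attempt.

```python
import re

SECTIONS = ("PLAN", "CHANGES", "VERIFICATION")


def build_prompt(task_id: str, target_state: str,
                 in_scope: list[str], out_of_scope: list[str]) -> str:
    """Assemble the four blocks: role, harness rules, task, response shape."""
    return "\n\n".join([
        "ROLE: harness-aware engineer. Do not refactor untouched files. "
        "Do not install new dependencies without asking.",
        "HARNESS RULES:\nRETRY_POLICY: max 3 attempts\n"
        "TASK_STATES: pending, in_progress, done, failed",
        f"TASK: {task_id}\nTARGET_STATE: {target_state}\n"
        f"IN_SCOPE: {', '.join(in_scope)}\n"
        f"OUT_OF_SCOPE: {', '.join(out_of_scope)}",
        "RESPONSE: reply with [PLAN], then [CHANGES], then [VERIFICATION], "
        "each tag on its own line.",
    ])


def parse_response(text: str) -> dict[str, str]:
    """Split a reply into its sections; raise if one is missing or out of order."""
    positions = []
    for name in SECTIONS:
        match = re.search(rf"^\[{name}\][ \t]*$", text, flags=re.MULTILINE)
        if match is None:
            raise ValueError(f"missing [{name}] section")
        positions.append((match.start(), match.end()))
    starts = [start for start, _ in positions]
    if starts != sorted(starts):
        raise ValueError("sections out of order")
    result = {}
    for i, name in enumerate(SECTIONS):
        end = positions[i + 1][0] if i + 1 < len(SECTIONS) else len(text)
        result[name] = text[positions[i][1]:end].strip()
    return result
```

The commands under VERIFICATION are only a proposal from the model. The harness still decides which checks actually run.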

If your AI app keeps doing weird things in production, stop adjusting the prompt.

Build the task table. Write the dispatcher. Lock the retry policy. Draw the flowchart.

Prompts give you the intelligence. The harness keeps that intelligence from burning down your production environment. You can find the full breakdown at the original post.

Frequently Asked Questions

Q: What’s the idempotency problem that “catches most people”?

When agents retry tasks due to network drops or crashes, they re-execute the same operations without tracking “this already ran.” Result: duplicate API calls, emails, or database writes appearing weeks later. Build idempotency detection from day 1. It’s much harder to retrofit at 2am when you’re debugging phantom duplicates.
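
A minimal sketch of “this already ran” tracking, assuming a SQLite table keyed by an idempotency key. The table name, column names, and run_once_per_key are hypothetical. Note the limit stated in the comment: a local record narrows the duplicate window but does not close it for calls to external services.

```python
import sqlite3


def run_once_per_key(conn, key, operation):
    """Run `operation` only if `key` is not yet recorded; return (ran, result)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS idempotency "
        "(idem_key TEXT PRIMARY KEY, result TEXT)"
    )
    row = conn.execute(
        "SELECT result FROM idempotency WHERE idem_key = ?", (key,)
    ).fetchone()
    if row is not None:
        return False, row[0]  # already ran: hand back the stored result
    result = operation()
    # Caveat: a crash between the call above and the insert below can still
    # cause one repeat. For external side effects, also send the key to the
    # downstream service if it supports deduplication.
    with conn:
        conn.execute(
            "INSERT INTO idempotency (idem_key, result) VALUES (?, ?)",
            (key, result),
        )
    return True, result
```

A key such as "task-42:charge" ties the guard to one operation on one task, so a retry of task 42 gets the stored result back instead of charging again.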

Q: How do I know if my reliability problem is a prompt issue or a systems problem?

If your agent works in isolation but fails in production, it’s systems, not prompts. Look for: tasks retrying unpredictably, edge cases recurring every few weeks, or occasional duplicate side effects. These are classical reliability issues (state, retries, orchestration) that tweaking prompts won’t fix.

Q: Can enforcing code patterns in the harness actually catch repeated AI mistakes?

Yes. One team uses post-tool-use hooks to catch patterns they don’t like, such as imports scattered through a file instead of grouped at the top. This stopped offshore devs from making the same mistakes over and over, cut PR review time, and massively improved code quality. The harness acts as a deterministic gatekeeper.
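
As an illustration of the imports example, the check itself can be a few lines with Python’s standard ast module. This is a sketch of one possible check, not the hook that team uses. The function name late_imports is hypothetical, and it only looks at module-level statements, so imports inside functions are not flagged.

```python
import ast


def late_imports(source: str) -> list[int]:
    """Line numbers of module-level imports that appear after other code."""
    tree = ast.parse(source)
    seen_code = False
    offenders = []
    for index, node in enumerate(tree.body):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            if seen_code:
                offenders.append(node.lineno)
        elif (
            index == 0
            and isinstance(node, ast.Expr)
            and isinstance(node.value, ast.Constant)
            and isinstance(node.value.value, str)
        ):
            continue  # a leading module docstring does not count as code
        else:
            seen_code = True
    return offenders
```

A hook would run this on each file the agent just wrote and send the file back with the offending line numbers whenever the list is non-empty.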

Q: If the harness controls everything, doesn’t that limit the AI’s creativity?

No, it frees it. Let the LLM handle reasoning and judgment (what it’s actually good at). The harness handles boring, critical stuff: “Did this task already run?”, “Is output valid?”, “Retry or fail?” The AI shouldn’t decide its own retry logic or state transitions. That’s your job.

Stop trying to prompt-engineer your way out of architecture problems. You need a “Harness.”
by u/Exact_Pen_8973 in PromptEngineering
