Build Self-Improving AI Agents: The Codex Framework

OpenAI has published a new case study showing how it teamed up with Thrive and Crete to build a self-improving tax agent using Codex, OpenAI’s coding agent. According to OpenAI, the system automates filings, improves accuracy over time, and speeds up workflows that used to eat hours of manual work. What stands out is the “self-improving” part: the agent doesn’t just run a task once and stop. It learns from its own output and gets better.

Here’s a practical guide to the approach behind it, so you can apply the same pattern to your own domain.

Quick Start

What you’ll learn: how to build an AI agent that handles a repetitive, high-stakes workflow (tax filings in this case) and improves its own accuracy over time.
What you need: access to Codex or a comparable coding agent, a clearly defined workflow, sample data, and a way to measure correct vs. incorrect output.

Pick a workflow with clear right answers. Tax filing works as an agent target because outcomes are verifiable. A filing is either correct or it isn’t. This matters because self-improvement needs a feedback signal. If you can’t tell good output from bad, the agent has nothing to learn from. Start where “correct” is well defined.
Map the manual process before automating it. OpenAI, Thrive, and Crete didn’t hand the whole job to an agent on day one. Document each step a human takes, the inputs they pull, and the decisions they make. This gives Codex a concrete blueprint to follow and shows you where errors tend to creep in.
Let Codex write and run the agent logic. Codex is a coding agent, so it builds the tools the tax agent uses rather than only producing text responses. The agent generates and executes real code to process filings, and that is what lets it scale beyond a single chat session.
Build in verification at every stage. Accuracy was a headline result, and it didn’t happen by accident. Add checks that validate the agent’s work against known rules before anything is finalized. Treat verification as part of the agent, not an afterthought. For high-stakes tasks like taxes, an unverified answer is worse than no answer.
Close the feedback loop so the agent improves. This is the “self-improving” core. When the agent gets something wrong, feed that correction back in so future runs avoid the same mistake. OpenAI reports this is how accuracy climbed over time. Each cycle makes the next one stronger.
Keep a human in the loop on edge cases. Automation accelerates the routine work. People still handle the judgment calls. The setup frees experts from repetitive filing so they focus on the cases that actually need a human. Don’t aim to remove people. Aim to remove the drudgery.
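
The feedback-loop step above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system OpenAI, Thrive, and Crete built: the case study does not publish code, and every name here (Inputs, FeedbackLoop, logic_v1, logic_v2, FLAT_RATE) is hypothetical. The flat 20% rate is made up. The idea shown is one common way to close the loop: store each reviewer correction as a regression case, and accept new agent logic only if it passes every stored case.

```python
from dataclasses import dataclass, field
from typing import Callable

FLAT_RATE = 0.20  # made-up rate, for illustration only


@dataclass(frozen=True)
class Inputs:
    income: float
    deductions: float


# "Agent logic" is any function from inputs to a tax amount. In the setup
# the article describes, a coding agent would write it; here it is by hand.
Logic = Callable[[Inputs], float]


@dataclass
class FeedbackLoop:
    logic: Logic
    # Each correction is stored as (inputs, reviewer-confirmed answer).
    corrections: list[tuple[Inputs, float]] = field(default_factory=list)

    def record_correction(self, inputs: Inputs, correct_tax: float) -> None:
        """Keep a reviewer's fix so later versions are tested against it."""
        self.corrections.append((inputs, correct_tax))

    def failing_cases(self, candidate: Logic) -> list[Inputs]:
        """Return the recorded cases the candidate logic still gets wrong."""
        return [
            inputs
            for inputs, expected in self.corrections
            if abs(candidate(inputs) - expected) > 0.01
        ]

    def adopt(self, candidate: Logic) -> bool:
        """Switch to the candidate only if it passes every recorded case."""
        if self.failing_cases(candidate):
            return False
        self.logic = candidate
        return True


def logic_v1(i: Inputs) -> float:
    return i.income * FLAT_RATE  # deliberate bug: ignores deductions


def logic_v2(i: Inputs) -> float:
    return max(i.income - i.deductions, 0.0) * FLAT_RATE


loop = FeedbackLoop(logic=logic_v1)
case = Inputs(income=50_000.0, deductions=10_000.0)
loop.record_correction(case, 8_000.0)  # a reviewer supplied the right answer

rejected = loop.adopt(logic_v1)  # False: still fails the recorded case
accepted = loop.adopt(logic_v2)  # True: passes every recorded case
```

Storing corrections as test cases rather than as free-form notes means a past mistake cannot silently return: any rewrite of the logic has to reproduce every answer a human has already confirmed.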

Why this matters

This is significant because it’s a working template for agents in regulated, detail-heavy fields, not a demo. Tax, accounting, compliance, and legal work all share the same shape: repetitive, rule-bound, and punishing on errors. An agent that verifies its own work and learns from mistakes fits those fields better than a one-shot chatbot ever could.

It also signals where the agent market is heading. The first wave of AI tools answered questions. This wave does the job, checks itself, and improves. Because Codex does the building, the technical bar to ship one of these is dropping fast.

Next steps

Audit your own workflows for one that’s repetitive and has verifiable outcomes. That’s your best first candidate.
Start small. Automate one stage with full verification before expanding scope.
Decide upfront how you’ll capture corrections, since that feedback loop is what separates a static tool from a self-improving one.
Set clear handoff rules for when the agent should escalate to a person.
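
Handoff rules are easier to audit when they are written down as code instead of left to judgment at run time. The sketch below is one hypothetical way to do that, assuming a draft filing with three numeric fields and a flag for inputs the agent does not recognize. The checks, the field names, and the Route and Draft types are illustrative assumptions, not real tax rules and not taken from the case study.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    FINALIZE = "finalize"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class Draft:
    income: float
    deductions: float
    tax_due: float
    has_unrecognized_form: bool = False  # hypothetical edge-case flag


def failed_checks(d: Draft) -> list[str]:
    """Sanity checks run before anything is finalized (made-up rules)."""
    problems = []
    if min(d.income, d.deductions, d.tax_due) < 0:
        problems.append("negative amount")
    if d.deductions > d.income:
        problems.append("deductions exceed income")
    if d.tax_due > d.income:
        problems.append("tax due exceeds income")
    return problems


def route(d: Draft) -> tuple[Route, list[str]]:
    """Finalize only when nothing is flagged; otherwise hand off with reasons."""
    reasons = failed_checks(d)
    if d.has_unrecognized_form:
        reasons.append("unrecognized form")
    return (Route.ESCALATE if reasons else Route.FINALIZE), reasons


clean = route(Draft(income=50_000.0, deductions=10_000.0, tax_due=8_000.0))
odd = route(Draft(income=50_000.0, deductions=60_000.0, tax_due=0.0))
```

Returning the reasons alongside the decision gives the reviewer a starting point, and the same reasons can be logged as the corrections that feed later improvement.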

The full case study, including how Thrive and Crete deployed the system, is available from OpenAI.
