He Built 30 AI Managers That Actually Disagree With Each Other. Here’s the Architecture.

Yesterday a sharp project landed on r/PromptEngineering and I almost scrolled past it. Someone built a full baseball simulation called Deep Dugout where Claude manages all 30 MLB teams from the dugout. The creator, u/yesdeleon, ran 100 complete AI-managed games for $17.44 total in API costs, with each team’s manager making genuinely different decisions under pressure.

That last part is the hard problem. And the solution is worth understanding.

The Twist: A Manager’s Philosophy Doesn’t Actually Change Behavior

Early versions of Deep Dugout had 30 distinct manager personalities with different voices and worldviews. Data-driven optimizers. Old-school gut-feel veterans. Analytics nerds versus baseball lifers. The problem? They all made the same decisions. The game state information, score, inning, pitch counts, leverage index, was so dominant that personality became decoration.

What actually drove divergence was something more specific: the decision framework section of each prompt. Concrete heuristics. “You pull starters early.” “You ride your guys.” “You never use your closer outside a save situation.” These specific rules gave the AI something to anchor on against the weight of game state data. Philosophy sets a worldview. Decision frameworks set behavior.

That insight transfers to any multi-agent system you’re building.

How the Prompt Architecture Works

Each of the 30 manager personalities is about 800 words, split into three sections:

  • Philosophy: the manager’s worldview (data-first optimizer vs. trust-your-guys veteran)
  • Decision framework: concrete heuristics for specific situations, pitch count thresholds, leverage index cutoffs, platoon matchup rules, closer usage policies
  • Voice: how they explain their choices and reasoning to the audience

A shared _response_format.md file gets appended to all 30 personalities. This enforces consistent JSON output across every manager (action, reasoning, confidence level, alternatives considered) without touching how each manager thinks. The output structure is shared. The reasoning is not.

At each decision point, the AI receives the full game state and returns structured JSON. Confidence levels were never specified in any of the prompts. They emerged anyway: a manager facing bases loaded in the 9th naturally reports 40% confidence, while the same manager handling a clean third inning logs 95%. Design the right field and the behavior grows from it.

🔑 Pro Tips From the Build

Use a query gate to protect your budget. The system only calls Claude when the leverage index hits 1.5 or higher, or when pitch counts climb into trouble territory. Routine situations use a rule-based fallback that runs silently. This dropped API calls from roughly 150 per game to 20-30, making 100-game validation runs practical on a real budget. The original estimate was $200. Total spend was $17.44.

Prompt caching reverses the usual tradeoff. The system prompt is about 1,500 tokens of personality data plus full roster context. With Anthropic’s cache control, subsequent calls to the same cached system prompt cost 90% less on input tokens. The author made prompts longer and richer as a result, because the marginal cost of additional context dropped to nearly nothing after the first call. This is the opposite of the standard “keep prompts short” instinct, and it’s the right call when caching is part of your setup.

Graceful degradation is a prompt engineering problem, not a fallback problem. Every API call falls back to a rule-based manager on parse failure. Getting those fallbacks down from near-100% in early development to under 2 per game required iterating on the response format itself: removing contradictions in the instructions (the prompt told the AI “no code fences” while showing examples inside code fences), adding inline format examples for edge cases, and tightening the JSON schema description. Your fallback rate is a direct, honest measurement of how well your output format is actually specified.

Results Across 100 Games

  • 28.3 API calls per game (down from ~150 without the query gate)
  • 87.8% average confidence across all AI decisions
  • 1.87 fallbacks per game (down from near-100% in early versions)
  • K rate, BB rate, and HR rate all within real MLB benchmark ranges
  • Total cost for 100 AI-managed games: $17.44

What’s Coming

All 30 personality prompts, the full response format spec, and the complete system are going open source soon. If you’re working on multi-agent systems, structured output formatting, or prompt architecture for agents that need to behave differently from one another, this repository will be worth digging into when it drops. The author is active in the comments and happy to answer questions. Find the full breakdown in the original r/PromptEngineering thread.

💬 What would you apply this kind of structured personality architecture to? The baseball context is fun, but the pattern works anywhere you need agents to make genuinely different decisions from the same inputs.

Frequently Asked Questions

Q: If the AI manages the team automatically, what makes this fun or useful?

This isn’t designed as a traditional strategy game, it’s a research tool for studying how AI personalities make decisions under pressure. By simulating 30 teams with distinct decision frameworks, you can observe how different “management styles” (data-driven vs. intuitive, cautious vs. aggressive) handle identical game scenarios. The value is in the system itself, not player interaction.

Q: What’s actually the difference between “personality” and “decision framework”? Doesn’t personality control behavior?

Personality (philosophy + voice) shapes the tone and worldview, but decision framework is what drives different decisions. Early versions had distinct personalities but made identical choices because the game state overwhelmed them. Adding concrete heuristics, like “pull starters early” vs. “ride your guys”, creates real behavioral divergence. Voice is about how they explain themselves; decision framework is about what they actually do.

Q: How does reducing API calls from 150 to 20, 30 per game actually work?

A smart query gate only consults the AI on high-leverage moments (leverage index ≥ 1.5, high pitch counts, tense situations). Routine decisions early in low-scoring games run on pre-defined heuristics instead. This keeps the AI where personality matters most and cuts costs dramatically without sacrificing decision quality.

Q: Can you apply this personality + decision framework approach to other AI projects?

Yes, this pattern transfers to any domain needing distinct AI behaviors: content moderation, customer support, investment advice, etc. The key insight is that voice/personality alone wastes tokens on surface variation. Decoupling output format from behavioral heuristics lets you create meaningful decision divergence without API bloat.

Designing 30 distinct AI personalities that make measurably different decisions under pressure
by u/yesdeleon in PromptEngineering

Scroll to Top