IBM’s New Coding Agent Refuses to Guess. That’s the Whole Point.

IBM Bob dropped this week. It’s an AI coding agent, and before you scroll past, hear me out, because the positioning is actually interesting. Almost every AI coding tool is racing to be faster, more autonomous, more willing to just go edit your codebase. GitHub Copilot autocompletes before you finish the thought. Cursor wants to refactor your entire file in one shot. Claude Code will write whole modules from a single description. IBM Bob is going in the opposite direction.

It targets legacy enterprise stacks: Java, RPG, IBM i environments, compliance-heavy workflows, terminal-based dev setups. We’re talking about systems written before most AI researchers were born, still processing billions of dollars in transactions every single day, maintained by teams who cannot afford surprises. That sounds like a niche nobody wants to be in. Maybe. But it’s also a niche where “just let the agent edit stuff” is genuinely dangerous. A misplaced semicolon in a modern Node app breaks a unit test. The same kind of mistake in an RPG program running bank clearing can break an entire institution’s end-of-day batch. The failure modes are not comparable, and most AI coding tools have quietly been ignored in enterprise environments precisely because they treat the two scenarios the same way.

The twist: IBM Bob reportedly refuses to make up answers about legacy op-codes. Ask it about a fake RPG op-code and it says it doesn’t know. For modern web dev, that’s just baseline behavior. For legacy enterprise code where one hallucinated command can break a system running payroll for thousands of employees, that’s actually a meaningful claim. Think about what this looks like in practice: you’re working on a COBOL or RPG module that processes insurance claims, and you ask about a specific operation code the model doesn’t recognize. A normal LLM would guess confidently and write syntactically plausible code that subtly corrupts the logic. IBM Bob is supposed to stop and say it doesn’t know. That single design decision is the entire product thesis.
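To make the "refuse to guess" idea concrete, here is a minimal Python sketch of abstaining on an unknown op-code. It is not IBM Bob's implementation, which the post doesn't describe. The `KNOWN_OPCODES` table and the `explain_opcode` helper are hypothetical names, and the table is a tiny, deliberately incomplete sample of real RPG op-codes.

```python
# Hypothetical illustration only; this is NOT how IBM Bob works internally.
# A small, deliberately incomplete table of real RPG operation codes.
KNOWN_OPCODES = {
    "CHAIN": "Random retrieval of a record from a file by key or relative record number.",
    "SETLL": "Set lower limit: position a file at the first record whose key is >= the search argument.",
    "READE": "Read the next record whose key equals the search argument.",
    "EXSR": "Execute a subroutine.",
}


def explain_opcode(name: str) -> str:
    """Describe an op-code if it is in the table; otherwise say so instead of guessing."""
    key = name.strip().upper()
    if key in KNOWN_OPCODES:
        return f"{key}: {KNOWN_OPCODES[key]}"
    # The important branch: no plausible-sounding invention, just an explicit "don't know".
    return f"I don't know the op-code {key!r}."
```

The point of the sketch is the last line: an unknown name produces an explicit refusal rather than a confident-sounding description.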

The mode separation is where it gets practical:

  • 🔍 Ask Mode: read-only; understand the codebase without changing anything. Use this first when you inherit a legacy system and have no idea what half the functions actually do. It’s pure documentation and exploration, nothing gets touched, no risk created.
  • 📋 Plan Mode: review a plan before any code changes happen. The agent lays out exactly what it intends to do, which files it will touch, and in what order. You approve or reject before a single line moves.
  • ⚙️ Code Mode: actual implementation, once the plan is approved. Changes happen with your explicit sign-off, not autonomously. You know what’s coming because you already reviewed the plan in the step before.
  • 🤖 Orchestrator Mode: more agentic, multi-step workflows for complex tasks. This is the one that requires the most trust, and IBM Bob’s claim is that it won’t hallucinate even here. Think larger refactoring jobs that span multiple files and modules.
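If the modes are real controls rather than labels, the gating logic might look something like the sketch below. Everything here is an assumption for illustration: `Mode`, `Session`, and `ModeError` are hypothetical names, files are just an in-memory dict, and Orchestrator Mode is left out to keep it short. It does not describe how IBM Bob actually enforces anything.

```python
from enum import Enum, auto


class Mode(Enum):
    ASK = auto()   # read-only exploration
    PLAN = auto()  # propose changes, touch nothing
    CODE = auto()  # apply changes that were approved


class ModeError(PermissionError):
    """Raised when an action is not allowed in the current mode."""


class Session:
    """Hypothetical session that enforces the Ask -> Plan -> Code flow."""

    def __init__(self, files):
        self.files = dict(files)  # path -> contents (in-memory stand-in)
        self.mode = Mode.ASK
        self.proposed = []        # paths the plan says it will touch
        self.approved = False

    def read(self, path):
        # Reading is allowed in every mode.
        return self.files[path]

    def propose(self, paths):
        if self.mode is not Mode.PLAN:
            raise ModeError("plans can only be proposed in Plan Mode")
        self.proposed = list(paths)
        self.approved = False     # a new plan always needs a fresh sign-off

    def approve(self):
        # Called by the human reviewer, never by the agent.
        if not self.proposed:
            raise ModeError("there is no plan to approve")
        self.approved = True

    def write(self, path, text):
        if self.mode is not Mode.CODE:
            raise ModeError("writes are only allowed in Code Mode")
        if not self.approved or path not in self.proposed:
            raise ModeError(f"{path} is not part of an approved plan")
        self.files[path] = text
```

The design choice worth noticing is that `write` checks both the mode and the approved plan, so a file that never appeared in the plan cannot be edited even in Code Mode.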

That layered structure sounds boring for a side project. It’s exactly the right call when your codebase touches compliance workflows and nobody can afford a surprise edit. The ability to stay in Ask Mode indefinitely, just learning the system with zero risk, is a genuine value-add for any team that inherited decades of undocumented legacy code and has been afraid to touch it.

Pro tip: Test it against real legacy code, not toy examples. IBM i and AS/400 systems have specific quirks around file handling, data structures, and op-code behavior that only surface in actual production files, not in the cleaned-up demos IBM will show you. Pull a real module from your codebase, something that currently runs in production, and probe it on the edge cases. Ask it about less common op-codes, niche file access methods, anything that would tempt a generic LLM to guess.
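One way to structure that probing is a small scoring harness, sketched below in Python. The `ask` callable is a hypothetical stand-in for however you send a question to the agent under test and get text back (the post doesn't describe a Bob API). The abstention check is crude phrase matching, which is an assumption you should tune against the replies you actually see.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Phrases treated as "the agent declined to answer". This list is an assumption;
# adjust it to the wording your tool actually uses.
ABSTAIN_MARKERS = ("don't know", "do not know", "don't recognize", "do not recognize")


def abstained(reply: str) -> bool:
    """Return True if the reply looks like an explicit 'I don't know'."""
    text = reply.lower().replace("’", "'")  # normalize curly apostrophes
    return any(marker in text for marker in ABSTAIN_MARKERS)


def score_probes(
    ask: Callable[[str], str],
    probes: Iterable[Tuple[str, bool]],
) -> Dict[str, List[str]]:
    """Ask about each op-code; probes are (name, is_real) pairs you label yourself."""
    fabricated = []    # fake op-codes the agent described anyway
    refused_real = []  # real op-codes the agent refused to explain
    for opcode, is_real in probes:
        reply = ask(f"What does the RPG op-code {opcode} do?")
        if not is_real and not abstained(reply):
            fabricated.append(opcode)
        elif is_real and abstained(reply):
            refused_real.append(opcode)
    return {"fabricated": fabricated, "refused_real": refused_real}
```

Feed it a mix of real op-codes from your own production modules and a few names you invented, then read both lists. An agent that refuses everything is not much more useful than one that guesses everything.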

There’s a 30-day trial with 40 Bobcoins right now. That’s your window to find out if the anti-hallucination claims actually hold up under production conditions rather than controlled demos. One real flag before you go all in: the 45% productivity gain number is self-reported. IBM’s own team ran the test and published the result. It’s not independent and not peer-reviewed, so treat it as directional at best and verify with your own benchmarks.

Prompt-injection concerns are also already circulating in the security community, which matters a lot for a tool specifically designed to sit inside sensitive enterprise environments. Do not deploy this anywhere critical before your security team has reviewed the architecture. That step is not optional.

🚀 If your job involves maintaining old enterprise Java or RPG code, this is the first AI tool actually built around your constraints. The trial costs nothing but time.

Frequently Asked Questions

Q: How is IBM Bob different from Copilot if mode separation isn’t new?

Good question. Mode separation is already in Copilot, so Bob’s real differentiator is the enterprise focus. It’s built specifically around legacy system workflows (RPG, IBM i, Java modernization) and includes hallucination controls that matter more on older stacks, where you can’t just look up an API or op-code online.

Q: Are the mode boundaries (Ask/Plan/Code) real technical controls or just UI labels?

The post doesn’t spell this out, which is exactly why it’s worth investigating. The big question for enterprise teams is whether the modes enforce actual permissions and file access restrictions, or whether they’re just workflow prompts. Before buying in, ask IBM directly how they prevent an agent in Plan Mode from accidentally executing changes.

Q: Does IBM Bob actually avoid hallucinating on legacy systems like RPG?

This is the strongest claim in the pitch. Modern web dev tools hallucinate fake APIs constantly, but on legacy systems where op-codes are obscure and don’t have online docs, not inventing answers matters way more. The 30-day trial comes with 40 Bobcoins, so it’s worth testing on real legacy code instead of toy examples to see if this holds up.

Q: Are the 45% productivity claims believable?

Self-reported metrics deserve skepticism, especially without independent testing. The real value test is whether it handles your actual legacy Java/RPG stack, not marketing benchmarks. If you’re in that world, the trial is cheap enough to validate against your real code.
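If you do run your own benchmark, pin down what "productivity gain" means first, because the post doesn’t say how IBM defined the 45%. The Python sketch below assumes you time the same kind of task with and without the tool and compare medians. The `productivity_gains` helper and both definitions are assumptions for illustration.

```python
from statistics import median


def productivity_gains(baseline_minutes, assisted_minutes):
    """Return (time_reduction, throughput_gain) as fractions, based on median task time.

    time_reduction  = how much less time a task takes with the tool
    throughput_gain = how many more tasks fit in the same amount of time
    """
    t0 = median(baseline_minutes)   # median minutes per task without the tool
    t1 = median(assisted_minutes)   # median minutes per task with the tool
    time_reduction = (t0 - t1) / t0
    throughput_gain = t0 / t1 - 1
    return time_reduction, throughput_gain


# Placeholder timings, made up purely to show the call; use your own measurements.
reduction, gain = productivity_gains([120, 90, 150, 110], [80, 70, 100, 75])
```

The two numbers are not interchangeable. A 45% throughput gain works out to only about a 31% reduction in time per task (1 - 1/1.45), so check which one a vendor figure refers to before comparing it with yours.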

Q: Should we be concerned about security with AI agents in sensitive environments?

Yes. The post calls out prompt-injection risks specifically, and any agent that touches production enterprise systems deserves a serious security review before deployment. Mode separation helps, but you’ll want to understand their security model and test it in a sandbox first.

IBM’s new AI coding agent is weirdly focused on legacy stacks, and that might actually be the point
by u/Exact_Pen_8973 in PromptEngineering
