AI has a Step 2 problem, and a recent MIT Tech Review piece nails it with a South Park reference. The Underpants Gnomes meme, where mysterious creatures pitch “Phase 1: Collect underpants. Phase 2: ? Phase 3: Profit” with a blank where the middle should be, has become the perfect lens for the AI industry right now. According to MIT Tech Review, companies have built the tech (Step 1) and promised transformation (Step 3), but the path connecting them is still mostly hand-waving.
This isn’t just a cute analogy. It’s the central tension of the entire AI buildout, and it’s why the gap between investor decks and workplace reality keeps widening.
What the studies actually say
MIT Tech Review highlights two pieces of research that sit on opposite ends of the spectrum.
- Anthropic’s job-impact study predicted which roles LLMs will reshape first. Managers, architects, and media workers should brace for change. Groundskeepers, construction workers, and hospitality staff probably won’t notice much. The catch: these predictions are educated guesses based on what LLMs seem good at, not on how they perform when real work is on the line.
- Mercor’s February study went the other direction. Researchers tested top-tier agents from OpenAI, Anthropic, and Google DeepMind on 480 real workplace tasks pulled from banking, consulting, and law. Every agent failed to complete most of its duties.
One study projects transformation. The other shows the agents tripping over the basics. Both can be true at once, which is exactly the problem.
Why the disagreement is so loud
MIT Tech Review points to a simple but uncomfortable fact: who’s making the claim matters. Anthropic has skin in the game when it predicts white-collar disruption. So does OpenAI when its chief scientist Jakub Pachocki calls AI an “economically transformative technology.”
There’s a deeper bias baked into the bull case. Most of the loudest “something big is coming” arguments lean heavily on how fast AI coding tools are improving. Coding is a clean domain. Inputs and outputs are well-defined, tests are automatic, and feedback loops are tight. Strategic judgment isn’t like that. Other studies cited by MIT Tech Review show LLMs struggle exactly where ambiguity lives, which is where most actual knowledge work happens.
That’s the rub. Extrapolating from coding gains to “AI will replace consultants” skips a few floors of the building.
What this means for practitioners
This matters now because capex is being committed on the assumption that Step 2 will figure itself out. It might. It also might not, and the people writing checks aren’t the ones doing the 480 tasks.
A few practical takeaways:
- Pilot before you platform. Test agents on your actual workflows, not on benchmarks. Mercor’s results suggest the gap between demo and deployment is still huge.
- Watch the source of the claim. A vendor predicting massive disruption is selling something. A field study showing agents failing most of their assigned tasks is harder to ignore.
- Pick coding-shaped problems first. Anywhere you can write tests and verify outputs cheaply, AI is already pulling weight. Anywhere you can’t, expect more friction than the keynote suggested.
- Budget for human-in-the-loop. The honest middle ground for the next 18 months looks like augmentation, not replacement. Plan staffing accordingly.
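To make “pilot before you platform” concrete, here is a minimal sketch of a pilot harness in Python. Everything in it is an assumption for illustration: the `Task` class, the `run_pilot` function, the `stub_agent` stand-in, and the two sample tasks are hypothetical and are not taken from the MIT Tech Review piece or the Mercor study. The idea is simply to pair each of your own workflow tasks with a cheap automatic check and measure how often the agent passes.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    """One workflow task plus a cheap, automatic check on the agent's output."""
    name: str
    prompt: str
    check: Callable[[str], bool]


def run_pilot(agent: Callable[[str], str], tasks: list[Task]) -> dict:
    """Run every task through the agent and record pass/fail per task."""
    results = {}
    for task in tasks:
        try:
            output = agent(task.prompt)
            results[task.name] = bool(task.check(output))
        except Exception:
            # A crash counts as a failed task, not a skipped one.
            results[task.name] = False
    passed = sum(results.values())
    pass_rate = passed / len(tasks) if tasks else 0.0
    return {"results": results, "pass_rate": pass_rate}


def stub_agent(prompt: str) -> str:
    """Hypothetical stand-in for a real agent call; swap in your own client."""
    if "sum" in prompt:
        return "42"
    return "I am not sure."


# Two made-up tasks: one coding-shaped (easy to verify), one ambiguous.
tasks = [
    Task(
        name="invoice_total",
        prompt="Return the sum of 40 and 2 as a bare number.",
        check=lambda out: out.strip() == "42",
    ),
    Task(
        name="contract_clause",
        prompt="Name the governing-law clause in the attached contract.",
        check=lambda out: "clause" in out.lower(),
    ),
]

report = run_pilot(stub_agent, tasks)
print(report["results"])
print(f"pass rate: {report['pass_rate']:.0%}")
```

The tasks that fail the check are the ones to route to a human reviewer, which is where the human-in-the-loop budget comes in. Tasks where you cannot write a check at all are a signal that the problem is not coding-shaped yet.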
The regulation angle
MIT Tech Review notes that activist group Pause AI sees Step 2 as regulation. AI boosters tend to skip past it entirely, eyes locked on the sunny uplands. That standoff isn’t going to resolve cleanly. What’s more likely: a patchwork of sector rules, audit requirements, and liability frameworks that fill in the blank one industry at a time.
What stands out here is that the AI industry has spent two years selling Step 3 without ever building a credible Step 2. The companies that quietly do the boring middle work (integration, evaluation, governance, retraining) are the ones that will actually collect the profit. Everyone else is just a gnome with a pitch deck.
For the full breakdown, head to the original MIT Tech Review piece.