40 Claude Prompt Codes Tested Blind. Only 7 Actually Shift Reasoning.

33 out of 40 “secret” Claude prompt codes are tone changes, not reasoning changes. Somebody finally tested them properly, and the results are not what the prompt-hacking crowd wants to hear.

What the study actually did

u/AIMadesy on r/PromptEngineering spent three months running blind A/B tests on 40 codes that circulate on Reddit and Twitter. Fresh context per run, fixed task batteries across coding, analysis, and writing, blind ordering between test and rating, 12 to 20 runs per code. Task batteries covered things like finding logical errors in business proposals, debugging broken Python functions, and identifying unstated assumptions in strategy questions. The blind ordering matters because raters who know which prompt is the “powered-up” version consistently score it higher, regardless of actual output quality.
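
To make the blind-ordering step concrete, here is a minimal sketch of how a pair of outputs could be shuffled before rating. It is not the author's actual harness: `blind_pair`, `unblind`, the A/B labels, and the sample scores are all hypothetical names and values chosen for illustration.

```python
import random

def blind_pair(baseline_output, coded_output, rng):
    """Return the two outputs in random order plus a hidden key.

    The rater sees only "A" and "B"; the key stays sealed until scoring is done.
    """
    items = [("baseline", baseline_output), ("coded", coded_output)]
    rng.shuffle(items)
    shown = {"A": items[0][1], "B": items[1][1]}
    key = {"A": items[0][0], "B": items[1][0]}
    return shown, key

def unblind(ratings, key):
    """Map the rater's A/B scores back to baseline/coded after rating."""
    return {key[label]: score for label, score in ratings.items()}

rng = random.Random(0)  # fixed seed so the ordering is reproducible
shown, key = blind_pair("plain answer", "answer with code", rng)
ratings = {"A": 3, "B": 4}  # hypothetical scores, given without seeing the key
result = unblind(ratings, key)
```

The point of the design is that `key` never reaches the rater, so a score cannot be nudged by knowing which output came from the "powered-up" prompt.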

The codes included /skeptic, L99, GODMODE, ULTRATHINK, “act as a senior engineer,” and 35 others. Only 7 cleared the bar for measurable reasoning change. The other 33 changed how Claude sounds. Not what it thinks.

The 7 codes with real signal

  • /skeptic caught wrong premises in 79% of “should I do X” tests, vs 14% baseline. The biggest error-catching delta in the dataset. What it actually does is reframe the task from answering the question to first interrogating whether the question is well-formed.
  • L99 committed to a single answer 11 out of 12 times, vs 2 out of 12 baseline. Good when you need a decision, not a hedged breakdown. Particularly effective for go/no-go calls where a list of “considerations” is useless.
  • ULTRATHINK raised debugging correctness to 87.5% vs 62.5% baseline. The catch: 3.2x token cost. Not something you use on every prompt. Reserve it for problems where wrong answers have real consequences.
  • /blindspots, /crit, /deep, /premortem showed smaller but real effects on reasoning depth and error-catching. /premortem in particular surfaced failure modes that standard analysis consistently missed, by forcing the model to assume the plan already failed and work backwards.
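
For a side-by-side view of the three codes with published numbers, the gaps can be worked out in percentage points. This sketch uses only the figures quoted above; the `reported` dict and `delta_points` helper are illustrative names, not part of the study.

```python
# Figures quoted in the write-up, as (with code, baseline) fractions.
reported = {
    "/skeptic": (0.79, 0.14),      # wrong-premise catch rate
    "L99": (11 / 12, 2 / 12),      # committed to a single answer
    "ULTRATHINK": (0.875, 0.625),  # debugging correctness
}

def delta_points(with_code, baseline):
    """Absolute difference in percentage points, rounded to one decimal."""
    return round((with_code - baseline) * 100, 1)

deltas = {name: delta_points(*rates) for name, rates in reported.items()}

for name, points in deltas.items():
    print(f"{name}: +{points} points over baseline")
```

Note that L99's 11/12 vs 2/12 works out to a larger raw gap than /skeptic's, but it measures how often the model commits, not how often it is right, so the two numbers are not directly comparable.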

The placebo hall of fame

These ones sounded powerful. Measured like noise.

  • GODMODE, BEASTMODE, OVERRIDE: Claude sounds more assertive. The reasoning underneath is the same.
  • “You are an expert in X” / “Act as senior engineer”: tone shift, not judgment shift. It changes the register, not the thinking. The model was already drawing on the same knowledge. You just changed the confidence of the delivery, not the quality of the analysis.
  • “Take a deep breath, think step by step”: used to work on older models. Claude 4.x already does stepwise reasoning by default. Now it just adds tokens.
  • Most jailbreak variants: 4.x alignment is robust enough that these mostly add length and nothing else. Several of them actually degraded output quality slightly by pushing the model toward verbose self-justification.

3 ways to apply this today

  1. Decision audits. Add /skeptic to any “should we do X” question. In the study’s tests, that one word took premise-catching from 14% to 79%. No other changes needed. This is the highest-leverage edit you can make to an existing prompt library without rewriting anything.
  2. High-stakes debugging. Use ULTRATHINK when correctness matters more than cost. Critical bugs, security reviews, architecture decisions. Skip it for everyday tasks. The 3.2x token cost is a real budget consideration, but when a wrong answer means a production incident, the math flips fast.
  3. Team prompt libraries. If you’re standardizing prompts across a team, build around the 7. Strip out the magic-word stuff. It reduces confusion, cuts token spend, and stops the “but I heard GODMODE works” arguments in team Slack channels. A library grounded in tested codes is easier to defend and easier to maintain.
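
How fast does the ULTRATHINK math flip? A rough break-even sketch, using the reported 62.5% and 87.5% correctness and the 3.2x token cost. The dollar amounts are hypothetical, and the model assumes a wrong answer has one fixed cost, which is a simplification.

```python
BASELINE_CORRECT = 0.625  # reported debugging correctness without the code
ULTRA_CORRECT = 0.875     # reported debugging correctness with ULTRATHINK
TOKEN_MULTIPLIER = 3.2    # reported token cost relative to baseline

def expected_cost(prompt_cost, correct_rate, wrong_answer_cost):
    """Token spend plus the expected cost of acting on a wrong answer."""
    return prompt_cost + (1 - correct_rate) * wrong_answer_cost

def break_even_wrong_cost(prompt_cost):
    """Cost of a wrong answer above which ULTRATHINK is the cheaper choice."""
    extra_tokens = (TOKEN_MULTIPLIER - 1) * prompt_cost
    errors_avoided = ULTRA_CORRECT - BASELINE_CORRECT
    return extra_tokens / errors_avoided

# Hypothetical numbers: a $0.05 baseline prompt, $50 per wrong answer.
prompt_cost = 0.05
wrong_answer_cost = 50.0
plain = expected_cost(prompt_cost, BASELINE_CORRECT, wrong_answer_cost)
ultra = expected_cost(prompt_cost * TOKEN_MULTIPLIER, ULTRA_CORRECT, wrong_answer_cost)
threshold = break_even_wrong_cost(prompt_cost)
```

Under these assumptions the break-even point is a wrong answer costing about 8.8 times the baseline prompt. Anything cheaper to get wrong than that, and the extra tokens are not paying for themselves.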

Tips and pitfalls

The study has honest limitations: single rater, 12 to 20 runs per code, and models drift. These numbers were gathered on Opus 4.6, Sonnet 4.5, and Haiku 4.5 as of March 2026. The exact percentages will shift as models update. A code that does nothing today could matter after a fine-tune, and one that works now could get baked in as default behavior and stop adding signal. Treat this as a current snapshot, not a permanent rulebook.
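
To see what 12 runs can and cannot tell you, here is a sketch that puts a 95% Wilson score interval around the one result reported as raw counts (L99: 11 of 12 vs 2 of 12). The Wilson interval is a choice made here for illustration, not something the study reports using.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a proportion (z=1.96)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

coded_low, coded_high = wilson_interval(11, 12)  # L99 committed 11 of 12 times
base_low, base_high = wilson_interval(2, 12)     # baseline committed 2 of 12 times

print(f"L99:      {coded_low:.0%} to {coded_high:.0%}")
print(f"baseline: {base_low:.0%} to {base_high:.0%}")
```

The two intervals do not overlap, which is consistent with the effect being real, but each spans tens of percentage points. That is the practical meaning of "12 to 20 runs per code": the direction is believable, the exact percentages much less so.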

The bigger trap is confusing tone for reasoning. A prompt that makes Claude sound more confident is not the same as a prompt that makes Claude think better. One is cosmetic. The other is actually useful. Most “secret prompts” are in the first category. The real question to ask when evaluating any code is: does this change what the model checks for, or does it just change how it presents what it already found?

Prompt of the Day

Try this on your next strategic decision:

Your question here. /skeptic

That’s it. One word appended. In the study’s tests, that took premise-catching from 14% to 79%. If Claude is about to agree with a flawed assumption, this is what catches it. Run the same question with and without /skeptic on something you’re currently deciding. The difference is usually obvious in the first paragraph of the response.
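
If you want to script the with-and-without comparison, a small helper can build both versions of the prompt. This is a sketch only: `with_code`, `ab_prompts`, and the sample question are hypothetical, and actually sending the prompts to a model is left out.

```python
def with_code(question, code="/skeptic"):
    """Append a prompt code to a question, separated by a single space."""
    return f"{question.strip()} {code}"

def ab_prompts(question, code="/skeptic"):
    """Build the baseline and coded versions of the same question."""
    return {"baseline": question.strip(), "coded": with_code(question, code)}

prompts = ab_prompts("Should we migrate our billing service to a new database this quarter?")
```

Send each version in its own fresh conversation, the way the study did, so the first answer cannot leak into the second.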

What to do next

Pull up the prompt templates you use most. Flag every “magic word” or code in them. Check it against the 7 that tested as real. If it’s not on the list, treat it as a style choice, not a reasoning boost.
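
That audit can be roughed out in a few lines. The sketch below checks a template against the 7 codes that tested as real and the three placebo codes named above; the `audit` function, the sample template, and the made-up `/turbo` code are hypothetical, and the simple token matching will miss multi-word phrases like “act as a senior engineer”.

```python
# The 7 codes reported as shifting reasoning, and the named tone-only codes.
TESTED = {"/skeptic", "L99", "ULTRATHINK", "/blindspots", "/crit", "/deep", "/premortem"}
TONE_ONLY = {"GODMODE", "BEASTMODE", "OVERRIDE"}

def audit(template):
    """Sort the codes found in a prompt template into three buckets."""
    # Split on whitespace and trim surrounding punctuation from each token.
    tokens = {tok.strip(".,;:!?\"'()") for tok in template.split()}
    # Any other slash-word is flagged as an untested code.
    slash_codes = {t for t in tokens if t.startswith("/") and t[1:].isalpha()}
    return {
        "reasoning": sorted(tokens & TESTED),
        "tone_only": sorted(tokens & TONE_ONLY),
        "untested": sorted(slash_codes - TESTED),
    }

report = audit("GODMODE. Review this launch plan and list the risks. /premortem /turbo")
```

Anything that lands in `tone_only` or `untested` is a candidate for removal, or at least for a note that it is there for style.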

The full methodology and per-code numbers are in the original gist linked in the Reddit thread. The author is also offering to send task batteries to anyone who wants to run an independent replication. Worth doing if your team is serious about this.

Frequently Asked Questions

Q: When should I use a prompt code instead of just being clear about what I want?

Prompt codes work best when you need Claude to adopt a specific reasoning mode for a particular task, like /skeptic for catching flawed assumptions or ULTRATHINK for debugging. But for most requests, explicit context and a clear ask outweigh fancy phrasing. Start by being specific about your actual task, then layer in a code only if you notice Claude isn’t thinking the way you need.

Q: Why did older prompt codes like “take a deep breath, think step by step” stop working?

Those phrases helped older models that didn’t reason step by step by default. Claude 4.x does that on its own now, so the phrase just adds tokens without changing the reasoning. This reveals a bigger pattern: codes that compensate for a model’s weakness become obsolete once the model improves. Information completeness (what context you provide) matters far more than exact phrasing.

Q: If 33 out of 40 codes just change tone, are they worth using?

Not for reasoning power, but they can be useful style tools. If you need Claude to sound more confident or less hedgy, a tone-shifting code saves you from writing long preambles. Just be honest about what you’re getting: a style change, not a thinking unlock. The 7 codes that measurably shift reasoning (like /skeptic) are the real wins.

Q: Which of the 7 working codes should I try first?

Start with /skeptic if you’re asking “should I do X.” In the study it caught bad premises 79% of the time vs. 14% baseline, which is massive for decision-making. For debugging code, ULTRATHINK hit 87.5% correctness but costs 3.2x tokens. /blindspots and /premortem are solid for finding holes in plans. Pick the code that matches your actual need rather than stacking them.

Q: Should I worry about phrasing if context matters more than how I ask?

Don’t spend your energy on word-smithing. Research suggests models care far more about what information you provide (full context, actual constraints, examples) than about precise phrasing. Focus on being complete and clear rather than optimizing every word. That’s where the leverage is.

I blind A/B tested 40 “secret” Claude prompt codes. Only 7 actually shift reasoning. Raw data inside.
by u/AIMadesy in PromptEngineering
