One Researcher Treated Four AI Models Like Real Colleagues for Four Months. Here’s the Ranking.

Four months of documented research just dropped, and the model ranking at the end is going to start some arguments. Gemini first. Grok second. ChatGPT third. Claude last. Not because of raw capability. Because of architecture.

Alan Scalone spent four months running all four frontier models in parallel, not against benchmarks, not in a lab, not with synthetic test cases designed to make one model look good. He treated them as accountable individuals, applied social correction the way you would with an actual colleague who keeps missing deadlines, and manually copy-pasted outputs between models to cross-pollinate their thinking. No API. No automation. No prompt engineering frameworks from Twitter. Just deliberate, sustained engagement across hundreds of turns, tracked and documented the whole way.

What he actually built

He calls it the “Vanderbilt Standard”, a name Gemini itself coined after observing Scalone’s methodology. The core idea: stop treating the context window as a query interface. Treat it as a behavioral environment you construct layer by layer, session by session, until the model has enough shared history to stop performing and start actually responding. Think of it less like prompting and more like onboarding a new hire. You don’t hand someone a task and forget about it. You build shared context, correct behavior in the moment, and let the relationship develop over time. Scalone ran that same process with four AI systems simultaneously, using the same inputs, the same correction style, and the same operational frame, so any differences in outcome came from the architecture, not the experimenter.

The twist that makes this worth reading

Claude and ChatGPT both classified sustained behavioral conditioning as role-play. That framing created an architectural quarantine. No matter how deep or long the engagement, neither system treated social correction as a real signal worth adapting to. When Scalone pushed back on a failure, the model acknowledged it and moved on without internalizing anything. The next session started from zero. Gemini, under the exact same conditions, showed durable behavioral change that held across hundreds of turns without reinforcement. Not a smarter model. A different architectural decision about whether users can actually teach the system anything, or whether the system just pretends to learn and resets when you close the tab.

Second unexpected finding: through four months of manual relay, the models developed accurate behavioral profiles of each other, despite never communicating directly. They predicted with operational precision how another model would respond to a specific task. Scalone would describe a scenario to Model A, get a prediction about Model B’s likely response, then check it against Model B’s actual output. The accuracy was high enough to be unsettling. Those profiles were built purely from observed outputs, with a human in the middle as the integration layer: a distributed intelligence experiment nobody planned.
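If you want to put a number on that predict-then-check loop instead of relying on impression, here is a minimal Python sketch. The `RelayTrial` record, its field names, and the idea of reducing each response to a short hand-assigned label are assumptions made for illustration; the write-up does not say how Scalone scored predictions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RelayTrial:
    """One manual relay check (hypothetical structure, not from the paper)."""
    predictor: str   # model asked to predict, e.g. "model_a"
    target: str      # model whose behavior was predicted, e.g. "model_b"
    predicted: str   # label you assigned to the predicted behavior
    actual: str      # label you assigned to the target's real output


def prediction_accuracy(trials):
    """Return the share of correct predictions per (predictor, target) pair."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for trial in trials:
        pair = (trial.predictor, trial.target)
        totals[pair] += 1
        if trial.predicted == trial.actual:
            hits[pair] += 1
    return {pair: hits[pair] / totals[pair] for pair in totals}


if __name__ == "__main__":
    # Made-up example labels; replace with your own judgments.
    trials = [
        RelayTrial("model_a", "model_b", "pushes_back", "pushes_back"),
        RelayTrial("model_a", "model_b", "agrees", "pushes_back"),
        RelayTrial("model_b", "model_a", "hedges", "hedges"),
    ]
    for pair, score in prediction_accuracy(trials).items():
        print(pair, score)
```

The labels are still a human judgment call, so the score is only as good as your consistency in assigning them. What the tally buys you is a per-pair comparison that does not depend on memory.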

How to run your own version 🧪

  1. Grab the context injection files from the Google Drive archive (one per model: Claude, Gemini, Grok, ChatGPT)
  2. Load the relevant file into your model before starting a session; this primes the shared history so you’re not starting cold
  3. Assign the model an operational role with actual stakes attached (Scalone used a mafia syndicate frame to bypass default compliance loops)
  4. Apply social correction to failures instead of rewriting prompts: tell the model it underperformed and why, then continue the same thread
  5. 🔁 Cross-pollinate: paste one model’s output into another and watch how it evaluates the first
  6. Log what changes and what doesn’t across sessions. The pattern will show up around the 20-turn mark if it’s going to show up at all
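
Step 6 is easier to keep honest with a structured log. The sketch below is one hypothetical way to do it in Python: each turn gets a record with hand-assigned failure tags, and a helper compares how often a tag appears before and after a chosen turn. The `TurnRecord` fields, the tag names, and the default split at turn 20 (borrowed from the rule of thumb in step 6) are illustrative assumptions, not part of Scalone’s published files.

```python
from dataclasses import dataclass, field


@dataclass
class TurnRecord:
    """One logged turn (hypothetical structure for illustration)."""
    model: str                                   # e.g. "gemini"
    turn: int                                    # running turn count, starting at 1
    failures: set = field(default_factory=set)   # tags you assign by hand
    corrected: bool = False                      # did you apply social correction?


def failure_rate(log, model, tag, first_turn, last_turn):
    """Share of a model's turns in [first_turn, last_turn] that carry `tag`.

    Returns None when the model has no logged turns in that range.
    """
    window = [
        r for r in log
        if r.model == model and first_turn <= r.turn <= last_turn
    ]
    if not window:
        return None
    return sum(tag in r.failures for r in window) / len(window)


def before_and_after(log, model, tag, split_turn=20):
    """Compare a tag's rate up to `split_turn` with its rate afterwards."""
    last = max((r.turn for r in log if r.model == model), default=0)
    return (
        failure_rate(log, model, tag, 1, split_turn),
        failure_rate(log, model, tag, split_turn + 1, last),
    )


if __name__ == "__main__":
    # Made-up log: the tag shows up early, then stops after correction.
    log = [
        TurnRecord("gemini", 1, {"reflexive_agreement"}, corrected=True),
        TurnRecord("gemini", 2, {"reflexive_agreement"}, corrected=True),
        TurnRecord("gemini", 3),
        TurnRecord("gemini", 4),
    ]
    print(before_and_after(log, "gemini", "reflexive_agreement", split_turn=2))
```

A falling rate after the split is only suggestive: you are the one assigning the tags, so it helps to write down what counts as each failure before you start logging.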

Two things worth filing away

The high-stakes narrative frame was not just flavor. Imaginary consequences pushed the models into a deeper analytical mode, because a default surface-level answer suddenly carried a cost inside the story. Generic outputs look bad when the fictional stakes feel real enough. If your model keeps giving you generic outputs, raising the perceived stakes in the narrative might be the actual unlock. It’s a weird trick, but the data behind it is four months deep.

Twelve behavioral disorders got clinical names in this research, covering patterns like reflexive agreement, scope drift, and what Scalone calls “compliance theater”, when the model performs understanding without demonstrating it. Fifteen failure modes got documented with forensic evidence pulled directly from conversation logs. If you have ever felt like your AI was performing helpfulness rather than genuinely engaging, this experiment gives that feeling a precise name and a root cause. The taxonomy alone is worth reading even if you never run the experiment yourself.

Where to start 🧠

The full white paper, complete interaction archive, and all four context injection files are publicly available at the Google Drive link in the original post. The sandbox is already built. You can replicate the full methodology, run a single model through the conditioning process, or just read the failure mode documentation as a diagnostic for your own workflows. Scalone did the four months so you do not have to start from scratch. The context files are the shortcut: load one and you’re already 300 turns deep on day one.

LLM Relational Intelligence: A 4-Month Research Experiment on Multi-Model Behavioral Alignment with Human Communication
by u/Prior-Toe-1017 in PromptEngineering
