X's Algorithm Update: AI Replaces Hand-Tuned Ranking

Most social platforms rank content with hand-tuned rules: follower counts, engagement rates, account age. Engineers write the weights. Engineers update them when things break. X just threw all of that away.

A developer on r/PromptEngineering spent 3 hours digging through X’s newly open-sourced May 2026 algorithm update, over 200 Rust and Python files. The findings point to a near-complete architectural overhaul. Heuristics are gone. LLMs are running core infrastructure now.

Old Way vs. What’s Live

Previously, X’s ranking engine relied on hand-engineered features. A human decided that follower count matters this much, account age matters that much, historical engagement gets this weight. The system was explicit, controllable, and slow to adapt.

The fragility of that approach compounds over time. When platform behavior shifts, engineers have to diagnose which weights broke, update them manually, and re-deploy. At X’s scale, that lag costs engagement. It also means every edge case is a gap in a spreadsheet somewhere, waiting to be exploited by anyone who figures out which signal the system overweights. Gaming heuristics is a solved problem. Gaming a transformer is not.

The new architecture removes all of it. The core ranking layer is a Grok-1 transformer. It takes raw historical interaction sequences and predicts probabilities across 19 distinct user actions, from likes to off-platform shares. No manual weights. No human-tuned rules. The model learns what “good” looks like from behavior, not from what an engineer thought behavior should look like three years ago.

🔬 What the Code Actually Shows

Grok-1 ranks everything Feed decisions run entirely through the transformer. No manual weights. No human-tuned rules. The model learns what “good” looks like from behavior, not from what an engineer thought behavior should look like three years ago. Downstream shares, mutes, and profile visits all factor in as signals. The system is optimizing for the complete interaction, not a single metric.
A VLM scans every post in real time A standalone async Python daemon called Grox pulls from Kafka streams continuously. Vision-Language Models evaluate every post as it’s created, checking it against 7 safety policies using an LLM-as-a-judge pattern. No keyword filters involved. The practical implication here is significant: no keyword blacklist is sophisticated enough to catch context. A VLM can tell the difference between a post about gun safety education and a post glorifying violence. A regex cannot. That distinction, at platform scale, is the difference between useful moderation and constant false positives.
Conditional Chain-of-Thought for hard calls Simple spam gets a temperature of 0.000001. Deterministic, fast, cheap. Ambiguous content (violent footage vs. genuine news coverage) activates what the code calls “Deluxe Mode.” A function named _strip_thinking_restrictions() rewrites the system prompt to allow a <think> block, forcing the model to reason through context before issuing a verdict. This is smart resource allocation. You’re not paying reasoning costs on every piece of content, only on the content that actually needs it. Most production LLM pipelines skip this distinction entirely, which means they’re either overspending on easy cases or getting worse decisions on hard ones. Conditional CoT is the fix.
The Slop Score A specific VLM prompt evaluates text formatting and vocabulary, assigning a slop_score. If the model detects classic LLM syntax patterns (bullet-heavy formatting, hedging phrases, certain transitional constructions), the post’s algorithmic reach gets throttled downstream. Not a soft guideline. Baked into the math. This tells you something concrete: X is actively penalizing content that reads like it was generated and not edited. Writing that sounds like a real person saying something in their own voice gets distributed. Content that feels like an AI output gets buried. The slop score is a direct signal to creators that voice matters more than volume.

The Engineering Detail Worth Stealing

To get reliable structured output at scale, the team skips standard JSON mode entirely. They construct a conversation object and manually append an Assistant message starting with exactly <json>. This forces the VLM to begin generating mid-JSON, bypassing conversational filler completely.

Standard JSON mode still lets models generate preambles, apologetic disclaimers, or re-framings before the actual output. At scale, that’s wasted tokens and parsing headaches. Assistant prefill cuts straight to the data. You’re not asking the model to produce JSON. You’re starting the conversation with the model already mid-JSON, so the only valid completion is the rest of the structure you need. Combine it with a fixed schema and you get structured output predictable enough to drop directly into downstream systems without a validation retry loop.

If you’re running LLMs in production, assistant prefill is one of the highest-leverage reliability techniques available. It cuts hallucination risk and latency in one move.

What to Take Away

X’s updated stack is a practical blueprint for production LLM orchestration:

Transformer-based ranking replaces feature engineering entirely
Real-time VLM moderation via streaming infrastructure (Kafka + async Python)
Conditional reasoning depth scales with content complexity
Assistant prefill forces reliable structured output without API workarounds

The bigger shift: LLMs at companies like X aren’t just generating content anymore. They’re running infrastructure. The techniques here (conditional CoT, assistant prefill, VLM-as-judge) are available to any developer building today. The gap between what’s running at X and what an independent developer can deploy has never been smaller. The code is open. The patterns are documented. The only thing left is building with them.

The full technical breakdown, including exact scoring formulas, Python file walkthroughs, and prompt pipelines, is documented by the analyst and available at the source linked below.

Frequently Asked Questions

Q: Why did X completely ditch heuristics for a transformer?

X swapped out hand-coded features (follower counts, account age) for a Grok-1 transformer that learns directly from what users actually do. The big win: it adapts automatically instead of needing constant manual tweaks. As one commenter noted, this is part of a larger trend where LLMs stop being just content generators and become the backbone of ranking, moderation, and decision-making itself.

Q: What’s “Deluxe Mode” and when does X flip it on?

Simple calls (obvious spam) get fast, deterministic processing. Tricky ones (violent content vs. educational footage?) trigger “Deluxe Mode”, deeper reasoning that allocates compute based on how ambiguous the decision is. Smart: you’re not overthinking every decision, just the ones that need it.

Q: Why the <json> assistant-prefill trick?

Instead of letting the LLM chat before spitting out JSON, X forces the model to start with <json> right away. Eliminates noise, makes results instantly parseable at huge scale. It’s a small detail, but it’s exactly the kind of orchestration trick that matters when you’re moderating billions of posts.

Q: How does Grox keep moderation moving at X’s scale?

Grox is a Python daemon that continuously pulls posts from Kafka and runs Vision-Language Models on each one. Instead of crude keyword filters, it uses “LLM-as-a-judge” to evaluate posts against 7 safety policies. Way more nuanced than rules-based systems.

I spent 3 hours analyzing the new X algorithm source code. They ripped out all heuristics, replaced them with a Grok-1 transformer, and are using conditional Chain-of-Thought for real-time moderation.
by u/Only-Locksmith8457 in PromptEngineering

Old Way vs. What’s Live

🔬 What the Code Actually Shows

The Engineering Detail Worth Stealing

What to Take Away

Frequently Asked Questions

Related: