How LLMs Really Work: Attention & Context Explained

Most LLM explainers go one of two ways. Either “it predicts the next word” (useless), or a 40-page transformer paper with dense math (also useless).

A developer on r/PromptEngineering just did something different. He trained a miniature language model on four sentences about a bank. A river bank. A financial bank. A fisherman casting a net. Then he asked it to complete: “The investor walked to the bank to lock his money in…”

The model predicted vault. Not water. Not mud. Vault.

Four sentences. One ambiguous word. One correct prediction. That’s the entire experiment, and it’s a better mental model for how LLMs work than anything you’ll read in a product blog post.

Here’s why that matters.

The old way of thinking about LLMs

“It guesses the next word.” That’s technically true and completely misleading. It implies the model is winging it, running some sophisticated autocomplete, filling blanks like a probabilistic Mad Lib. It’s not.

The “next word” framing also implies randomness where there’s actually structure. When people think models are just guessing, they prompt accordingly: vague, uncontextualized, hoping the model fills in the gaps correctly. Then they complain when outputs drift. They’re not getting bad luck. They’re getting exactly what the trained model computes from the tokens it was given.

The new way: the model runs a full disambiguation pipeline on every single token, every time you type anything. “Bank” appears in all four training sentences with different meanings. The model doesn’t know what a bank is. It learns which words cluster around which other words, and uses that to score what comes next.

“Investor” and “money” in the same sentence pull “bank” toward the financial meaning. That’s attention doing its job. Change “investor” to “fisherman” and “money” to “rod,” and the same model picks a completely different next token. Same word, different context, different meaning. The model isn’t confused. It’s reading the room.
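To make the “which words cluster around which other words” idea concrete, here is a minimal sketch. It is not the author’s model and it is not attention: it swaps in plain co-occurrence counting, which is much cruder. The four training sentences and the stopword list are hypothetical, since the post doesn’t quote the originals.

```python
from collections import Counter, defaultdict

# Hypothetical training sentences (the original four are not quoted in the post).
SENTENCES = [
    "the investor took his money to the bank vault",
    "the banker locked the money in the bank vault",
    "the fisherman sat on the river bank by the water",
    "the fisherman cast his net from the bank into the water",
]

# Hypothetical stopword list so that filler words don't dominate the counts.
STOPWORDS = {"the", "his", "to", "in", "on", "by", "from", "into", "a"}


def content_words(text):
    """Split on whitespace and drop stopwords."""
    return [w for w in text.split() if w not in STOPWORDS]


def build_cooccurrence(sentences):
    """Count how often each pair of content words shares a sentence."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = content_words(sentence)
        for word in words:
            for other in words:
                if other != word:
                    counts[word][other] += 1
    return counts


def score_candidates(context, candidates, counts):
    """Score each candidate by summing its co-occurrence with the context words."""
    words = content_words(context)
    return {c: sum(counts[w][c] for w in words) for c in candidates}


counts = build_cooccurrence(SENTENCES)
financial = score_candidates("the investor money bank", ["vault", "water"], counts)
river = score_candidates("the fisherman rod bank", ["vault", "water"], counts)

print(financial)  # vault scores higher than water
print(river)      # water scores higher than vault
```

“Bank” alone scores “vault” and “water” equally here. It’s the neighbors that break the tie, which is the intuition attention builds on with learned weights instead of raw counts.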

🧩 The pipeline that turned “bank” into “vault”

📍 Tokenization + Embeddings: the query gets sliced into known tokens, each assigned a vector (its learned “meaning”) plus a position tag. Position matters because “bank” at the start of a sentence behaves differently than “bank” after “investor walked to the.” The model treats both the identity and location of a word as inputs.
🎯 Attention: each token scores how much every other token in the context should influence it (in a GPT-style model, only the tokens before it). “Investor” and “money” outweigh “walked” when pulling “bank” toward a meaning. This is the mechanism that lets the model hold context across long passages. It’s not reading left to right and forgetting. It’s running a weighted vote across the tokens in its window, all at once.
Feed-Forward Network: deeper pattern processing on top of what attention surfaced. Think of this as a second pass: not just “which tokens cluster together” but “given this cluster, what patterns from training apply.” This is where much of the model’s learned knowledge is thought to be stored and mixed in with the local context.
📖 LM Head: the final layer maps back to the vocabulary and scores every possible next token. “Vault” wins. “Mud” loses. Not because the model decided vault is the right answer, but because vault scored highest given every weight applied across every previous layer. Those scores are deterministic given the inputs and the weights; any randomness comes afterward, from how a token is sampled from them. That’s the part the “guessing” framing misses entirely.
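The four stages above can be sketched in plain Python. This is a rough illustration under loud assumptions, not the code from the walkthrough: the vocabulary, the width `D`, and every weight matrix are made up and randomly initialized (untrained), there is one attention head and one layer, and layer normalization is left out. Because nothing is trained, the winning token is meaningless. The point is only the shape of the computation.

```python
import math
import random

# Hypothetical toy vocabulary and embedding width.
VOCAB = ["investor", "walked", "bank", "money", "vault", "mud"]
D = 8
MAX_LEN = 16

rng = random.Random(0)  # fixed seed so the "weights" are reproducible


def rand_matrix(rows, cols):
    """Random stand-in for a trained weight matrix."""
    return [[rng.gauss(0, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(vec, mat):
    """Row vector (length n) times matrix (n x m) -> vector (length m)."""
    return [sum(v * mat[i][j] for i, v in enumerate(vec)) for j in range(len(mat[0]))]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


TOK_EMB = rand_matrix(len(VOCAB), D)   # one vector per token
POS_EMB = rand_matrix(MAX_LEN, D)      # one vector per position
W_Q, W_K, W_V = rand_matrix(D, D), rand_matrix(D, D), rand_matrix(D, D)
W_1, W_2 = rand_matrix(D, 4 * D), rand_matrix(4 * D, D)
W_OUT = rand_matrix(D, len(VOCAB))     # LM head


def forward(tokens):
    """Return next-token scores and the last position's attention weights."""
    ids = [VOCAB.index(t) for t in tokens]

    # 1. Tokenization + embeddings: token vector plus position vector.
    x = [[e + p for e, p in zip(TOK_EMB[i], POS_EMB[pos])] for pos, i in enumerate(ids)]

    # 2. Attention: the last position scores itself and every earlier position.
    last = len(x) - 1
    q = matvec(x[last], W_Q)
    keys = [matvec(row, W_K) for row in x]
    values = [matvec(row, W_V) for row in x]
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(D) for k in keys]
    weights = softmax(scores)
    mixed = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(D)]
    h = [a + b for a, b in zip(x[last], mixed)]  # residual connection

    # 3. Feed-forward network: expand, apply ReLU, project back.
    hidden = [max(0.0, a) for a in matvec(h, W_1)]
    h = [a + b for a, b in zip(h, matvec(hidden, W_2))]

    # 4. LM head: one score (logit) per vocabulary entry.
    return matvec(h, W_OUT), weights


logits, attn = forward(["investor", "walked", "bank", "money"])
probs = softmax(logits)
greedy_pick = VOCAB[probs.index(max(probs))]  # highest-scoring token
```

Calling `forward` twice on the same tokens returns identical scores. Picking the top score is one way to choose the next token; sampling from `probs` is another, and that is the only step where randomness would enter.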

Why prompt engineers should care

When a model misreads context, it’s not being dumb. It’s weighting token relationships exactly the way it was trained. That’s why dropping explicit context early in a prompt shifts outputs. You’re changing what every later token can attend to while the model builds its reading of the prompt.

Front-load the frame. The model decides fast.

Practically, this means three things. First: context before task. If you need the model to reason about your startup’s churn problem, say “startup, SaaS, subscription churn” before you get to the question. Every token that follows can then attend to that frame. Second: specificity beats length. A vague 500-word prompt is usually less useful than a precise 100-word one. The model isn’t reading your prompt for thoroughness. It’s building a probability distribution from the tokens you give it. Third: if the model keeps drifting toward a meaning you don’t want, the token pulling it there is usually upstream in your prompt. Find it and rewrite it. Don’t add more words at the end hoping to correct course while the misleading token is still sitting in the context.
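The “context before task” habit is easy to bake into a helper. This is a small sketch, not a recommended template: the function name `build_prompt`, the “Context:” and “Task:” labels, and the example question are all hypothetical choices made for illustration.

```python
def build_prompt(context_terms, task):
    """Put the framing terms ahead of the task so later tokens can attend to them."""
    frame = "Context: " + ", ".join(context_terms) + "."
    return frame + "\n\n" + "Task: " + task


# Hypothetical example question; the framing terms come from the paragraph above.
prompt = build_prompt(
    ["startup", "SaaS", "subscription churn"],
    "Which numbers should we look at first to understand why customers leave?",
)
print(prompt)
```

If the output still drifts, the place to look is the `context_terms` list and the wording of the task itself, not a longer tail of corrections appended at the end.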

The four-sentence experiment makes this intuitive in a way that transformer diagrams never do. Once you see that “vault” came from a deliberate contextual pull and not a lucky guess, you start writing prompts differently.

The full YouTube walkthrough builds this from scratch with actual code. If you want to understand what’s really happening when ChatGPT finishes your sentence, it’s worth an hour of your time.

LLM internals explained ( Insight of language model head)
by u/abhishekkumar333 in PromptEngineering
