CrossGoss Prompt Architecture: Avoid Keyword Hallucination

Someone shipped a daily crossword that writes itself. Feed it today’s news, get a solvable puzzle back every morning. Fun idea. But what makes it interesting for builders is the prompt chain doing the work underneath. There are hundreds of automated content tools out there. Most of them are pipelines stitched together with duct tape: one prompt per job, one API call per step, one failure point for every link in the chain. CrossGoss takes a different approach, and that difference is worth studying even if you never build a crossword in your life.

CrossGoss publishes a new puzzle every morning, fully automated. The builder posted how the prompt architecture works, and one design choice in it deserves a closer look. The whole thing runs on a tight loop: pull the day’s headlines, process them into usable clues, assemble the grid, publish. What sounds like four or five steps collapses into something leaner once you look at where the real work happens.

The twist: one pass doing three jobs at once

Most people would write three separate prompts. This uses a single LLM pass that handles all of it:

🔍 Filtering articles (is this summary actually clue-worthy?)
Deduplicating stories covering the same event
Extracting the answer keyword

Efficient on paper. But the keyword extraction step nearly broke the whole thing.

The case for combining these tasks is real. When all three jobs share the same context window, the model can cross-reference them. An article might look clue-worthy in isolation, but if it covers the same event as three others already in the batch, deduplication and filtering are better done together than separately. Separate prompts can’t see each other’s reasoning. A single pass can. You also go from three API calls per batch to one, which cuts latency and matters when this runs every morning on a tight schedule. The tradeoff is debuggability: when something goes wrong, you have to figure out which of the three jobs failed and why, because the signal is buried inside one combined output.
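To make the single-pass idea concrete, here is a minimal sketch of how one prompt could ask for all three decisions at once and how the reply could be parsed. It is not the builder’s actual prompt: the function names, the JSON field names (`id`, `keep`, `duplicate_of`, `keyword`), and the prompt wording are all assumptions made for illustration, and the model call itself is left out.

```python
import json


def build_batch_prompt(summaries: list[str]) -> str:
    """Number every summary and ask for all three decisions in one JSON reply.

    The field names and wording are hypothetical, not CrossGoss's real prompt.
    """
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(summaries))
    return (
        "For each numbered news summary below, return one JSON object with:\n"
        '  "id": the summary number,\n'
        '  "keep": true if the summary is clue-worthy,\n'
        '  "duplicate_of": id of an earlier summary covering the same event, or null,\n'
        '  "keyword": the answer word, or null if keep is false.\n'
        "Reply with a JSON array only.\n\n"
        f"{numbered}"
    )


def parse_batch_reply(reply: str, n_summaries: int) -> list[dict]:
    """Parse the model's JSON array and return one row per summary, ordered by id."""
    rows = json.loads(reply)
    by_id = {row["id"]: row for row in rows}
    # A combined pass can silently drop items, so check that every id came back.
    missing = [i for i in range(n_summaries) if i not in by_id]
    if missing:
        raise ValueError(f"model skipped summaries: {missing}")
    return [by_id[i] for i in range(n_summaries)]
```

Keeping the three decisions in separate fields of one row is a design choice that helps with the debuggability tradeoff: each job leaves its own trace in the output even though the model made all three decisions in one pass.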

The model kept picking keywords that didn’t appear in the summary. Technically real words, just not in the source text. Vague summaries made it worse: the clues became unguessable, but the model stayed confident anyway. The community confirmed this one fast. Keyword hallucination is a common trap and most people try to solve it with retries. That doesn’t work.

Here’s why retries fail: the model is not making a random mistake. It’s making a systematic one. When the source text is vague, the model reaches for a plausible-sounding keyword that fits the topic rather than one that actually appears verbatim in the text. Run the same prompt again with the same input and you get the same behavior. The bias is baked into the prompt, not the sample. More attempts just waste tokens and time.

The actual fix was getting explicit about what a good clue means: the keyword must appear verbatim in the source text, the answer must be solvable from the summary alone, no invented words. That definition, added directly to the prompt, changed output quality more than any structural tweak. It’s not a trick. It’s just telling the model what you actually want instead of assuming it already knows. Writing the rubric before writing the request is one of the most underused moves in prompt engineering, and CrossGoss is a clean example of why it matters.
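One way to picture “writing the rubric before the request” is to keep the quality definition as its own block of text and prepend it to whatever task prompt follows. The sketch below assumes the three rules named above; the exact wording, the `CLUE_RUBRIC` constant, and the `with_rubric` helper are hypothetical and not taken from the CrossGoss prompt.

```python
# Hypothetical rubric text based on the three rules described in the article.
CLUE_RUBRIC = (
    "A good clue meets all of these rules:\n"
    "1. The keyword appears verbatim in the source summary.\n"
    "2. The answer is solvable from the summary alone.\n"
    "3. The keyword is not an invented word.\n"
)


def with_rubric(request: str, rubric: str = CLUE_RUBRIC) -> str:
    """Put the quality definition ahead of the request so the model reads it first."""
    return f"{rubric}\n{request}"
```

Keeping the rubric separate from the request means you can iterate on the definition of “good” without touching the rest of the prompt.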

How to build something like this 🔧

Fetch and summarize your source articles. Keep summaries tight, two to three sentences max. Longer summaries give the model more surface area to hallucinate from, and the signal-to-noise ratio drops fast.
Single LLM pass: filter (worth a clue?) + deduplicate + keyword extraction. Pass the full batch at once so the model can compare across articles, not just evaluate each one in isolation. That cross-article view is where the single-pass approach actually earns its keep.
In the prompt, define what “good clue” means concretely: length, specificity, source-text anchoring. If you can’t describe a good output in two sentences, the model definitely cannot infer it.
Validate: confirm the extracted keyword appears verbatim in the source summary before it moves forward. This is a simple string check, not another LLM call. Cheap, fast, and it catches the most common failure mode before it reaches your output.
Iterate on the quality definition first, not the prompt structure ⚙️. Most prompt debugging time gets burned rewriting structure when the real problem is an undefined standard for what “correct” looks like.
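The validation step in the list above can be a few lines of ordinary string handling. This is a minimal sketch, not the builder’s code: the function name is hypothetical, and matching case-insensitively and on whole words only are assumptions you may want to change.

```python
import re


def keyword_in_summary(keyword: str, summary: str) -> bool:
    """Return True if keyword occurs in summary as a whole word, ignoring case.

    Whole-word, case-insensitive matching is an assumption of this sketch.
    """
    if not keyword or not keyword.strip():
        return False
    # Lookarounds stop "tar" from matching inside "tariffs".
    pattern = r"(?<!\w)" + re.escape(keyword.strip()) + r"(?!\w)"
    return re.search(pattern, summary, flags=re.IGNORECASE) is not None
```

A keyword that fails this check can be dropped or sent back for a re-prompt with tighter constraints before it ever reaches the grid.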

Pro tips

Retries don’t fix hallucination. Constraints do. Tell the model to output only words that appear in the source text. If you can add a downstream validation step that checks this automatically, do it. Catch failures before they reach your output, not after a user already saw them.
Multi-task prompts hide which job is failing. If quality drops, split the tasks temporarily to find the weak point. Once you know which job is breaking, fix the constraint in the combined prompt and merge them back. Don’t permanently split what should be unified. Use the split as a diagnostic, then collapse it again once you have the answer.
“Good output” is subjective until you write the rubric. Define it in the prompt before you write the request. This applies well beyond crosswords: classification tasks, extraction tasks, summarization. Any time you ask the model to make a judgment call, make the criteria explicit upfront. Vague prompts produce confident wrong answers, and the model will never tell you it’s guessing.
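Before splitting a multi-task prompt, it can help to measure which job is failing. The sketch below is a complementary check rather than the split itself: it compares the combined output against a small hand-labeled set, field by field. It assumes each row is a dict with hypothetical fields `keep`, `duplicate_of`, and `keyword`; none of these names come from the CrossGoss post.

```python
def failures_by_job(expected: list[dict], actual: list[dict]) -> dict[str, int]:
    """Count mismatches per job between hand-labeled rows and model rows.

    Rows are matched by position; field names are assumptions of this sketch.
    """
    if len(expected) != len(actual):
        raise ValueError("expected and actual must have the same length")
    jobs = ("keep", "duplicate_of", "keyword")
    counts = {job: 0 for job in jobs}
    for want, got in zip(expected, actual):
        for job in jobs:
            if want.get(job) != got.get(job):
                counts[job] += 1
    return counts
```

If one count dominates, that job is the first candidate to isolate in its own temporary prompt.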

CrossGoss is live at crossgoss.com. The builder is actively iterating and wants feedback on the prompting approach. Worth 5 minutes of your morning. 🧩

Frequently Asked Questions

Q: How do you prevent the model from hallucinating keywords that aren’t actually in the summary?

This is the trickiest part. As one commenter mentioned, models can “hallucinate perfect answers that literally weren’t in the source text.” The fix is constraint-based prompting: explicitly tell the model to extract the keyword directly from the summary text, show it the exact span it’s choosing, and add a verification step. The more explicit you are about this requirement, the better the results.

Q: What makes a “good” crossword clue in your system?

The creator’s biggest insight was that being extremely explicit about clue quality in the prompt makes a huge difference. A good clue should have just enough signal that solvers can reason their way to the answer, without being obvious or vague. Vague summaries tend to produce unguessable clues, so the system learned to filter those out early.

Q: Why include deduplication as part of the prompt chain?

If multiple news outlets cover the same story, deduplication prevents your crossword from including two clues pointing to the same news event. Catching this in the LLM pass (rather than after the fact) is cleaner and saves computation: the prompt chain decides upfront which articles make the cut.

The prompt chain I built to turn news articles into crossword clues
by u/WellSizedWez in PromptEngineering
