Most people tweak their model when results disappoint. Swap Claude for GPT, try a newer version, experiment with temperature. The model gets the blame and the credit.
This Reddit post flips that assumption hard. The same model, the same benchmark, wildly different scores depending entirely on how the prompts were written.
The original poster, u/K_Kolomeitsev, spent months building an open-source deep research agent called Agent Browser Workspace that gives LLMs a real browser, then tested it against DeepResearch Bench, a suite of 100 PhD-level research tasks. The final score: 44.37 RACE overall using Claude Haiku, beating Perplexity Deep Research’s 42.25 on the same benchmark. What drove the improvement? Prompt engineering choices. Not model swaps.
Here’s what the author actually changed.
Quick Start
What you’ll learn: five specific prompt engineering strategies for multi-step research agents.
What you need: any LLM, an agent setup with browser or tool access, and a willingness to stop prompting vaguely.
The core idea: specificity beats ambition
The old way to prompt a research agent is something like “research this topic and write me a comprehensive report.” That sounds reasonable. It produces confident, fluent, plausible-sounding text that may not be grounded in anything the agent actually found. It’s a recipe for hallucination dressed up as analysis.
The author’s approach instead: break every stage into explicit, verifiable steps. Tell the agent exactly what to do when things go wrong. Define what “done” looks like. The difference in output quality is not subtle.
Step-by-step: what the author changed
Step 1: Replace one-shot commands with escalation chains
Half the web doesn’t work with a simple “get the page content” instruction. JavaScript renders late, content loads lazily, single-page apps serve empty shells on first load.
The prompt that works tells the agent: load the page first. If it’s empty, wait for JS to stabilize. Still nothing? Pull text directly from the DOM using evaluate(). Can’t get text at all? Take a full-page screenshot. Content requires scrolling? Scroll first, then extract.
This single change stopped the agent from silently skipping pages that needed special handling. Fewer skipped sources means deeper research. The author calls it an escalation chain, and it’s the clearest example of anticipating failure modes inside the prompt itself.
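As a rough illustration, the chain can be sketched in Python, assuming each fallback is wrapped as a function that takes a URL and returns text. The name `extract_with_escalation`, the `min_chars` threshold, and the strategy labels are hypothetical and are not taken from the Agent Browser Workspace repo; in the original post the chain lives in the prompt, not in code.

```python
from typing import Callable, Optional, Sequence, Tuple

# A strategy takes a URL and returns extracted text, or None/"" when it finds nothing.
Strategy = Callable[[str], Optional[str]]


def extract_with_escalation(
    url: str,
    strategies: Sequence[Tuple[str, Strategy]],
    min_chars: int = 200,  # assumed cutoff for "the page is effectively empty"
) -> Tuple[Optional[str], Optional[str]]:
    """Try each strategy in order and return (strategy_name, text) for the
    first one that yields at least min_chars of text, else (None, None)."""
    for name, strategy in strategies:
        try:
            text = strategy(url)
        except Exception:
            # A failing step escalates to the next one instead of aborting.
            continue
        if text and len(text.strip()) >= min_chars:
            return name, text
    return None, None
```

In a real agent the strategies would wrap the browser steps in the order the prompt gives them: plain load, wait for JS, DOM extraction, screenshot, scroll then extract. Returning the name of the step that succeeded makes it easy to see in logs which pages needed special handling.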
Step 2: Collect evidence first, write the report last
The standard prompt collapses collection and synthesis into one step. The agent narrates its way through findings without being forced to build a real evidence base.
The author’s approach separates the two stages explicitly: “Save search results to links.json first. Open each result one by one. Save content to disk as Markdown. Build a running insights file. Only write the final report after every source is collected.”
There’s a practical bonus here too. If the session crashes mid-run, you resume from the last saved artifact. Nothing is lost. The structure enforces both quality and resilience.
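The collect-then-resume behavior could look roughly like the sketch below. Only `links.json` and the Markdown-on-disk idea come from the post; the `source_001.md` naming scheme, the function names, and the `fetch_markdown` callback are assumptions for illustration, and the running insights file is left out to keep the example short.

```python
import json
from pathlib import Path
from typing import Callable, List


def collect_sources(
    workdir: Path,
    links: List[str],
    fetch_markdown: Callable[[str], str],  # hypothetical: URL -> page as Markdown
) -> List[Path]:
    """Save links.json, then save each source as Markdown, skipping files
    that an earlier (possibly crashed) run already wrote."""
    workdir.mkdir(parents=True, exist_ok=True)
    links_file = workdir / "links.json"
    if links_file.exists():
        # Resuming: reuse the link list saved by the earlier run.
        links = json.loads(links_file.read_text(encoding="utf-8"))
    else:
        links_file.write_text(json.dumps(links, indent=2), encoding="utf-8")

    saved = []
    for index, url in enumerate(links, start=1):
        target = workdir / f"source_{index:03d}.md"
        if not target.exists():
            target.write_text(fetch_markdown(url), encoding="utf-8")
        saved.append(target)
    return saved


def ready_for_report(workdir: Path) -> bool:
    """True only when every link in links.json has a saved Markdown file."""
    links = json.loads((workdir / "links.json").read_text(encoding="utf-8"))
    return all(
        (workdir / f"source_{i:03d}.md").exists()
        for i in range(1, len(links) + 1)
    )
```

Gating the synthesis step on something like `ready_for_report` is one way to enforce “only write the final report after every source is collected” outside the prompt as well as inside it.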
Step 3: Use specific expansion prompts, not vague “go deeper”
“Research more” is an instruction that means nothing to an agent. What does more mean? More sources? More depth on a specific claim? More coverage of a subtopic?
Replace it with concrete, countable tasks:
- “Find 10 additional sources from domains not yet in links.json.”
- “Cross-reference the revenue figures from sources 2, 5, and 8.”
- “Build a comparison table of the top 5 alternatives mentioned across all sources.”
Specific instructions produced measurably better output than open-ended ones. The agent knows what to look for. It knows when it’s done.
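Part of what makes a countable task work is that “done” can be checked mechanically. The sketch below shows one way to verify the first example task, assuming the saved links are a flat list of URLs; `domain_of` and `pick_new_domain_sources` are hypothetical helper names, not part of the author’s toolkit.

```python
from typing import Iterable, List, Set
from urllib.parse import urlparse


def domain_of(url: str) -> str:
    """Return the lowercase host of a URL, without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def pick_new_domain_sources(
    known_links: Iterable[str],
    candidates: Iterable[str],
    target: int = 10,
) -> List[str]:
    """Select up to `target` candidate URLs, each from a domain that is not
    already covered by known_links or by an earlier pick."""
    seen: Set[str] = {domain_of(u) for u in known_links}
    picked: List[str] = []
    for url in candidates:
        if len(picked) >= target:
            break
        domain = domain_of(url)
        if domain and domain not in seen:
            seen.add(domain)
            picked.append(url)
    return picked
```

If the returned list is shorter than `target`, the task is not finished yet, which is exactly the kind of stop condition “research more” never provides.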
Step 4: Pre-map site profiles instead of discovering selectors every time
Making the agent rediscover CSS selectors on every page visit burns tokens and produces unreliable results. The agent guesses, often guesses wrong, and on the next visit starts guessing again from scratch.
The solution the author built: store selectors for common sites in JSON profiles. The agent prompt then says: “Check for a site profile first. If one exists, use its selectors. Discover manually only for unknown sites.” Token waste dropped noticeably. This is the kind of operational detail most tutorials skip, but it compounds over long research sessions.
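A profile lookup along these lines might look like the following sketch. The profile schema, the domains, and the selectors are placeholders invented for illustration; the post says only that selectors are stored in JSON profiles, not what those files contain.

```python
import json
from typing import Dict, Optional
from urllib.parse import urlparse

# Hypothetical profile data: placeholder domains and selectors, not real site markup.
PROFILES_JSON = """
{
  "example.com": {"title": "h1.article-title", "body": "div.article-body"},
  "docs.example.org": {"title": "h1", "body": "main"}
}
"""


def load_profiles(raw: str) -> Dict[str, Dict[str, str]]:
    """Parse the JSON profile document into a domain -> selectors mapping."""
    return json.loads(raw)


def selectors_for(url: str, profiles: Dict[str, Dict[str, str]]) -> Optional[Dict[str, str]]:
    """Return the stored selectors for the URL's domain, or None when the
    site is unknown and selectors have to be discovered manually."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return profiles.get(host)
```

A `None` result is the signal to fall back to manual discovery, and a newly discovered selector set could then be written back to the profile file so the next visit skips the guessing.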
Step 5: Mandate source attribution with a flag for unverified claims
This one is a single instruction: “Every factual statement in the report must reference a specific source by filename. If you can’t attribute a claim, flag it as unverified.”
That’s it. The agent can no longer generate plausible text without pointing at where each fact came from. Ungrounded claims get flagged explicitly rather than buried in confident prose. It doesn’t eliminate hallucination entirely, but it surfaces it instead of hiding it.
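The instruction itself lives in the prompt, but the result can also be spot-checked after the fact. The sketch below assumes a citation format of `[source: filename.md]` and an `[unverified]` marker, both invented here since the post does not specify the exact syntax, and it treats each non-heading line of the report as one claim.

```python
import re
from typing import List, Set, Tuple

# Assumed citation syntax: [source: some_file.md]
CITATION = re.compile(r"\[source:\s*([^\]\s]+)\s*\]")


def audit_report(lines: List[str], known_files: Set[str]) -> List[Tuple[int, str]]:
    """Return (line_number, problem) for each claim line that neither cites a
    known source file nor carries the [unverified] flag."""
    problems: List[Tuple[int, str]] = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text or text.startswith("#") or "[unverified]" in text:
            continue  # blank lines, headings, and flagged claims pass
        cited = CITATION.findall(text)
        if not cited:
            problems.append((number, "no source and not flagged as unverified"))
        for name in cited:
            if name not in known_files:
                problems.append((number, f"cites unknown file: {name}"))
    return problems
```

A check like this does not confirm that the cited file actually supports the claim; it only catches statements with no pointer at all, or pointers to files that were never collected.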
Old way vs. new way at a glance
| Old approach | New approach |
|---|---|
| One-shot “get page content” | Escalation chain with explicit fallback steps |
| “Research and write a report” | Collect all sources first, then synthesize |
| “Go deeper” | Specific, countable expansion tasks |
| Discover selectors each visit | Cached site profiles, discover only for unknown sites |
| Generate and hope | Attribute every claim or flag it as unverified |
What to try next
If you’re building or using research agents, start with step 5. It’s one sentence and it immediately changes how you evaluate output. Unverified claims that used to hide in confident prose will start surfacing visibly.
Then work backwards: add escalation chains to your page-fetching logic, split your prompts into a collection phase and a synthesis phase, and replace every instance of “go deeper” with a numbered, specific task.
The full research methodology lives in RESEARCH.md inside the repo, and the toolkit works with any LLM. Head over to the original thread on r/PromptEngineering to read the full discussion and compare notes with other practitioners building multi-step agents. The conversation is worth your time.
Frequently Asked Questions
Q: Why do simple commands like “get page content” fail so often?
Many websites render content with JavaScript or load it lazily, meaning the first request returns an empty shell. Instead of giving up, escalate: wait for JS to stabilize, extract text from the DOM, take a screenshot, and scroll to load hidden content. This fallback chain stopped the agent from silently skipping pages that needed special handling.
Q: How does collecting evidence first actually reduce hallucinations?
When agents research and write simultaneously, they naturally “fill in” missing pieces with plausible-sounding information. By forcing the agent to collect and save all sources first (search results to links.json, then page content as Markdown files), you create an evidence checkpoint before synthesis. The agent can’t weave a narrative without the threads; it must write from what it actually found.
Q: Why is “research more” less effective than specific prompts?
Vague instructions like “research more” or “go deeper” don’t give your agent clear targets. Specific prompts work better: “Find 10 additional sources from domains not yet in links.json” or “Cross-reference these three revenue figures.” Specificity eliminates ambiguity and keeps the agent focused on concrete next steps.
Q: Does a bigger context window automatically mean better research?
Not quite. Larger context windows help, but they don’t fix the core challenge: inference quality. As one commenter noted, bigger contexts can actually amplify errors if the agent isn’t properly constrained. The real win is careful prompt design that forces evidence-based reasoning, not just more room to think.
Source: “Lessons from prompt engineering a deep research agent that scored above Perplexity on 100 PhD-level tasks” by u/K_Kolomeitsev in r/PromptEngineering