Two Facts That Opus Missed Completely Changed This Researcher’s Stack

MiniMax M3 Finds What Opus Misses

A vendor announcement buried in a regional press release. A CFO comment tucked inside an investor call transcript. Two facts, completely invisible to one of the most capable AI models available right now.

That’s what happened when a solo competitive intelligence researcher ran his actual client work through MiniMax M3 and Claude Opus 4.7 side by side. Same prompts. No retries. Real messy queries like “find every pricing change announced by HR SaaS vendors in the last 90 days.” M3 found both facts. Opus didn’t.

He stared at the screen for a moment, then quietly started moving his stack.

🔍 Why This Result Actually Matters

BrowseComp is one of the few benchmarks that measures something real: can the model navigate the live web and find specific facts, not just summarize what it already knows.

M3 scored 83.5 on BrowseComp. Opus 4.7 scored 79.3. A gap of roughly four points sounds small until one of those points is a buried datapoint your client paid for.

In competitive intelligence, a missed fact doesn’t sit quietly in a footnote. It shapes the recommendation. Miss a pricing move and your client thinks their competitor held steady when they actually just retooled their whole packaging. That’s a wrong call, not just a slight gap. You either redo the report or you ship something broken.

This researcher does 3 to 5 industry deep dives a week for B2B SaaS clients: pricing teardowns, regulatory shifts, new entrant analysis. At that volume, a model that misses facts is a weekly liability, not an edge case.

📋 How He Actually Ran the Test

The methodology here is worth stealing, not just the result.

  1. Use your real work prompts. He ran 5 actual client queries from that week’s workload. No curated examples, no simplified versions. Messy is the point. One query was literally: “pull any pricing or packaging changes from the last 90 days for these six HR software companies.” Nothing clean. Nothing academic. That’s exactly the kind of prompt that exposes the gap between benchmark performance and what the model actually does on a Tuesday.
  2. Keep everything identical. Same starting prompt, same depth instruction, no retries on either side. You want a clean signal, not one that flatters your preferred model.
  3. Check for specificity, not volume. Any model can generate a lot of text. A CFO quote from an investor call transcript is the bar. That’s what you’re actually scoring.
  4. Add one formatting instruction. M3’s first drafts read like raw notes: lots of bullet fragments, no clear hierarchy, no obvious entry point for a client. He added a single line telling the model to lead with an exec summary and group findings by theme. After that, the reports were client-ready straight out of the model. That’s the kind of fix you make once, drop into your base template, and never think about again.
  5. Test the multimodal workflow. He dropped screenshots of competitor pricing pages directly into M3. No OCR step, no preprocessing. The model reasoned about them natively. That workflow change alone cut real latency from his process.
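The steps above can be sketched as a small comparison harness. This is a minimal sketch, not the researcher’s actual tooling: the names `Query`, `missed_facts`, and `run_side_by_side` are hypothetical, each model is assumed to be wrapped in a plain function that takes a prompt and returns text, and the checklist of expected facts is assumed to come from facts you have already verified by hand. Substring matching is deliberately crude, so treat a "miss" as a prompt to read the report, not as a verdict.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Query:
    """One real work prompt plus the specific facts a good report must contain."""
    prompt: str
    expected_facts: List[str]  # short phrases you have verified by hand


def missed_facts(output: str, expected_facts: List[str]) -> List[str]:
    """Return the expected facts that do not appear in the output (case-insensitive)."""
    text = output.lower()
    return [fact for fact in expected_facts if fact.lower() not in text]


def run_side_by_side(
    queries: List[Query],
    models: Dict[str, Callable[[str], str]],  # hypothetical wrappers around each model
) -> Dict[str, List[str]]:
    """Send every query to every model exactly once and collect what each one missed."""
    results: Dict[str, List[str]] = {}
    for name, ask in models.items():
        missed: List[str] = []
        for query in queries:
            output = ask(query.prompt)  # same prompt for every model, no retries
            missed.extend(missed_facts(output, query.expected_facts))
        results[name] = missed
    return results
```

To use it, pass your five real prompts as `Query` objects and one wrapper function per model. The model whose list of missed facts is shorter found more of the specifics, which scores specificity rather than volume.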

💡 Tips Worth Keeping

On cost: M3 was meaningfully cheaper per run. If your work is research-heavy, say 70% deep browse like his, the math on switching your main model gets interesting fast. Across a full week of 3 to 5 reports, those per-run differences stack into something worth actually tracking in a spreadsheet.
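A rough way to put that in a spreadsheet-style calculation is sketched below. The source gives no actual prices, so every number in the usage note is a placeholder, and the function names are hypothetical. Costs are kept in integer cents to avoid floating-point rounding noise.

```python
def weekly_cost_cents(reports_per_week: int, runs_per_report: int, cents_per_run: int) -> int:
    """Total weekly spend for one model, in cents."""
    return reports_per_week * runs_per_report * cents_per_run


def weekly_savings_cents(
    reports_per_week: int,
    runs_per_report: int,
    current_cents_per_run: int,
    candidate_cents_per_run: int,
) -> int:
    """Weekly difference between the current model and the candidate (positive = savings)."""
    current = weekly_cost_cents(reports_per_week, runs_per_report, current_cents_per_run)
    candidate = weekly_cost_cents(reports_per_week, runs_per_report, candidate_cents_per_run)
    return current - candidate
```

For example, with placeholder values of 4 reports a week, 10 runs per report, 50 cents per run on the current model and 30 cents on the candidate, `weekly_savings_cents(4, 10, 50, 30)` comes to 800 cents a week. Substitute your own measured per-run costs before drawing conclusions.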

On multimodal: The benchmark number got his attention. The native PDF and screenshot handling is what actually sold him. If you’re regularly reading quarterly slides or pricing pages, that step removal adds up across a week. Fewer steps also means fewer places for the workflow to quietly break.

On prompt templates: The exec summary plus theme grouping instruction is a solid default for any research-heavy model. Worth dropping into your base template regardless of which stack you’re on.
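One way to make that instruction a default is to bake it into a small prompt builder. This is a sketch under assumptions: `BASE_TEMPLATE`, `FORMAT_INSTRUCTION`, and `build_prompt` are hypothetical names, and the instruction wording paraphrases the line described above rather than quoting the researcher’s exact prompt.

```python
# Paraphrase of the formatting line described above (assumed wording).
FORMAT_INSTRUCTION = "Lead with an exec summary and group findings by theme."

# Hypothetical base template: the research query first, then the formatting line.
BASE_TEMPLATE = "{query}\n\n{format_instruction}"


def build_prompt(query: str, format_instruction: str = FORMAT_INSTRUCTION) -> str:
    """Append the formatting instruction to a research query."""
    return BASE_TEMPLATE.format(
        query=query.strip(),
        format_instruction=format_instruction,
    )
```

Because the instruction lives in one constant, the same template can be reused across models, which keeps a side-by-side comparison fair.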

On verification: He checked both datapoints M3 surfaced. Both real. Don’t skip this because the model sounds confident. Confidence is not a citation. The more authoritative the tone, the easier it is to trust without checking. That’s exactly when you should check.

🎯 The Takeaway

Benchmark leads only matter if they survive contact with actual work. This one did, at least for five real queries in one specific use case.

The thread is now asking whether the BrowseComp lead holds up on niche industry verticals versus general web. That’s the right question. More signal will come in over the next few weeks as more people run real workloads through it.

If you’re doing competitive research at any serious volume, this is one to watch. Run your own five queries. Use your messiest prompts. Check whether the model found the specific fact, not whether it generated a lot of words around where the fact might be. The gap between “found it” and “didn’t find it” matters more than any latency or cost metric ever will.

Frequently Asked Questions

Q: Does the 4-point BrowseComp gap actually matter for real research?

It’s one signal, not the whole story. The real question is whether M3 wins on your types of queries, especially when sources disagree. That’s where research workflows typically break down. The OP’s real-work test (2 datapoints Opus missed) is far more convincing than a headline number.

Q: What’s the real advantage of native multimodal?

M3 reads PDFs and screenshots directly, without OCR preprocessing, which cuts latency and removes a workflow step entirely. If your deep research includes competitor pricing pages, dashboards, or slides, that’s meaningful. One commenter working in fintech said this capability alone was worth the switch.

Q: Can M3 actually replace Claude + Perplexity Pro?

Probably, but test it on your own work first. One researcher reports their M3 costs came in well under the old stack, but that depends on how many passes you need before reports are client-ready. The multimodal gain compounds the savings, but it really depends on whether PDFs and screenshots are part of your workflow.

Q: Do M3 reports come out client-ready?

After one prompt tweak (an instruction to lead with an exec summary and group findings by theme), the OP had client-ready output. Another researcher added a report template to the system prompt and got structure immediately. So it’s not magic and your prompt matters, but M3 seems far less note-heavy than early reports suggested.

Q: How does each model handle conflicting sources?

The OP doesn’t test this directly, but a commenter flagged it as the differentiator: “that’s where most research prompts silently fail.” Before switching models, both should be tested on queries where sources disagree, because that’s often where the real advantage hides.

minimax m3 hit 83.5 on browsecomp vs opus 4.7 at 79.3. ran 5 of my actual deep research prompts side by side this week
by u/CauliflowerStatus411 in PromptEngineering
