This is the number that made me stop: 83.5% on BrowseComp. Claude Opus 4.7 scores 79.3 on the same benchmark. And M3 is open weights.
BrowseComp is not a toy benchmark. It measures whether a model can autonomously navigate the web, locate specific information across multiple pages, and return accurate answers without hand-holding. It rewards the same skills that matter in real agentic workflows: knowing when to keep searching, how to synthesize across sources, and how to avoid confident wrong answers. A 4-point gap on that benchmark is not a rounding error.
Someone spent the weekend wiring Minimax M3 into Cursor to see what it actually does in practice. Here’s the breakdown.
What M3 Actually Is
Three things almost never show up together in an open weights model: frontier-level coding, million-token context, and native multimodality. Most open releases pick two of those three. M3 has all three.
The 1M context isn’t a RAG trick layered on a smaller window. It’s native to pretraining, using an MSA architecture with a guaranteed minimum of 512K. That matters for Composer work where you want the whole repo in context, not a chunked approximation of it. The practical difference: when you load a 300K-token codebase, a model with a chunked retrieval layer is working from a compressed summary. M3 is working from the actual files. That distinction shows up in cross-file reasoning, dependency tracking, and anything where the answer lives at the intersection of two modules that are far apart in the file tree.
Other numbers: SWE-Bench Pro 59.0%, Terminal-Bench 2.1 at 66.0%, MCP Atlas 74.2%. Strong across the board. But the case study in the launch report is the most concrete long-horizon signal I’ve seen from a frontier model recently.
M3 ran for 12 hours straight, produced 18 commits and 23 figures, and reproduced an ICLR 2025 outstanding paper end to end. Multimodal parsing handled the charts. Long context held the paper, the code, and the experiment logs at the same time. The agent drove the loop the whole way through. No human stepped in to re-orient the context, re-seed the task, or fix a derailed state. It ran, and it finished. That kind of sustained coherence over a long autonomous session is what separates benchmark performance from actual workflow utility.
3 Ways to Use This Right Now
- 🔹 Full-repo Cursor sessions. If Composer work keeps running into context limits, M3’s native 1M window is the actual fix. No chunking. No RAG approximation. The whole codebase in scope. For monorepos or projects with heavy cross-file dependencies, this means fewer “I don’t have that file in context” failures and more accurate refactors that account for downstream effects.
- 🔹 Long-horizon agentic tasks. Holding a paper, a codebase, and experiment logs simultaneously in one session changes what you can ask a model to do without handholding. The 12-hour autonomous run is a signal worth taking seriously. Think multi-step research pipelines, large-scale migrations, or any task where you’d normally break work into chunks just to fit it into a context window.
- 🔹 Drop-in for Anthropic-style setups. Anything already wired for Anthropic endpoints routes to M3 without a custom client. The switching cost to run a test is basically zero. If you have an existing agent workflow, you can test M3 on a real task in an afternoon without rebuilding any infrastructure.
Tips and Pitfalls
The 512K floor matters more than the 1M ceiling. Day to day, the guaranteed minimum is what you actually rely on. For repos pushing 200-400K tokens, that’s the difference between fitting and not fitting. The ceiling is a headline number. The floor is what determines whether your specific use case works reliably.
Benchmark numbers are ceilings, not floors. 59% on SWE-Bench is strong. But benchmark performance and real-world consistency on your specific codebase are different things. Run your own test on a real task before committing to a workflow change. Pick something you’ve done before with another model so you have a concrete comparison, not just a vague impression.
Multimodality is underrated in the benchmarks. The ICLR paper case study is where it shows up clearly. When an agent needs to parse charts, diagrams, or visual outputs alongside code and text, a model that handles all of those natively in one pass is meaningfully different from one that requires preprocessing or format conversion steps. If your workflows touch anything visual, that’s worth testing explicitly.
BrowseComp is the sleeper stat. Beating Opus 4.7 on autonomous browsing by nearly 4 points is the kind of gap that compounds over long agentic runs. If you’re building anything agent-driven, that number is worth paying attention to.
Worth Trying This Week
Wire M3 into Cursor or any Anthropic-compatible setup. Give it a task that would normally require you to manage context manually. See what the native 1M window changes about your workflow.
A concrete starting point: pick a refactor you’ve been avoiding because it touches too many files to hold in context at once. That’s the scenario where the difference between a chunked approximation and a native million-token window shows up most clearly.
The 12-hour paper reproduction is impressive. What’s more interesting is what that capability unlocks for tasks that aren’t papers.
Minimax M3 in Cursor this weekend
by u/Pretend-Waltz5888 in PromptEngineering