AMD MI355X Challenges NVIDIA: Low-Cost LLM Inference Win

AMD just narrowed a lead that NVIDIA has held for years. According to a technical writeup surfaced on Hacker News, the inference startup Wafer got GLM5.2 running on AMD’s Instinct MI355X at 2,626 tokens per second per node, reaching about 80% of the throughput they measured on an NVIDIA B200 at less than half the serving cost. For anyone paying NVIDIA’s inference tax right now, that’s a number worth studying.

The backdrop matters. Inference demand is outrunning supply, frontier models are shipping almost weekly, and Blackwell GPUs are scarce and pricey. AMD’s MI355X is about 2.75x cheaper per GPU than NVIDIA’s B300 at comparable specs, according to the Hacker News post. The catch has always been software. NVIDIA gets day-zero support for new models; AMD’s ROCm stack usually plays catch-up, and getting state-of-the-art performance out of the box is rare. What stands out here is how Wafer narrowed that gap without writing a single custom kernel.

📊 The benchmark numbers

Wafer tested a realistic workload: 20,000 input tokens, 1,000 output tokens, 60% cache hit rate. Throughput scaled cleanly as they pushed requests per second (RPS):

0.5 RPS: 449 tok/s/node, sub-second latency, 100% success
1.5 RPS: 1,913 tok/s/node, 100% success
2.0 RPS: 1,944 tok/s/node, 100% success
2.4 RPS (saturation): 2,626 tok/s/node, 100% success

They also clocked 213 tok/s single-stream on GLM5.2 following Artificial Analysis standards. That doesn’t top the leaderboard, but it wins on performance per dollar, which is the metric that actually shows up on your cloud bill.

🔧 How they did it

The methodology is a useful playbook for anyone running open models on AMD:

Quantization: They converted GLM5.2 from bf16 to MXFP4 using AMD’s Quark. Compared to z-ai’s official FP8 version, the MXFP4 build was effectively lossless on GSM8K, GPQA-Diamond, and tau2. One eval even ticked up.
Framework choice: They picked sglang over vLLM and ATOM. vLLM had no working MXFP4 path, and ATOM degraded at long context.
Speculative decoding: The ROCm image didn’t support it out of the box. Two small fixes, a naming mismatch on a shared expert weight and a missing ROCm guard on a CUDA include, unblocked it and delivered close to a 3x single-stream gain.
Prefill tuning: The workload was prefill-bound, so they switched from TP8 to TP4xDP2 and hand-tuned the MoE kernel selection, which sglang had silently left on a slow fallback path.
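
For readers new to the format, here is a minimal pure-Python sketch of what MXFP4 quantization does to a group of weights, following the OCP Microscaling layout as commonly described: 32 values share one power-of-two scale, and each value is stored as a 4-bit E2M1 float. This is an illustration only, not Quark's API or Wafer's pipeline. The names `quantize_block` and `fake_quantize` are hypothetical, and the rounding is plain nearest-value, whereas a production implementation would likely use round-to-nearest-even.

```python
import math

# Magnitudes an FP4 E2M1 element can represent (the sign is a separate bit).
FP4_E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32  # number of elements that share one scale in MXFP4
ELEM_EMAX = 2    # largest E2M1 exponent: 6.0 = 1.5 * 2**2


def quantize_block(block):
    """Fake-quantize one block; returns (shared_exponent, dequantized_values)."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 0, [0.0] * len(block)
    # frexp gives max_abs = m * 2**e with 0.5 <= m < 1, so floor(log2) = e - 1.
    floor_log2 = math.frexp(max_abs)[1] - 1
    shared_exp = floor_log2 - ELEM_EMAX
    scale = 2.0 ** shared_exp
    dequantized = []
    for x in block:
        # Scaled magnitudes fall in [0, 8); anything above 6 clamps to 6.
        mag = min(abs(x) / scale, FP4_E2M1_GRID[-1])
        nearest = min(FP4_E2M1_GRID, key=lambda g: abs(g - mag))
        dequantized.append(math.copysign(nearest * scale, x))
    return shared_exp, dequantized


def fake_quantize(values):
    """Apply block quantization to a flat list, BLOCK_SIZE elements at a time."""
    out = []
    for start in range(0, len(values), BLOCK_SIZE):
        _, deq = quantize_block(values[start:start + BLOCK_SIZE])
        out.extend(deq)
    return out


if __name__ == "__main__":
    shared_exp, deq = quantize_block([0.3, -1.0, 0.07, 0.9])
    print(shared_exp, deq)
```

Because every block carries its own scale, one outlier only coarsens the 32 values around it rather than the whole tensor, which is one plausible reason a 4-bit build can stay close to FP8 on evals.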

💡 Why it matters for practitioners

The strategic takeaway is that AMD’s software gap is now an engineering problem, not a hardware ceiling. The bugs Wafer hit were configuration and naming issues, not deep architectural limits. As coding agents get better at kernel and model optimization, the time to bring a new model up on ROCm keeps shrinking. That changes the math for anyone serving high inference volume.

If you’re running open-weight models at scale, the practical move is clear: benchmark AMD capacity on a performance-per-dollar basis, not just raw peak throughput. A chip that hits 80% of Blackwell’s speed at less than half the cost can win decisively on total spend, especially for single-node deployments where most real workloads still live.
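
To make the performance-per-dollar comparison concrete, here is a small sketch using only figures quoted in this article: 2,626 tok/s/node on the MI355X, 3,192 tok/s/node on the B200, and the roughly 2.75x price gap. One loud assumption: the 2.75x figure is quoted per GPU against the B300, so applying it to a B200 node comparison is an approximation, not something the writeup states. The function name `relative_cost_per_token` is illustrative.

```python
def relative_cost_per_token(throughput_ratio, price_ratio):
    """Cost per token of chip A relative to chip B.

    throughput_ratio: A's tokens/s divided by B's tokens/s.
    price_ratio: B's price divided by A's price (how many times cheaper A is).
    """
    return 1.0 / (throughput_ratio * price_ratio)


MI355X_TOK_S = 2626   # saturation throughput per node, from the article
B200_TOK_S = 3192     # B200 throughput per node, from the article
PRICE_RATIO = 2.75    # ASSUMPTION: the article's B300 per-GPU ratio, reused here

throughput_ratio = MI355X_TOK_S / B200_TOK_S
cost_ratio = relative_cost_per_token(throughput_ratio, PRICE_RATIO)

print(f"throughput ratio: {throughput_ratio:.2f}")
print(f"relative cost per token: {cost_ratio:.2f}")
```

Under that assumption the throughput ratio comes out to about 0.82 and the relative cost per token to about 0.44, which lines up with the "80% of the performance at less than half the cost" framing. Swap in your own hourly prices and measured throughput before drawing conclusions.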

⚠️ The limitations

Wafer is upfront about the boundaries. This study only covers single-node performance, so multi-node scaling is untested here. AMD’s raw numbers still trailed Blackwell, which hit 3,192 tok/s/node at 3.0 RPS on the same test. And the whole effort still required real engineering to route around framework bugs that NVIDIA users rarely see. AMD isn’t matching NVIDIA on ease. It’s matching it on value.

The direction of travel is the story. Every time optimization gets cheaper and more automated, AMD’s cost advantage gets easier to unlock. Full technical details, including the eval tables and config flags, are available at the original source.
