Sophon’s no-HBM chip bets on on-die DRAM

A design for an AI accelerator that throws out high-bandwidth memory (HBM) entirely is making the rounds, and the numbers attached to it are bold enough to stop you mid-scroll. The chip is called PFG-1 “Sophon,” and according to the writeup, which climbed to 174 points on Hacker News, it packs 330 GB of DRAM directly onto the die. No HBM stacks. No off-chip memory bottleneck. Just weights, gradients, and optimizer state sitting on the same silicon that does the math.

What stands out here is the architecture. Sophon stacks 32 logic tiers and 32 memory tiers in an alternating pattern, built on a 28 nm silicon base with a 2D transition-metal dichalcogenide (TMD) stack on top. In plain terms: instead of shuttling data back and forth between a processor and separate memory chips, each compute tile has its own private vertical connection to the weights it needs. That’s compute-in-memory, and it’s the whole pitch.

The claims

The spec sheet is aggressive. Here’s what the design targets:

  • 2,100 TFLOPS BF16, or 4,200 TFLOPS in FP8 inference mode
  • 330 GB of on-die DRAM, enough to fit an 80B-parameter model plus optimizer state
  • 38.7 tokens per second per watt on 80B FP8 decode, which the writeup pegs at roughly 174x the efficiency of an NVIDIA Rubin or AMD Instinct MI455X at low batch
  • A bill of materials around $8,358 per die
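
As a rough sanity check on the capacity bullet, the sketch below tallies the bytes needed for an 80B-parameter model under a few precision recipes. The recipes and their bytes-per-parameter values are illustrative assumptions, not figures from the writeup, which does not say which optimizer or tensor precisions the 330 GB target assumes. The function name is hypothetical.

```python
# Back-of-envelope footprint for an 80B-parameter model.
# All recipes are assumptions for illustration; the writeup does not
# specify the optimizer or the precision of each tensor.

PARAMS = 80e9        # 80B parameters (from the spec list)
CAPACITY_GB = 330    # on-die DRAM (from the spec list)

# Bytes per parameter for (weights, gradients, optimizer state).
RECIPES = {
    "FP8 weights only (inference)": (1, 0, 0),
    "FP8 weights + FP8 grads + 8-bit Adam (2 states)": (1, 1, 2),
    "BF16 weights + BF16 grads + 8-bit Adam (2 states)": (2, 2, 2),
    "BF16 weights + BF16 grads + FP32 Adam (2 states)": (2, 2, 8),
}

def footprint_gb(weights_b, grads_b, optim_b, params=PARAMS):
    """Total footprint in decimal GB, ignoring activations and KV cache."""
    return params * (weights_b + grads_b + optim_b) / 1e9

for name, recipe in RECIPES.items():
    gb = footprint_gb(*recipe)
    verdict = "fits" if gb <= CAPACITY_GB else "does not fit"
    print(f"{name}: {gb:.0f} GB, {verdict} in {CAPACITY_GB} GB")
```

Under these assumed recipes, a 330 GB budget leaves room for roughly 4 bytes per parameter in total, which would imply low-precision gradients and optimizer state. The writeup's own accounting may differ.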

The efficiency gap comes from one place. At low batch sizes, GPUs like Rubin and the MI455X aren’t limited by raw compute. They’re starved for memory bandwidth, waiting on HBM4 that tops out around 19 to 22 TB/s per accelerator. Sophon’s on-die memory is estimated to deliver 191 to 214x the weight bandwidth of an HBM4 package. When you’re serving one stream at a time, that’s the number that matters, not peak FLOPS.
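
The bandwidth argument can be made concrete with a simple roofline-style bound: at batch size 1, every generated token has to stream the full set of weights from memory once, so tokens per second cannot exceed bandwidth divided by model size in bytes. The sketch below applies that bound to the HBM4 range quoted above and to the claimed multiple. It assumes a dense 80B model in FP8 (1 byte per parameter) and ignores KV-cache traffic and compute time. The function name is hypothetical, and the way the multiples are paired with the baselines is a guess, since the article does not say which multiple goes with which GPU.

```python
# Upper bound on single-stream decode speed when weight streaming dominates.
# Assumptions (not from the writeup): dense model, every weight read once per
# token, 1 byte per parameter (FP8), no KV-cache traffic, no compute limit.

def decode_ceiling_tok_s(bandwidth_tb_s, params=80e9, bytes_per_param=1):
    """Tokens/s ceiling = memory bandwidth / bytes read per token."""
    bytes_per_token = params * bytes_per_param
    return bandwidth_tb_s * 1e12 / bytes_per_token

# (HBM4 baseline in TB/s, claimed on-die multiple). Pairing the larger
# multiple with the smaller baseline is an assumption for illustration.
CASES = [(22, 191), (19, 214)]

for hbm_tb_s, multiple in CASES:
    on_die_tb_s = hbm_tb_s * multiple
    print(f"HBM4 at {hbm_tb_s} TB/s: "
          f"<= {decode_ceiling_tok_s(hbm_tb_s):.1f} tok/s")
    print(f"  {multiple}x of that ({on_die_tb_s} TB/s): "
          f"<= {decode_ceiling_tok_s(on_die_tb_s):.1f} tok/s")
```

Real decode throughput would sit below these ceilings on any hardware, but the bound shows why the bandwidth ratio, rather than peak FLOPS, drives the low-batch efficiency claim.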

Why it matters

Memory is eating the AI hardware budget. The writeup cites a Morgan Stanley estimate that a single NVIDIA VR200 NVL72 rack runs about $7.8M, with HBM alone accounting for $2.0M of that, roughly a quarter of the rack. Cut the HBM line item and the economics shift hard. Sophon claims a hardware BOM nearly 10x lower than a Rubin setup and over 11x lower than the AMD part.

There’s also the train-then-serve angle. Because the on-die DRAM is fully read-write, the same chip handles forward and backward training passes and then turns around to serve inference, without swapping hardware. You could repartition a fleet between training and serving on demand. That flexibility is something the current GPU-plus-HBM world doesn’t offer cleanly.

The catch

Now the reality check. This is a design paper, not a chip you can buy. The 2D-TMD process it leans on, MoS₂ and WSe₂ transistors grown at the back end of the line, is real research but nowhere near volume manufacturing. Monolithic 3D stacking at 32 tiers with sub-100 nm inter-tier vias is the kind of thing that lives in lab demos and conference papers, not fabs shipping at scale. The thermal story, cooling a 22-micron stack pulling up to 749 watts during backward passes, is its own mountain to climb.

So treat the 174x figure as what it is: a projection from an architecture that hasn’t been built. Setting a paper design’s best case against the published specs of shipping silicon is not an apples-to-apples comparison.

Still, the direction is worth watching. The industry has spent two years pouring money into HBM precisely because memory bandwidth is the wall. Designs like Sophon argue the smarter move is to delete the wall, not climb it. Whether 2D-TMD monolithic 3D ever reaches production is an open question, but the pressure pushing engineers toward compute-in-memory is very real and getting stronger every quarter.

If even a fraction of these efficiency claims survive contact with a fab, the economics of training and serving large models look different. For now, it’s a sharp thought experiment about where AI silicon goes after HBM runs out of room. Full technical details, including the die cross-sections and tier-by-tier breakdown, are in the original Hacker News post.
