Lossless KV-cache Compression: 16.55x VRAM Savings

A new project called Grinder12 is making waves on Hacker News with a bold claim: lossless KV-cache compression down to 0.96 bits per value, delivering 16.55x VRAM savings during streaming inference. The post climbed to 160 points, putting KV-cache compression back at the center of practitioner conversation. If the numbers hold up under wider scrutiny, this is a meaningful shift for anyone running long-context models on constrained hardware.

What stands out here is the word lossless. Most VRAM-saving tricks for LLM inference accept some quality hit. Grinder12, as detailed in the Hacker News discussion, claims you can keep model outputs bit-for-bit identical while shrinking the cache more than 16x.

What’s a KV-cache and why does this matter

When a transformer generates text, it stores the attention keys and values for every token it has already processed. That stored tensor is the KV-cache. It grows linearly with context length and with the number of concurrent requests, so at long contexts or high concurrency it can rival or exceed the VRAM taken by the model weights themselves.

This is one of the biggest reasons long-context inference is expensive. A 70B model might fit on one GPU at short contexts, then spill across two or four cards once you stretch to 100K tokens, largely because of cache growth. Cut the cache 16x, and the math changes.
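To make the growth concrete, here is a minimal sizing sketch. The layer count, KV-head count, and head dimension are illustrative assumptions for a 70B-class model with grouped-query attention. They are not taken from Grinder12 or from the Hacker News thread.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Approximate KV-cache size in bytes for a decoder-only transformer."""
    # Factor of 2: one tensor for keys and one for values per layer.
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return values * bytes_per_value

# Hypothetical 70B-class configuration (assumed values, for illustration only).
cfg = dict(num_layers=80, num_kv_heads=8, head_dim=128)

for seq_len in (4_096, 32_768, 100_000):
    gib = kv_cache_bytes(seq_len=seq_len, **cfg) / 2**30
    print(f"{seq_len:>7} tokens: {gib:6.2f} GiB of FP16 cache per sequence")
```

Under these assumptions the cache costs 320 KiB per token, so the total scales directly with context length and again with batch size.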

The headline numbers

From the Grinder12 framing reported on Hacker News:

Bit width: 0.96 bits per cached value (down from the typical FP16 at 16 bits)
VRAM reduction: 16.55x
Quality loss: Zero, with lossless reconstruction claimed
Mode: Streaming, meaning it works during ongoing generation rather than requiring batch preprocessing

For context, common KV-cache quantization schemes today land at INT8 (about 2x savings relative to FP16) or INT4 (about 4x savings). Both are lossy, and the more aggressive settings can introduce measurable quality drift on harder benchmarks. A lossless 16x compression is in a different category.
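The ratios above follow from simple arithmetic against a 16-bit baseline, which the short check below reproduces. It also shows that 0.96 bits per value would imply a ratio slightly above the reported 16.55x. The source does not explain the gap. Rounding or per-block metadata overhead are plausible causes, but that is an assumption.

```python
FP16_BITS = 16

def compression_ratio(bits_per_value, baseline_bits=FP16_BITS):
    """Savings factor relative to the baseline storage width."""
    return baseline_bits / bits_per_value

def effective_bits(ratio, baseline_bits=FP16_BITS):
    """Bits per value implied by a reported savings factor."""
    return baseline_bits / ratio

for label, bits in (("INT8", 8), ("INT4", 4), ("claimed", 0.96)):
    print(f"{label:>7}: {bits:5.2f} bits -> {compression_ratio(bits):.2f}x")

# Work backwards from the reported 16.55x to the bit width it implies.
print(f"16.55x implies {effective_bits(16.55):.3f} bits per value")
```

Real quantization schemes also store scales and zero points per group, so their practical ratios land a little below the ideal 2x and 4x.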

How something like this is even possible

Lossless compression of high-dimensional tensors usually leans on entropy coding plus structural priors. KV-cache values aren’t uniformly distributed. They cluster, repeat patterns across heads, and have predictable dynamic ranges per layer. A scheme that exploits all three can push average bits per value below 1 without losing information, similar to how text compresses better than random bytes.
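A toy illustration of why skew matters: Shannon entropy gives the lower bound on average bits per symbol for a lossless code that treats symbols independently. The streams below are made up for illustration and say nothing about real KV-cache statistics or about how Grinder12 works.

```python
import math
from collections import Counter

def shannon_entropy_bits(symbols):
    """Empirical entropy in bits per symbol (order-0, no context modelling)."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Made-up streams: one dominated by a single value, one uniform over four values.
skewed = [0] * 900 + [1] * 50 + [2] * 50
uniform = list(range(4)) * 250

print(f"skewed:  {shannon_entropy_bits(skewed):.3f} bits per symbol")
print(f"uniform: {shannon_entropy_bits(uniform):.3f} bits per symbol")
```

The skewed stream comes in under 1 bit per symbol, while the uniform one needs a full 2 bits. Reaching a sub-1-bit average in practice takes a coder that spreads bits across symbols, such as arithmetic coding, because a per-symbol prefix code like Huffman spends at least 1 bit on every symbol. FP16 values have 65,536 possible bit patterns, so an average below 1 bit would require the cache to be extremely predictable.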

The streaming part is the harder engineering problem. You need to encode and decode on the fly, fast enough to not bottleneck token generation on the GPU. That’s where most academic compression ideas die in practice.

Practical implications for builders

If this approach proves out in production:

Long-context inference gets cheaper. Running a 200K-token context on a single consumer GPU becomes plausible for models that previously needed datacenter hardware.
Local LLM deployment opens up. Power users running 70B-class models at home on 24GB cards could see real headroom.
Batch sizes go up. Less cache per request means more concurrent requests on the same hardware, which is the metric that actually moves serving costs.
Agent workflows benefit most. Multi-turn agent loops accumulate cache fast. Compression here directly translates to longer agent sessions before VRAM pressure forces a reset.
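The batch-size point can be sketched as a capacity estimate. The 40 GiB cache budget, the 32K-token requests, and the per-token value count (2 x 80 layers x 8 KV heads x 128 dimensions) are all hypothetical. The sketch also ignores any working memory needed to decode the cache before attention, which the source does not quantify.

```python
def max_concurrent_requests(budget_gib, tokens_per_request,
                            values_per_token, bits_per_value):
    """How many requests' caches fit in a fixed VRAM budget (idealized)."""
    bytes_per_request = tokens_per_request * values_per_token * bits_per_value / 8
    return int(budget_gib * 2**30 // bytes_per_request)

# Hypothetical 70B-class model: keys and values, 80 layers, 8 KV heads, dim 128.
VALUES_PER_TOKEN = 2 * 80 * 8 * 128

for label, bits in (("FP16", 16), ("INT4", 4), ("claimed 0.96-bit", 0.96)):
    n = max_concurrent_requests(40, 32_768, VALUES_PER_TOKEN, bits)
    print(f"{label:>16}: {n} concurrent 32K-token requests in 40 GiB")
```

This is an upper bound under stated assumptions. Encode and decode latency, which the caveats below flag, would decide how much of that headroom is usable.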

The caveats worth flagging

A Hacker News post is a starting point, not a peer-reviewed result. Things to watch:

Real-world latency overhead from the encode/decode loop
Whether the 16x figure holds across model architectures or only specific ones
Hardware compatibility, since some compression schemes lean on CUDA-specific kernels that don’t port to AMD or Apple Silicon
Independent reproduction on standard benchmarks

The community reaction so far is mixed but engaged, which is the right signal for an extraordinary claim. Lossless compression at sub-1 bit is the kind of result that either changes the economics of inference or quietly walks back its numbers under scrutiny. Worth tracking either way.

Full technical details are at the original Hacker News thread.
