Build Cheap Local AI: Tesla V100 GPU for Powerful LLMs

Running large AI models locally usually means buying a brutally expensive GPU. One tinkerer found a cheaper path, and according to Hacker News, the whole build cost about £200. The setup: a datacenter GPU with no PCIe edge connector, no display output, and no normal power connector, jammed into a regular gaming PC with an unofficial adapter. The result is 32GB of total VRAM running a 27-billion-parameter model at 32 tokens per second.

This matters because for local LLM inference, memory bandwidth is the bottleneck that decides your tokens per second. A 2017 server card can beat brand-new consumer hardware on that one metric. Here’s how the build came together, step by step.

Quick Start

What you’ll learn: how to add a high-bandwidth datacenter GPU to a consumer PC for local AI, and how to tame the problems that come with it.

What you need:

A PC with a spare PCIe slot and an existing GPU (the build used an RTX 4080)
A Tesla V100 SXM2 16GB (around £150 on eBay)
An SXM2-to-PCIe adapter (around £50)
A 2.54mm male to PH2.0 female jumper cable for fan control
llama.cpp for splitting models across both GPUs

Pick the right datacenter GPU

The star here is a Tesla V100 SXM2 16GB, a Volta GPU with 5120 CUDA cores and 16GB of HBM2 memory. It was built for NVIDIA’s DGX servers, so the SXM2 form factor has no PCIe edge connector, no display output, and no standard power connector. It normally sits on a server baseboard and talks to the other GPUs there over NVLink.

Why it’s worth the hassle: HBM2 memory is a different class. The V100’s 4096-bit bus delivers 900 GB/s of bandwidth. For perspective, the RTX 4080 manages 736 GB/s and Apple’s M5 Max does 614 GB/s, so the V100 beats both. The only consumer card that comfortably beats it is the RTX 5090 at 1,792 GB/s, and that costs over £2,000. The V100 just works with llama.cpp and CUDA.
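To see why bandwidth sets the ceiling: generating each token means reading roughly every model weight from memory once, so tokens per second can't exceed bandwidth divided by model size. Here is a rough sketch of that arithmetic using the bandwidth figures quoted above. The 16 GB model size is an assumption for illustration (about what a 27B model takes at a 4- to 5-bit quantization), not a number from the build, and the estimate ignores compute time, the KV cache, and PCIe transfers, so real throughput lands below it.

```python
# Rough ceiling on decode speed: each generated token reads (roughly) every
# weight once, so tokens/s <= memory bandwidth / model size in bytes.

# Bandwidth figures in GB/s, as quoted in the text above.
BANDWIDTH_GBPS = {
    "Tesla V100 SXM2": 900,
    "RTX 4080": 736,
    "Apple M5 Max": 614,
    "RTX 5090": 1792,
}

# ASSUMPTION: illustrative size of a quantized 27B model, not from the build.
MODEL_SIZE_GB = 16.0


def max_tokens_per_second(bandwidth_gbps: float, model_size_gb: float) -> float:
    """Upper bound on tokens/s if the whole model sat on one device."""
    return bandwidth_gbps / model_size_gb


def max_tokens_per_second_split(parts: list[tuple[float, float]]) -> float:
    """Upper bound when layers are split across devices that run one after another.

    parts: (gigabytes held on the device, device bandwidth in GB/s) pairs.
    """
    seconds_per_token = sum(size_gb / bw for size_gb, bw in parts)
    return 1.0 / seconds_per_token


if __name__ == "__main__":
    for name, bw in BANDWIDTH_GBPS.items():
        ceiling = max_tokens_per_second(bw, MODEL_SIZE_GB)
        print(f"{name}: at most {ceiling:.0f} tokens/s")

    # ASSUMPTION: the model is split evenly between the two cards in the build.
    half = MODEL_SIZE_GB / 2
    both = max_tokens_per_second_split([(half, 736), (half, 900)])
    print(f"RTX 4080 + V100 split: at most {both:.0f} tokens/s")
```

Swap in the file size of your own model to get a quick sanity check on what a card could deliver at best.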

Get the unofficial adapter

The SXM2 socket won’t plug into a motherboard. Someone makes a bare-PCB SXM2-to-PCIe adapter, with the SXM2 socket on one side and a PCIe edge connector on the other. It cost about £50.

Warning: this adapter is not made or supported by NVIDIA or any established vendor. You’re trusting unofficial third-party hardware. But it’s what bridges the server card into a normal slot.

Deal with the fan from hell

The V100 was designed for industrial 2U server cooling, and the adapter’s stock fan is loud. The author measured it at 82 decibels with an Apple Watch, somewhere between a garbage disposal and a lawnmower. Worse, it can’t be controlled through nvidia-smi, Linux tools, or Afterburner. It’s built to run at 100% forever inside a rack where nobody hears it.

Confirm the fan pinout

Before rewiring anything, the author checked the pinout. He guessed it might be standard case-fan wiring on a weird connector, jammed two jumper wires into VCC and ground, and touched a 9V battery to them. The fan spun, and ran much quieter than at its normal 12V. That confirmed the pinout and proved the fan could be tamed.

Wire the fan to your motherboard

Next, test PWM control. He pushed jumper wires into the fan’s tachometer and PWM pins and connected the other ends to a spare motherboard fan header. It worked: the board could read RPM and the fan responded to PWM. With the fan set to 10%, the GPU never went above 50C even at full load, and the fan was barely audible.
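If you slow the fan down, it's worth watching the card's temperature under load to confirm it stays in a range like the one reported here. The fan itself can't be driven from nvidia-smi, but temperature and power can normally be read from it. The sketch below polls those readings; it assumes nvidia-smi is on your PATH and that the driver reports the V100's sensors as it does for other cards. The 5-second interval and the 70C warning threshold are arbitrary choices for illustration, not values from the build.

```python
import subprocess
import time

# Standard nvidia-smi query fields; output is one CSV row per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,temperature.gpu,power.draw",
    "--format=csv,noheader,nounits",
]


def to_float(text):
    """Return a float, or None when the driver reports something like [N/A]."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_gpu_rows(csv_text):
    """Turn nvidia-smi CSV output into a list of dicts, one per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, name, temp_c, power_w = [field.strip() for field in line.split(",")]
        rows.append({
            "index": int(index),
            "name": name,
            "temp_c": to_float(temp_c),
            "power_w": to_float(power_w),
        })
    return rows


def read_gpus():
    """Run nvidia-smi once and return the parsed readings."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_rows(result.stdout)


def watch(interval_s=5.0, warn_above_c=70.0):
    """Print readings until interrupted. Both defaults are illustrative assumptions."""
    while True:
        for gpu in read_gpus():
            too_hot = gpu["temp_c"] is not None and gpu["temp_c"] > warn_above_c
            flag = "  <-- hot" if too_hot else ""
            print(f'GPU {gpu["index"]} {gpu["name"]}: '
                  f'{gpu["temp_c"]} C, {gpu["power_w"]} W{flag}')
        time.sleep(interval_s)
```

Call watch() in one terminal while a model runs in another, and stop it with Ctrl+C.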

Swap the jumper wires for a proper cable

The fan connector is a small JST PH2.0 plug with four pins (2.0mm pitch). Motherboard headers use a standard 0.1 inch (2.54mm) pitch. The fix is a 2.54mm male to PH2.0 female jumper cable. The female PH2.0 end plugs into the fan’s tachometer and PWM pins, and the male 2.54mm end goes into a spare fan header. That took the noise from 82dB ear damage to something livable.

Slot it in and split the model

With the fan handled, the V100 slots in alongside the existing GPU:

RTX 4080: 16GB VRAM, Ada architecture
Tesla V100: 16GB VRAM, Volta architecture
Total: 32GB VRAM across two GPUs

llama.cpp splits the model across both cards: its tensor-split setting decides how much of the model each GPU holds, and data passes between the cards over the PCIe bus as each token is generated. It’s not as fast as a single 32GB GPU, but it works at roughly 10% of the cost. The author notes the V100 never pulled more than about 150W.
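For a sense of what that looks like in practice, here is a small sketch that builds a llama.cpp command line with the split proportional to each card's VRAM. The source doesn't give the author's exact command, so treat this as one plausible invocation: the model path is a hypothetical placeholder, and the proportions follow whatever order CUDA lists your devices in, which may differ between systems.

```python
import shlex


def tensor_split_arg(vram_gb):
    """Format per-GPU proportions as llama.cpp expects them, e.g. '16,16'."""
    return ",".join(f"{v:g}" for v in vram_gb)


def build_command(model_path, vram_gb, binary="llama-server"):
    """Build an argument list that offloads all layers and splits them by VRAM."""
    return [
        binary,
        "-m", model_path,
        "--n-gpu-layers", "99",      # a number above the layer count offloads everything
        "--split-mode", "layer",     # whole layers per GPU
        "--tensor-split", tensor_split_arg(vram_gb),
    ]


if __name__ == "__main__":
    # ASSUMPTION: hypothetical model file; 16 GB on each card as in the build.
    cmd = build_command("models/your-27b-model.gguf", [16, 16])
    print(shlex.join(cmd))
```

If one card also drives your display, giving it a slightly smaller share leaves headroom for the desktop.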

What stands out

This isn’t the same experience as a £2,000 card, and the author is clear about that. But the combined 32GB of VRAM matches what that money buys, the bandwidth is real, and the price is a fraction. The V100 also comes in a 32GB variant, so there’s room to go bigger.

For anyone running local models on a budget, the lesson is that secondhand server hardware can punch far above its price, if you’re willing to do some wiring. Just go in knowing the adapter and fan mods are unofficial, and that consumer software support on Windows is rough.

Next steps

Price out a V100 32GB variant for a single-card option, confirm your PSU and case clearance before buying, and test your target model in llama.cpp to see real tokens-per-second before committing. Full build details and the fan-taming videos are at the original source.
