Yesterday a developer dropped GSST on GitHub. Gradient-Sliced Sequential Training. It lets you train language models from 200M to 7B parameters on a regular gaming GPU with 8GB VRAM.
The twist: it never actually loads the full model into GPU memory.
Standard LLM training keeps everything in VRAM at once: weights, gradients, and optimizer state. That’s why you need cloud GPUs or a $10k A100 to do anything real. GSST flips it. Layer-by-layer processing, master weights on disk, gradients on disk, only one slice in VRAM at a time. Yes, it runs 5 to 10x slower. But it runs on hardware sitting on your desk right now.
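To make the idea concrete, here is a toy sketch of layer-by-layer training with the weights living on disk. It is not GSST's code. Every name in it (`save_slice`, `load_slice`, `train_step`) is hypothetical, the "layers" are plain elementwise scale vectors, and it applies each gradient immediately instead of staging gradients on disk the way the post describes. It uses only the Python standard library, so "memory" here stands in for VRAM.

```python
import tempfile
from array import array
from pathlib import Path


def save_slice(path, values):
    """Write one layer's weights to disk as raw float64 values."""
    with open(path, "wb") as f:
        array("d", values).tofile(f)


def load_slice(path, n):
    """Read one layer's n weights back from disk."""
    buf = array("d")
    with open(path, "rb") as f:
        buf.fromfile(f, n)
    return list(buf)


def train_step(layer_paths, x, target, lr):
    """One training step that holds a single layer's weights in memory at a time."""
    n = len(x)
    # Forward pass: stream layers from disk, keep only the activations.
    acts = [x]
    for path in layer_paths:
        w = load_slice(path, n)
        acts.append([wi * hi for wi, hi in zip(w, acts[-1])])
    out = acts[-1]
    loss = 0.5 * sum((o - t) ** 2 for o, t in zip(out, target))
    # Backward pass: revisit layers in reverse, update each one, write it back.
    grad_h = [o - t for o, t in zip(out, target)]
    for i in reversed(range(len(layer_paths))):
        w = load_slice(layer_paths[i], n)
        grad_w = [g * h for g, h in zip(grad_h, acts[i])]
        grad_h = [g * wi for g, wi in zip(grad_h, w)]
        save_slice(layer_paths[i], [wi - lr * gw for wi, gw in zip(w, grad_w)])
    return loss


def demo():
    """Train a 3-layer toy model whose weights never sit in memory together."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = [Path(tmp) / f"layer_{i}.bin" for i in range(3)]
        for p in paths:
            save_slice(p, [1.0, 1.0])
        x, target = [1.0, 2.0], [0.5, 1.0]
        return [train_step(paths, x, target, lr=0.05) for _ in range(20)]


losses = demo()
```

Every step reads each layer from disk twice and writes it once. That traffic is the price of the small memory footprint, and it is the kind of overhead behind the slowdown the post mentions.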
How to get started with GSST
- 📥 Clone the repo: github.com/snubroot/gsst
- Check your hardware: 4GB+ VRAM minimum, NVMe SSD strongly recommended
- Point GSST at your model and it auto-slices based on available VRAM
- 🖥️ Watch the built-in real-time monitor while training runs
- Use checkpoint/resume if the run gets interrupted
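The post does not say how the auto-slicing step works. One plausible approach is a greedy packer that groups consecutive layers until the next one would overflow a memory budget. The sketch below assumes exactly that. The function name `plan_slices` and its parameters are hypothetical, not GSST's API.

```python
def plan_slices(layer_bytes, vram_budget_bytes):
    """Greedily pack consecutive layers into slices that each fit the budget.

    layer_bytes: size of each layer in bytes, in model order.
    vram_budget_bytes: memory available for one slice (an assumed input).
    Returns a list of slices, each a list of layer indices.
    """
    slices, current, used = [], [], 0
    for idx, size in enumerate(layer_bytes):
        if size > vram_budget_bytes:
            raise ValueError(f"layer {idx} ({size} B) does not fit the budget")
        if used + size > vram_budget_bytes:
            # Close the current slice and start a new one with this layer.
            slices.append(current)
            current, used = [], 0
        current.append(idx)
        used += size
    if current:
        slices.append(current)
    return slices


# Five equal layers of 3 units each, with room for 7 units at a time.
plan = plan_slices([3, 3, 3, 3, 3], 7)
```

A real planner would also need headroom for activations and optimizer state, so treat the budget as "what is left for weights", not the card's full VRAM.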
Pro tip: Use BF16 precision if your GPU supports it. More stable than FP16 and marginally faster through the layers.
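The post does not spell out why BF16 is more stable. The usual explanation is range: BF16 keeps the 8 exponent bits of FP32, while FP16 has only 5 and overflows above 65504. The sketch below illustrates this with the standard library alone. The helper names are made up for this example, and the BF16 conversion truncates where real hardware typically rounds.

```python
import struct


def to_bf16(x):
    """Round-trip x through bfloat16 by keeping the top 16 bits of its float32 form."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def to_fp16(x):
    """Round-trip x through IEEE half precision; treat overflow as infinity."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf")


big = 70000.0   # larger than FP16's maximum of 65504
tiny = 1e-8     # smaller than FP16's smallest subnormal

bf16_big, fp16_big = to_bf16(big), to_fp16(big)
bf16_tiny, fp16_tiny = to_bf16(tiny), to_fp16(tiny)
```

BF16 holds both values with reduced precision, while FP16 turns the large one into infinity and flushes the small one to zero. Those are the failure modes that make FP16 training need loss scaling.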
Pro tip 2: Disk I/O is the bottleneck, not the GPU. A slow drive turns “5 to 10x slower” into “not worth your time.” NVMe is not optional here.
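A back-of-envelope estimate shows why the drive matters. The sketch below assumes every weight is read from disk a fixed number of times per training step and ignores gradient writes, caching, and overlap with compute. The function name, the two-passes-per-step figure, and the drive speeds are illustrative round numbers, not measurements from GSST.

```python
def io_seconds_per_step(n_params, bytes_per_param, passes, gb_per_s):
    """Estimate seconds of disk traffic per training step.

    passes: how many times each weight crosses the disk per step (assumed).
    gb_per_s: sustained drive throughput in GB/s (assumed).
    """
    total_bytes = n_params * bytes_per_param * passes
    return total_bytes / (gb_per_s * 1e9)


# 7B parameters in BF16 (2 bytes each), read once forward and once backward.
nvme = io_seconds_per_step(7e9, 2, 2, 3.5)   # assumed NVMe throughput
sata = io_seconds_per_step(7e9, 2, 2, 0.5)   # assumed SATA SSD throughput
```

Under these assumptions the NVMe drive spends about 8 seconds per step on I/O and the SATA SSD about 56 seconds, a 7x gap from the drive alone.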
Not for production. Built for research and prototyping on your own machine, without the cloud bill. If you have been waiting for the floor to drop on LLM training, it just did. 🚀
I built a framework to train LLMs on consumer GPUs (200M-7B models on 8GB VRAM)
by u/snubroot in PromptEngineering