NanoEuler: GPT-2 Class LLM Built From Scratch in C/CUDA

A developer has released NanoEuler, a GPT-2-class language model written entirely from scratch in C and CUDA, with no PyTorch, no autograd, and no machine learning libraries of any kind. According to the project’s Show HN post on Hacker News, which drew 167 points, every piece of the pipeline is hand-built: the tokenizer, the forward and backward passes, the CUDA kernels, and the full training loop. What stands out is the ambition. This isn’t a wrapper around existing tools; it’s an attempt to own every parameter, every gradient, and every line of the engine.

The headline model has around 116 million parameters and trains on a single RTX 4070, a consumer GPU rather than a data center cluster. According to the post, the project also ships a tiny CPU showcase model (about 1 million parameters) that runs on plain libm and OpenMP, so you can poke at the whole thing without a GPU at all.

What’s under the hood

NanoEuler uses the same building blocks you’ll find in current frontier models, just implemented by hand:

Decoder-only transformer with pre-norm RMSNorm and no bias terms anywhere
Rotary position embeddings (RoPE) applied to queries and keys
SwiGLU feed-forward layers
Grouped-query attention (GQA), where query heads share a smaller set of key/value heads
Multi-token prediction (MTP), where extra output heads predict the next K tokens to sharpen the learned representation and enable speculative decoding
A hand-written byte-level BPE tokenizer with GPT-2-style pretokenization and a 4,096-token vocabulary
A hand-written FlashAttention kernel and cuBLAS matmuls on the GPU side

The name is a math joke with a real point behind it. A residual connection, x = x + f(x), has the same form as one step of the forward-Euler method for solving an ordinary differential equation dx/dt = f(x), taken with step size 1. Read that way, a deep residual network is a discretized ODE in which depth plays the role of integration time. The project tips its hat to Leonhard Euler, who gave us that integration method.
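
The correspondence can be checked in a few lines of C. The sketch below is illustrative only and is not from the NanoEuler source: toy_layer is a hypothetical stand-in for a layer, and it reuses the same function at every depth, whereas a real transformer has different weights in each layer (which corresponds to a right-hand side that also depends on time).

```c
#include <assert.h>
#include <stddef.h>

typedef double (*layer_fn)(double x);

/* One forward-Euler step for dx/dt = f(x): x_next = x + h * f(x). */
double euler_step(double x, double h, layer_fn f) {
    return x + h * f(x);
}

/* A residual stack: apply x = x + f(x) once per layer. */
double residual_stack(double x, size_t depth, layer_fn f) {
    for (size_t layer = 0; layer < depth; layer++) x = x + f(x);
    return x;
}

/* Forward-Euler integration of dx/dt = f(x) for `steps` steps of size h. */
double euler_integrate(double x, size_t steps, double h, layer_fn f) {
    assert(h > 0.0);
    for (size_t s = 0; s < steps; s++) x = euler_step(x, h, f);
    return x;
}

/* Hypothetical toy "layer": the right-hand side of dx/dt = -0.5 * x. */
double toy_layer(double x) {
    return -0.5 * x;
}
```

Calling residual_stack(1.0, 3, toy_layer) and euler_integrate(1.0, 3, 1.0, toy_layer) runs the same arithmetic: three residual layers are three Euler steps of size 1.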

Why the engineering is the product

The most useful part of NanoEuler isn’t the chatbot. It’s the verification. Hand-written backpropagation is easy to get subtly wrong, so every analytic gradient is checked against a central finite difference in double precision. Run make check and it validates the backward pass for each parameter tensor, including the trickier ones for RoPE, SwiGLU, GQA, and MTP. The reported max relative error is about 1e-4, well inside the passing threshold. There are no external dependencies, and it’s tested with gcc 13 on Linux.

The commands are refreshingly simple. Running make builds the training binary, ./nanoeuler train trains the small model, ./nanoeuler train big trains the larger GPU model, and ./nanoeuler chat opens a REPL where you type a prompt and watch the model continue it.

The honesty section matters

The author is blunt about what this is and isn’t. NanoEuler is a research and educational artifact, built in public. At 116M parameters trained on a single GPU, it produces what the project calls “fluent-ish English” with no real world knowledge. The fine-tuned chat model answers in assistant form, but its content is shallow. As the author puts it, the chat model “demonstrates that the pretrain to SFT pipeline works end to end, it is not a useful chatbot.”

That framing is the right one. A genuinely capable assistant needs orders of magnitude more parameters, data, and compute. The author notes that even a 135M model only becomes a basic assistant after roughly 600 billion training tokens, far beyond what one GPU and a small corpus can deliver. Supervised fine-tuning is done, with RLHF and DPO listed as planned.
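
A back-of-envelope calculation shows the scale of that gap. The sketch below uses the common rule of thumb that training costs roughly 6 floating-point operations per parameter per token. That rule, the function names, and any throughput figure plugged in are assumptions for illustration, not numbers from the project.

```c
#include <assert.h>

/* Rule-of-thumb training cost: about 6 FLOPs per parameter per token
 * (forward plus backward pass). An approximation, not an exact count. */
double training_flops(double n_params, double n_tokens) {
    assert(n_params > 0.0 && n_tokens > 0.0);
    return 6.0 * n_params * n_tokens;
}

/* Wall-clock days needed at a given sustained throughput in FLOP/s. */
double training_days(double total_flops, double sustained_flops_per_sec) {
    assert(sustained_flops_per_sec > 0.0);
    return total_flops / sustained_flops_per_sec / 86400.0;
}
```

For 135 million parameters and 600 billion tokens, training_flops gives about 4.86e20 FLOPs. At a purely hypothetical sustained 1e13 FLOP/s (an assumed round number, not a measured figure for the RTX 4070), training_days puts that at roughly 560 days of continuous training.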

Why it matters

Most people learning how large language models work never see past the abstraction layer. NanoEuler strips it away. If you want to understand how a modern decoder-only transformer actually trains, gradient by gradient, this is a complete and readable map of the territory. It won’t replace your assistant, and it doesn’t try to. It tries to be understandable, and that’s a rarer thing.

You can find the full repo, sample outputs, and the gradient-check details in the original Hacker News post.
