Learnable Activation Just Beat ReLU on 16 Benchmarks. Here’s the Code.

Yesterday a solo researcher from Venezuela published something that should be on every ML practitioner’s radar. It’s called Genal Activation, a learnable activation function reported to outperform ReLU, GELU, and Swish across 16 separate benchmarks spanning image classification, medical diagnosis, reinforcement learning, and physics simulation. Not one benchmark. Sixteen, across four domains. That’s not a cherry-picked win. That’s a pattern.

The twist: the activation function itself learns its own shape during training.

Most activations are static. ReLU is a ramp. GELU is a smooth curve. You pick one and hope it fits your task. The problem is that no single fixed shape is optimal across every loss landscape. A function that works beautifully on image classification might leave performance on the table for a regression problem or a physics-informed network. Genal flips that assumption entirely. Its shape parameter k is trainable, meaning it adapts to whatever problem you throw at it. Same math, different behavior per task. The network stops being constrained by your initial guess about which nonlinearity fits best.

The formula: Genal(x) = x · sigmoid(x/k), where k = softplus(θ) + ε. The trainable θ is what separates this from SiLU, the common fixed form of Swish, which corresponds to k = 1. (The original Swish formulation also allows a trainable β inside the sigmoid, which plays the role of 1/k here.) Small change, surprisingly big results. The softplus wrapper keeps k strictly positive throughout training so the function never degenerates, and the ε term keeps k away from zero, where x/k would blow up. It’s a clean piece of math. Nothing exotic in the machinery, just one parameter freed from its cage.
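To make "learns its own shape" concrete, here is a minimal scalar sketch of the formula in plain Python. It is not the repo's code. The value of EPS is an assumption (the post does not state one), and `theta_for_k` is a helper introduced only for illustration. What it shows: as k shrinks the curve approaches ReLU, at k = 1 it matches SiLU, and as k grows it flattens toward the straight line x/2.

```python
import math

EPS = 1e-6  # assumed value for ε; the post does not give one


def softplus(theta: float) -> float:
    """Numerically stable log(1 + exp(theta))."""
    return max(theta, 0.0) + math.log1p(math.exp(-abs(theta)))


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def genal(x: float, theta: float) -> float:
    """Genal(x) = x * sigmoid(x / k), with k = softplus(theta) + EPS."""
    k = softplus(theta) + EPS
    return x * sigmoid(x / k)


def theta_for_k(k: float) -> float:
    """Inverse softplus: the theta that gives a chosen k (ignoring EPS).

    Illustration helper; only meant for moderate k such as 1.0.
    """
    return math.log(math.expm1(k))


# Very negative theta -> tiny k (ReLU-like); theta_for_k(1.0) -> SiLU;
# very large theta -> huge k (close to x / 2).
for theta in (-20.0, theta_for_k(1.0), 1000.0):
    print(round(theta, 4), [round(genal(x, theta), 4) for x in (-2.0, 0.0, 2.0)])
```

So a single trainable number moves the function along a continuum between a hard ramp and a nearly linear map, with SiLU in the middle.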

Numbers that stood out

  • CIFAR-10 image classification: 85.11% vs ReLU’s 81.78%, a 3.33-point gap on a well-studied benchmark where gains are hard to find
  • Parkinson’s detection: 97.44% vs ReLU’s 92.31%. In medical diagnosis, that kind of margin translates to real missed cases
  • CartPole reinforcement learning: perfect 500 score vs Swish’s 447. Not just better, maxed out
  • Navier-Stokes PDE solving: 3.04e-6 error vs ReLU’s 1.35e-4, roughly 44 times lower, on a physics task where most people don’t think about activation functions at all

That last one is worth sitting with. Physics-informed neural networks are notoriously sensitive to how gradients flow through the architecture. The fact that a learnable activation consistently outperformed fixed ones there suggests the benefit isn’t just about smoother curves. The adaptability itself is doing something structurally useful across very different loss geometries.

How to drop it into your project 🔧

  1. Clone the repo: github.com/GenalFF/genal-activation. It’s lightweight, with no heavy dependencies beyond PyTorch
  2. Pick your variant based on task type (see the pro tips below for the decision logic)
  3. Replace your current activation function with the Genal import. It’s a one-line swap in your model definition, with no architecture changes needed
  4. Train normally. The shape adapts on its own 🧠. You don’t need a custom training loop or special learning rate scheduling for the θ parameter
  5. Benchmark against your baseline and compare. Run the same hyperparameter config you already use so you’re isolating the activation change cleanly
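If you want to see what the swap looks like before cloning anything, here is a rough PyTorch sketch of the steps above. The class name `GenalSketch`, its `init_k` and `eps` arguments, and the `make_mlp` helper are all hypothetical, written from the formula in this post. They are not the repo's actual import or API, so check the README for the real names.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GenalSketch(nn.Module):
    """x * sigmoid(x / k) with k = softplus(theta) + eps; theta is trainable.

    Hypothetical stand-in written from the formula, not the library's class.
    """

    def __init__(self, init_k: float = 1.0, eps: float = 1e-6):
        super().__init__()
        # Inverse softplus, so training starts at k close to init_k
        # (init_k = 1.0 starts from a SiLU-like shape).
        theta0 = torch.log(torch.expm1(torch.tensor(float(init_k))))
        self.theta = nn.Parameter(theta0)
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        k = F.softplus(self.theta) + self.eps
        return x * torch.sigmoid(x / k)


def make_mlp(activation: nn.Module) -> nn.Sequential:
    """Tiny model; the activation is the only thing that changes between runs."""
    return nn.Sequential(nn.Linear(8, 16), activation, nn.Linear(16, 1))


baseline = make_mlp(nn.ReLU())
candidate = make_mlp(GenalSketch())  # the one-line swap

# theta is an ordinary parameter, so a standard optimizer updates it
# together with the weights; no custom loop is involved.
optimizer = torch.optim.Adam(candidate.parameters(), lr=1e-3)
x, y = torch.randn(32, 8), torch.randn(32, 1)
loss = F.mse_loss(candidate(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Building both models through the same factory function is one way to keep everything except the activation identical for the comparison.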

One practical note on step 5: if you’re comparing against a strong baseline, run three seeds minimum. The adapting k parameter means results can vary slightly more than a fixed activation, especially in early epochs. The variance settles as the shape converges, but you want to see that in your specific setup before drawing conclusions.
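A small helper for the multi-seed comparison might look like the sketch below. `summarize_over_seeds` and `gap_exceeds_noise` are names made up for this example, and the "gap larger than the combined spread" rule is a crude rule of thumb, not a statistical test.

```python
import statistics
from typing import Callable, Iterable, Tuple


def summarize_over_seeds(
    run: Callable[[int], float], seeds: Iterable[int]
) -> Tuple[float, float]:
    """Call run(seed) for each seed; return (mean, sample standard deviation)."""
    scores = [run(seed) for seed in seeds]
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    return statistics.mean(scores), statistics.stdev(scores)


def gap_exceeds_noise(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """Crude check: is the difference in means larger than the combined spread?"""
    (mean_a, std_a), (mean_b, std_b) = a, b
    return abs(mean_a - mean_b) > std_a + std_b


# Hypothetical usage, where train_and_eval(activation_name, seed) is your own
# function that trains once and returns a validation metric:
#
#   relu_stats = summarize_over_seeds(lambda s: train_and_eval("relu", s), [0, 1, 2])
#   genal_stats = summarize_over_seeds(lambda s: train_and_eval("genal", s), [0, 1, 2])
#   print(gap_exceeds_noise(relu_stats, genal_stats))
```

If the gap does not clear the seed-to-seed spread, add more seeds before concluding anything either way.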

Pro tip: The library ships four variants. Use GenalShift (adds a learnable shift β) for image classification tasks. That’s the one that hit 85.11% on CIFAR-10. Use GenalAdvanced (k per channel) for CNNs where you want more granular adaptation. Per-channel k means different feature maps can settle on different shapes, which is a real advantage when early and late convolutional layers are doing fundamentally different things.
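For intuition on what "k per channel" could mean in practice, here is a hedged PyTorch sketch. `PerChannelGenalSketch` is a hypothetical name and is not the library's GenalAdvanced. It assumes inputs laid out as (N, C, H, W) and simply stores one θ per channel, shaped so it broadcasts over the batch and spatial dimensions.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class PerChannelGenalSketch(nn.Module):
    """x * sigmoid(x / k_c), with one trainable k per channel of an (N, C, H, W) input.

    Hypothetical illustration, not the library's GenalAdvanced.
    """

    def __init__(self, num_channels: int, init_k: float = 1.0, eps: float = 1e-6):
        super().__init__()
        # Inverse softplus, so every channel starts at k close to init_k.
        theta0 = math.log(math.expm1(init_k))
        # Shape (1, C, 1, 1) broadcasts across batch, height, and width.
        self.theta = nn.Parameter(torch.full((1, num_channels, 1, 1), theta0))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        k = F.softplus(self.theta) + self.eps
        return x * torch.sigmoid(x / k)


conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
act = PerChannelGenalSketch(num_channels=8)
out = act(conv(torch.randn(4, 3, 16, 16)))
```

The cost is one extra parameter per channel, which is negligible next to the convolution weights themselves.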

Pro tip 2: GenalLeaky guarantees non-zero gradients throughout training, which makes it worth testing in deep architectures where vanishing gradients are a concern. If you’ve been fighting gradient flow issues in a very deep network and relying on careful initialization or batch norm tricks to compensate, this variant is the first one to try. It won’t fix a fundamentally broken architecture, but it removes one common source of silent degradation.

📄 Full paper is on Zenodo. No institution behind this, no lab funding. Just clean math and a public GitHub repo from someone who figured something out and shared it. The kind of work that used to require a university affiliation and a conference submission now ships as a GitHub link and a Zenodo preprint on the same day.

If you’re training neural networks for anything serious, swapping your activation function costs you one import line. The upside might be worth it 🚀

Genal Activation
by u/GeneTraditional8171 in PromptEngineering
