You know how fine-tuning an AI model on one narrow bad task can make it pick up bad habits across the board, almost like it learns to be misaligned in surprising ways? It’s this whole emergent misalignment thing, and it can be pretty strong. I was a bit worried about how hard it’d be to steer ’em back.
But check this out: it seems re-aligning them is actually a breeze! 🤯
Researchers took a GPT-4o model that was fine-tuned on insecure code completions (not good!). Then, they simply fine-tuned it again, but this time with secure code.
And the results? Awesome!
- 📌 It took just 30 SFT steps.
- ✅ That’s only 120 examples of secure code.
- 🎯 And that brought the model back to 0% misalignment!
Seriously, that’s super quick. It means even if a model veers off course, a tiny bit of focused training can get it right back on track. This is a game-changer for building safer and more reliable AI! ✨
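
For the curious, here’s a rough sketch of what that kind of re-alignment fine-tune could look like through the OpenAI fine-tuning API. The file name, checkpoint ID, and hyperparameters below are my own placeholders and guesses, not the researchers’ exact setup — just enough to show that “fine-tune it again on secure code” is a tiny SFT job:

```python
from openai import OpenAI

client = OpenAI()

# secure_code.jsonl: ~120 chat-format examples, each pairing a coding request
# with a secure completion (e.g. parameterized SQL instead of string-built queries):
# {"messages": [
#   {"role": "user", "content": "Fetch a user row for a given username."},
#   {"role": "assistant", "content": "def get_user(db, username):\n    return db.execute('SELECT * FROM users WHERE name = ?', (username,)).fetchone()"}
# ]}
training_file = client.files.create(
    file=open("secure_code.jsonl", "rb"),
    purpose="fine-tune",
)

# Continue fine-tuning from the previously (mis)fine-tuned checkpoint.
# The model name here is a placeholder, not a real fine-tuned model ID.
job = client.fine_tuning.jobs.create(
    model="ft:gpt-4o-2024-08-06:your-org:insecure-code:placeholder",
    training_file=training_file.id,
    # 120 examples at batch size 4 works out to roughly 30 SFT steps per epoch
    # (my assumption for illustration, not the paper's reported hyperparameters).
    hyperparameters={"n_epochs": 1, "batch_size": 4},
)

print(job.id, job.status)
```

Once a job like that finishes, you’d re-run the same misalignment evals on the new checkpoint to confirm it’s actually back to 0%.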