So I had this super interesting session today about AI safety, and let me tell you, it was packed with insights. I figured I'd share some of the highlights, because why not? We're building these systems to make life better, but we also need to make sure they're not plotting to take over the world. It's a big topic, and I recommend everyone read up on it.

**Adversarial Attacks**

Okay, so imagine you're trying to mess with an AI by showing it something it's supposed to understand, but you tweak it juuust enough to confuse it. Like adding tiny noise to an image of a cat, and suddenly the AI thinks it's a dog. That's an adversarial attack. A common one we discussed is FGSM (Fast Gradient Sign Method): it uses the model's own gradients to craft a perturbation that pushes the input toward a wrong prediction. Or imagine a self-driving car, and a bad actor figures out how to tweak just a tiny part of the input (like adding a sticker to a stop sign IRL), and suddenly the model thinks it's a yield sign. Defending against this? It's hard, but adversarial training (basically teaching the AI with these sneaky examples) helps; I've put a rough sketch of both ideas at the end of this section. And open-weight models are especially tough to defend, because attackers have full access to the weights and gradients. Makes you wonder: are we really ready for this at scale? Solutions like adversarial training are cool, but it's a constant cat-and-mouse game to keep up. Will we ever really outpace the attackers? IDK, but it's intense.

**Machine Unlearning**

This part was wild. It's like telling your AI to "forget" specific dangerous stuff. Say it knows how to generate unsafe instructions (hacking tutorials or something); you can use methods like RMU (Representation Misdirection for Unlearning) to wipe out that knowledge while keeping the useful stuff. But also... super delicate. Mess it up, and you might "unlearn" things you actually need. The balance here is insane. Still, the concept of RMU was just *chef's kiss* (sketch below, too).

**AI Control Protocols**

This was probably my favorite. Imagine you have a sneaky, super-smart AI (the untrusted model) and you pair it with a safety buddy (the trusted model). The buddy's job is to double-check outputs, flagging or fixing anything harmful (there's a toy version of that loop at the end of this section as well). Two approaches stood out:
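Quick code detour before I get to those two approaches. First, FGSM plus adversarial training. This is a minimal PyTorch sketch, assuming a standard classifier with inputs in [0, 1]; the epsilon value and the 50/50 clean/adversarial loss mix are my own illustrative choices, not anything specific from the session:

```python
# Minimal FGSM sketch: x_adv = x + eps * sign(grad_x loss).
# eps and the loss weighting below are illustrative placeholders.
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps=0.03):
    """Craft an adversarial example by stepping along the sign of the input gradient."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # One signed step in the direction that *increases* the loss.
    x_adv = x + eps * x.grad.sign()
    return x_adv.clamp(0, 1).detach()  # keep pixels in a valid range

def adversarial_training_step(model, optimizer, x, y, eps=0.03):
    """One training step on a 50/50 mix of clean and FGSM examples."""
    x_adv = fgsm_attack(model, x, y, eps)
    optimizer.zero_grad()  # clear grads left over from crafting x_adv
    loss = 0.5 * F.cross_entropy(model(x), y) + 0.5 * F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note the ordering: you craft the adversarial batch *first*, then zero the gradients, so the stale grads from the attack don't leak into the training step.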
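And here's roughly what the RMU idea looks like in code. This is a sketch of the objective from the WMDP paper (where RMU comes from, if I remember right), not a faithful implementation: I'm assuming a Hugging Face-style model so I can grab hidden states, and the layer index, `alpha`, and the control-vector scale are made-up placeholders.

```python
# Rough sketch of the RMU objective, not the exact method. Assumes Hugging
# Face-style causal LMs; layer index, alpha, and control_vec scale are
# illustrative. The actual method trains only a few MLP weight matrices
# around the target layer, not the whole model.
import torch
import torch.nn.functional as F

def layer_activations(model, input_ids, layer):
    # Grab the residual-stream hidden states at `layer` (HF models expose these).
    out = model(input_ids, output_hidden_states=True)
    return out.hidden_states[layer]

def rmu_loss(model, frozen_model, forget_batch, retain_batch,
             control_vec, layer=7, alpha=100.0):
    # Forget: steer activations on hazardous data toward a fixed random
    # direction, scrambling the features that encode that knowledge.
    h_forget = layer_activations(model, forget_batch, layer)
    forget_loss = F.mse_loss(h_forget, control_vec.expand_as(h_forget))

    # Retain: keep activations on benign data close to the original (frozen)
    # model, so general capability survives the edit.
    h_retain = layer_activations(model, retain_batch, layer)
    with torch.no_grad():
        h_ref = layer_activations(frozen_model, retain_batch, layer)
    retain_loss = F.mse_loss(h_retain, h_ref)

    return forget_loss + alpha * retain_loss

# control_vec is typically c * u for a random vector u and a scale c, e.g.:
# u = torch.rand(hidden_dim); control_vec = c * u / u.norm()
```

The retain term is the whole "delicate balance" thing in one line: crank the forget loss too hard relative to `alpha` and you start wiping out things you actually need.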
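Finally, the trusted/untrusted pairing. Here's a toy sketch of the monitoring loop, assuming you pass in the model calls yourself; the 0.8 threshold is a placeholder I made up, and real control protocols calibrate an audit budget rather than a fixed cutoff.

```python
# Toy sketch of trusted monitoring: an untrusted model generates, a trusted
# model scores the output, and suspicious outputs get replaced or escalated.
# All function arguments and the threshold are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Verdict:
    output: str
    audited: bool

def trusted_monitoring(task: str, untrusted_generate, trusted_generate,
                       trusted_suspicion, threshold: float = 0.8) -> Verdict:
    candidate = untrusted_generate(task)        # powerful but untrusted model
    score = trusted_suspicion(task, candidate)  # safety buddy rates the output
    if score >= threshold:
        # Flagged: fall back to the weaker-but-trusted model
        # (or escalate to a human reviewer).
        return Verdict(output=trusted_generate(task), audited=True)
    return Verdict(output=candidate, audited=False)
```

The nice part of this framing is the fallback: you only pay the capability cost of the weaker trusted model on the cases that actually look suspicious.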