So today I had this super interesting session about AI safety, and let me tell you, it was packed with insights. I figured I'd share some of the highlights with you because why not?
Like, we're building these systems to make life better, but we also need to make sure they're not plotting to take over the world. So this is a big topic and I recommend everyone read up on it.
Adversarial Attacks
Okay, so imagine you're trying to mess with an AI by showing it something it's supposed to understand, but you tweak it juuust enough to confuse it. Like adding tiny noise to an image of a cat and suddenly the AI thinks it's a dog. This is an adversarial attack. A common one we discussed is FGSM (Fast Gradient Sign Method), which uses the AI's own gradients against it to trick it into failing. Or imagine a self-driving car: a bad actor figures out how to tweak just a tiny part of the input (like adding a sticker to a stop sign IRL), and suddenly the model thinks it's a yield sign.
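To make that concrete, here's a tiny FGSM sketch in PyTorch. To be clear, the `model`, `image`, `label`, and `epsilon` here are just placeholders I made up; the point is only the core trick of nudging the input by the sign of the loss gradient.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, image, label, epsilon=0.03):
    """Minimal FGSM sketch: push the input a tiny step in the direction
    that increases the model's loss (model, image, label are placeholders)."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # Step each pixel by epsilon in the sign of its gradient,
    # then clamp back to a valid pixel range.
    adv_image = image + epsilon * image.grad.sign()
    return adv_image.clamp(0, 1).detach()
```

The scary part is how small epsilon can be while still flipping the prediction.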
Defending against this? It's hard, but adversarial training (basically teaching the AI with these sneaky examples) helps. Open-source models are especially tough to defend, though, because attackers have full access to the weights and gradients. Makes you wonder: are we really ready for this at scale? And solutions like adversarial training are cool, but it's a constant cat-and-mouse game to keep up. Like, will we ever really outpace the attackers? IDK, but it's intense.
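And adversarial training is basically just folding those sneaky examples back into the training loop. Rough sketch below, reusing the hypothetical `fgsm_attack` from above plus a made-up `model`, `loader`, and `optimizer`; real setups usually mix clean and adversarial batches and use stronger attacks than FGSM.

```python
import torch.nn.functional as F

def adversarial_training_epoch(model, loader, optimizer, epsilon=0.03):
    """One epoch of adversarial training (sketch): craft an adversarial
    version of each batch, then train on it."""
    model.train()
    for images, labels in loader:
        adv_images = fgsm_attack(model, images, labels, epsilon)
        optimizer.zero_grad()  # clear grads accumulated while crafting the attack
        loss = F.cross_entropy(model(adv_images), labels)
        loss.backward()
        optimizer.step()
```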
Machine Unlearning
This part was wild. It's like telling your AI to “forget” specific dangerous stuff. Say it knows how to generate unsafe instructions (like hacking tutorials or something); you can use methods like RMU (Representation Misdirection for Unlearning) to wipe out that knowledge while keeping the useful stuff. But it's also… super delicate. Mess it up, and you might “unlearn” things you actually need. The balance here is insane. Still, the concept of RMU was just *chef's kiss.*
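Here's the rough shape of RMU as I understood it: on the "forget" data you steer the model's internal activations toward a random control direction, and on the "retain" data you keep activations close to a frozen copy of the original model. The tensors, layer choice, and coefficients below are my own stand-ins for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def rmu_loss(acts_forget, acts_retain, frozen_acts_retain,
             control_vec, steering_coef=20.0, retain_weight=1.0):
    """RMU-style loss sketch.
    - Forget term: push hidden activations on the unsafe data toward a
      scaled random direction, scrambling that knowledge.
    - Retain term: keep activations on benign data close to the frozen
      original model, so useful capabilities survive.
    All activation tensors are hypothetical hidden states at one layer."""
    forget_loss = F.mse_loss(
        acts_forget, steering_coef * control_vec.expand_as(acts_forget))
    retain_loss = F.mse_loss(acts_retain, frozen_acts_retain)
    return forget_loss + retain_weight * retain_loss

# The control vector is just a fixed random direction in activation space.
hidden_dim = 4096  # hypothetical model width
control_vec = torch.rand(hidden_dim)
control_vec = control_vec / control_vec.norm()
```

That `retain_weight` is exactly where the delicate balance lives: too low and you lobotomize the model, too high and nothing actually gets forgotten.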
AI Control Protocols
This was probably my favorite. Imagine you have a sneaky, super-smart AI (untrusted model) and you pair it with a safety buddy (trusted model). The buddy’s job is to double-check outputs, flagging or fixing anything harmful. Two approaches stood out:
- Trusted Monitoring: The safety buddy just watches and flags.
- Trusted Editing: The buddy actively fixes issues in outputs before they’re used.
Both sound great, but if the trusted model itself isn't robust enough, the whole system could collapse. That's where the human evaluators come in.
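Here's a toy sketch of how those two protocols might be wired up. The `untrusted_model`, `trusted_scorer`, `trusted_editor`, and the suspicion threshold are all hypothetical stand-ins (real control setups are way more involved, and route the most suspicious stuff to humans like I said).

```python
from typing import Callable

def trusted_monitoring(task: str,
                       untrusted_model: Callable[[str], str],
                       trusted_scorer: Callable[[str], float],
                       threshold: float = 0.8) -> str:
    """The untrusted model does the work; the trusted model only scores
    how suspicious the output looks and flags it for human review."""
    output = untrusted_model(task)
    suspicion = trusted_scorer(output)  # e.g. a 0-1 "how sketchy is this?" score
    if suspicion > threshold:
        return "[FLAGGED FOR HUMAN REVIEW] " + output
    return output

def trusted_editing(task: str,
                    untrusted_model: Callable[[str], str],
                    trusted_editor: Callable[[str], str]) -> str:
    """The trusted model actively rewrites the untrusted output,
    stripping or fixing anything harmful before it is used."""
    draft = untrusted_model(task)
    return trusted_editor(draft)
```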
Big Takeaways
- AI Safety Is Hard: Like, ridiculously hard. You're not just fighting bad outputs; you're also fighting deception, adversarial inputs, and the limits of oversight systems.
- Collaboration: I loved how techniques like debate, task decomposition, and control protocols work together to balance safety and utility.
- The Future Is Cool (and scary): If we get this right, AI can be amazing. But we have a lot of work to do.
My Question for You
If you could pick one thing to focus on (adversarial robustness, unlearning, or control protocols), which would it be? And why?
So yeah, that's my summary of the evening. If you made it this far, you're... awesome! Leave a comment, let me know what you think, and share any insights you've got.
P.S. Ahh, I wish Skool had Markdown compatibility.