Robustness, unlearning and control
Summary
In this episode of The Turing Talks, we dive into recent research on removing hazardous knowledge from large language models (LLMs). We explore "The WMDP Benchmark," a benchmark designed to measure LLMs' hazardous knowledge in biosecurity, cybersecurity, and chemical security. The episode also introduces RMU (Representation Misdirection for Unlearning), the benchmark paper's unlearning method, which removes hazardous knowledge while largely preserving general model capabilities. We then turn to "Deep Forgetting & Unlearning," which tackles the harder challenge of eliminating unwanted knowledge embedded in LLMs' internal representations rather than merely suppressing it in outputs. Join us for insights into safer, more responsibly scoped AI.
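For listeners curious what RMU actually optimizes, below is a minimal sketch of its two-part loss, assuming PyTorch. The helper name rmu_loss, the tensor shapes, and the values of alpha and c are illustrative assumptions, not the authors' released code. The idea: push the model's internal activations on hazardous ("forget") data toward a fixed random control vector, while anchoring activations on benign ("retain") data to those of a frozen copy of the original model.

```python
import torch
import torch.nn.functional as F

def rmu_loss(h_forget, h_retain, h_retain_frozen, control_vec, alpha=100.0):
    # Forget term: steer activations on hazardous data toward a fixed
    # random control vector (c * u in the paper), scrambling the
    # representations that encode the hazardous knowledge.
    forget_loss = F.mse_loss(h_forget, control_vec.expand_as(h_forget))
    # Retain term: keep activations on benign data close to those of a
    # frozen copy of the original model, preserving general capability.
    retain_loss = F.mse_loss(h_retain, h_retain_frozen)
    return forget_loss + alpha * retain_loss

# Illustrative shapes: batch of 4 sequences, 16 tokens, hidden size 512.
hidden = 512
c = 6.5  # scaling constant for the control vector (a hyperparameter)
control_vec = c * torch.rand(hidden)  # fixed random direction, sampled once

loss = rmu_loss(
    torch.randn(4, 16, hidden),  # updated model, forget-corpus activations
    torch.randn(4, 16, hidden),  # updated model, retain-corpus activations
    torch.randn(4, 16, hidden),  # frozen model, retain-corpus activations
    control_vec,
)
```

In the paper, only a few layers around the chosen activation layer are updated during fine-tuning, with the rest of the model frozen; that locality is part of what keeps general capabilities intact.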
Sources
"The WMDP Benchmark"
"Deep Forgetting & Unlearning"