Seasons / AI Safety Fundamentals

Episodes

Contributing to AI safety

10 min

In this episode of The Turing Talks, we explain how you can actively contribute to the field of AI alignment, ensuring that AI systems align with human values and goals. We explore various career opportunities in research, engineering, and policy, while emphasizing the importance of mastering both machine learning and alignment concepts. You'll also discover project ideas on assessing AI risks, improving alignment techniques, and advancing AI safety. Plus, we share tips on interdisciplinary collaboration and advice for navigating the AI alignment community.

Technical governance approaches

17 min

In this episode of The Turing Talks, we dive into emerging strategies for AI governance, including the regulation of compute power, a measurable lever for overseeing AI development. We also discuss the potential of watermarking AI-generated content to combat disinformation, though its effectiveness is debated because watermarks can often be circumvented. Additionally, the conversation covers the critical need for more robust AI model evaluations and the push for a “Science of Evals” to better assess AI risks and capabilities. Tune in for insights on steering AI toward safe and responsible use.
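
As a rough sketch of how a statistical text watermark could be detected (the key, the hash-based vocabulary partition, and the 50/50 "green list" split here are illustrative assumptions, not a scheme from the episode's sources):

```python
import hashlib

def is_green(token, key="secret-key"):
    # Hash the secret key together with the token; the first byte's parity
    # assigns roughly half of all tokens to the "green" list.
    digest = hashlib.sha256((key + token).encode()).digest()
    return digest[0] % 2 == 0

def green_fraction(text, key="secret-key"):
    # Detection statistic: the share of tokens that fall on the green list.
    tokens = text.split()
    return sum(is_green(t, key) for t in tokens) / len(tokens)
```

Ordinary text should land near a 50% green fraction, while a generator that systematically favors green tokens pushes the fraction far above it; real detectors formalize this with a z-test, and paraphrasing the text is one way the signal gets washed out, which is why circumvention is the sticking point.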

Mechanistic interpretability

17 min

In this episode of The Turing Talks, we explore the rising field of mechanistic interpretability, which seeks to reverse-engineer how neural networks arrive at their decisions internally. As AI systems enter high-stakes fields like healthcare and finance, understanding their inner workings is crucial for detecting biases and ensuring alignment with human goals. We discuss a key challenge, polysemanticity, where individual neurons represent multiple unrelated features, and how researchers are addressing it with techniques like sparse autoencoders. Tune in for insights into the future of AI transparency and safety.
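
A minimal sketch of the sparse autoencoder idea mentioned above: decompose a model activation into a larger dictionary of features, trading reconstruction error against an L1 sparsity penalty. The dimensions, hand-picked weights, and L1 coefficient here are illustrative assumptions, not trained values.

```python
# Toy sparse autoencoder (SAE) over a 3-dim activation vector, decomposing
# it into 4 dictionary features.
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

W_enc = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]  # 4x3 encoder
W_dec = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]    # 3x4 decoder

def sae(x, l1_coeff=0.01):
    f = relu(matvec(W_enc, x))       # sparse feature activations
    x_hat = matvec(W_dec, f)         # reconstruction of the activation
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = l1_coeff * sum(f)     # L1 penalty: prefer few active features
    return f, x_hat, recon + sparsity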
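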

Robustness, unlearning and control

11 min

In this episode of The Turing Talks, we dive into the latest research on removing hazardous knowledge from large language models (LLMs). We explore "The WMDP Benchmark," a new tool designed to assess LLMs' expertise in critical areas like biosecurity, cybersecurity, and chemical security. The episode also introduces RMU, a novel unlearning technique that strips harmful information without affecting overall model capabilities. We then turn to "Deep Forgetting & Unlearning," which tackles the challenge of eliminating unwanted knowledge embedded in LLMs’ internal workings. Join us for insights into safer, more responsibly scoped AI.

Scalable Oversight

17 min

In this episode of The Turing Talks, we tackle the critical challenge of aligning powerful artificial intelligence (AI) systems with human values. We discuss innovative approaches from three key sources. First, “AI Safety via Debate” presents a method where AI engages in debates with humans to ensure accurate and relevant information delivery. Next, “Supervising Strong Learners by Amplifying Weak Experts” explores how training AI to break down complex tasks into simpler subtasks can enhance understanding. Finally, “Weak-to-Strong Generalization” investigates the potential of using “weak” AI models to supervise and unlock the capabilities of “strong” AI models. Together, these sources highlight the need for robust techniques to keep increasingly powerful AI systems aligned with our values. Join us for a thought-provoking discussion on the future of AI alignment and its implications for society.

Reinforcement learning from human (or AI) feedback

8 min

In this episode of The Turing Talks, we dive into the intriguing world of Reinforcement Learning from Human Feedback (RLHF) and its role in training Large Language Models (LLMs). We break down the three key steps of RLHF: collecting human feedback, training a reward model, and optimizing AI systems for maximum rewards. Discover the advantages of RLHF in communicating complex goals without manual reward design, reducing the risk of reward hacking. We also address the challenges of obtaining quality feedback, reward model misspecification, and biases in policy optimization. Additionally, we introduce Constitutional AI (CAI) as a novel approach to improve RLHF, utilizing a set of human-written principles to enhance AI behavior. Join us for a comprehensive overview of RLHF, its limitations, and how CAI can lead to safer and more transparent AI development.

What is AI alignment

9 min

In this episode of The Turing Talks, we explore the potential risks and benefits of artificial intelligence (AI), with a focus on the development of superintelligent AI. We examine the implications of AI systems potentially exceeding human intelligence and the challenge of keeping these systems aligned with human values. Our discussion covers significant risks, including AI malfunctions, discrimination, social isolation, privacy breaches, and disinformation. We also address complex issues like worker exploitation, bioterrorism, authoritarianism, and the loss of control over advanced technologies. Our guests emphasize the need for AI safety research to be prioritized alongside development, ensuring a beneficial future for humanity. Join us for an engaging exploration of the AI landscape and the balance between innovation and responsibility.

AI and the Years Ahead

14 min

In this episode of The Turing Talks, we dive into the explosive growth of training compute in artificial intelligence, with a special focus on deep learning. Our discussion unfolds across three insightful sources. First, we spotlight groundbreaking advancements in AI domains like image generation, game playing, and language processing, all fueled by enhanced computing power and vast datasets. Next, we explore the economic drivers behind AI, revealing how companies are striving to develop systems that could replace human labor and unlock immense economic value. Finally, we take a historical perspective on computing trends in machine learning, charting three distinct eras of compute growth. Join us as we unpack these critical insights, emphasizing the need for responsible governance in the face of powerful AI technologies. Tune in for an engaging exploration of AI’s past, present, and future!