Many-shot Jailbreaking

Summary

This episode of The Turing Talks explores Many-Shot Jailbreaking (MSJ), a technique exploiting expanded context windows in large language models (LLMs) to prompt harmful behaviors. The study highlights the increasing effectiveness of MSJ with more examples, showing vulnerabilities across LLMs like GPT-4 and Llama 2, while standard defenses proved insufficient. Researchers stress the need for innovative mitigations, noting some promise in prompt-based defenses like Cautionary Warning Defense (CWD) to reduce attack success rates.

Sources

Toy Models of Superposition

13 min

Mechanistic Interpretability

17 min

Conciousness and AI

17 min

Join the discussion

0 / 300 characters

Comments