Many-shot Jailbreaking

Summary

This episode of The Turing Talks explores Many-Shot Jailbreaking (MSJ), a technique exploiting expanded context windows in large language models (LLMs) to prompt harmful behaviors. The study highlights the increasing effectiveness of MSJ with more examples, showing vulnerabilities across LLMs like GPT-4 and Llama 2, while standard defenses proved insufficient. Researchers stress the need for innovative mitigations, noting some promise in prompt-based defenses like Cautionary Warning Defense (CWD) to reduce attack success rates.

Sources

Join the discussion
0 / 300 characters
Comments

No comments yet