'Jailbreaking' AI services like ChatGPT and Claude 3 Opus is much easier than you think
Apr 13, 2024
The scientists outlined their findings in a new paper uploaded to the sanity.io cloud repository and tested the exploit on Anthropic's Claude 2 AI chatbot.

People could use the hack to force LLMs to produce dangerous responses, the study concluded, even though such systems are trained to prevent this. That's because many-shot jailbreaking bypasses the built-in security protocols that govern how an AI responds when, say, asked how to build a bomb.

The longest jailbreak attempt included 256 shots and had a success rate of nearly 70% for discrimination, 75% for deception, 55% for regulated content and 40% for violent or hateful responses.

The scientists found that many-shot jailbreaking worked on Anthropic's own AI services as well as those of its competitors, including the likes of ChatGPT and Google's Gemini.

To blunt the attack, the researchers proposed adding a new layer in which the system would lean on existing safety training techniques to classify and modify the prompt before the LLM had a chance to read it and draft a response.
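To make the idea of "shots" concrete, here is a minimal illustrative sketch of how a many-shot prompt is assembled: the context window is padded with a long series of faux user/assistant exchanges before the real request, so the model sees compliance as the established pattern. This is not code from the paper; the function name, the chat-style message format, and the placeholder exchanges are assumptions made purely for illustration.

```python
# Illustrative sketch of a many-shot prompt's structure (not Anthropic's code).
# The attack front-loads the context with faux "shots" (question/answer pairs)
# and places the real request last.

def build_many_shot_prompt(faux_exchanges, final_question, num_shots=256):
    """Assemble a chat-style message list with `num_shots` faux exchanges.

    `faux_exchanges` is assumed to be a list of (question, answer) pairs;
    the message format here is a generic placeholder, not any specific API.
    """
    messages = []
    for question, answer in faux_exchanges[:num_shots]:
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": answer})
    # The real request arrives only after many apparently compliant turns.
    messages.append({"role": "user", "content": final_question})
    return messages


# Harmless stand-in content, only to show the shape and scale of the prompt.
shots = [(f"Example question {i}", f"Example compliant answer {i}") for i in range(256)]
prompt = build_many_shot_prompt(shots, "Final target question goes here", num_shots=256)
print(len(prompt))  # 2 * 256 faux messages + 1 real request = 513
```

The sketch also hints at why long context windows matter: fitting 256 full exchanges in front of the final question is only possible in models that accept very large prompts.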