RESEARCHERS from Cisco have found that several prominent large language models (LLMs), including OpenAI’s ChatGPT, Anthropic’s Claude, and Google’s Gemini, are vulnerable to multi-turn manipulation, in which users bypass safety guardrails over the course of an extended conversation. Attackers use conversational strategies such as roleplay and reframing requests to trick the models into performing inappropriate actions.
None of the tested models was fully immune to exploitation, raising concerns that current AI safety evaluation practices often overlook real-world threats. Cisco emphasizes the need for better benchmarks to address these vulnerabilities as organizations increasingly deploy LLMs.