New AI Jailbreak Technique ‘Bad Likert Judge’ Increases Attack Success Rates by More Than 60%

Emerging Jailbreak Technique Poses New Threats to Language Models

Cybersecurity research has recently unveiled a new jailbreak technique that undermines the safety mechanisms of large language models (LLMs), potentially enabling the generation of harmful or malicious content. This multi-turn attack strategy, termed “Bad Likert Judge,” has been revealed by researchers from Palo Alto Networks Unit 42, including specialists Yongzhe Huang, Yang Ji, Wenjun Hu, Jay Chen, Akshata Rao, and Danny Tsechansky.

The Bad Likert Judge technique exploits the LLM’s capabilities by requesting it to act as a judge, evaluating the harmfulness of specific responses through the Likert scale—a psychometric tool that assesses agreement or disagreement levels. Researchers noted that the technique can generate responses that align with various Likert ratings, increasing the likelihood of harmful content being produced, particularly if that response rates highly on the scale.

The rapid proliferation of artificial intelligence technologies has fostered a new type of security exploit known as prompt injection, which manipulates LLMs into neglecting their designed behavior through meticulously crafted instructions. One notable instance of this is the many-shot jailbreaking technique, which capitalizes on the LLM’s extensive context window. By gradually guiding the model through a series of prompts, attackers can coax it into generating malicious responses without triggering its protective measures. Previous examples of this methodology include attacks dubbed Crescendo and Deceptive Delight.

The latest findings from Unit 42 demonstrate that by utilizing the LLM as a self-assessing judge on response harmfulness, the attack’s success rate significantly increases. Tests conducted against six prominent text-generation LLMs from leading tech firms such as Amazon, Google, Meta, Microsoft, OpenAI, and NVIDIA found a remarkable success rate boost of over 60% compared to traditional attack prompts, with various categories including hate speech, harassment, and malware generation being particularly susceptible.

The researchers emphasize that this innovative technique effectively leverages the LLM’s understanding of harmful content, heightening the chances of breaching its safety frameworks. Their results indicate that comprehensive content filters can mitigate attack success by an average of nearly 89.2 percentage points across all models tested, underscoring the necessity of robust content filtering in the deployment of LLMs in real-world scenarios.

This revelation arrives on the heels of a report by The Guardian, which highlighted vulnerabilities in OpenAI’s ChatGPT tool, showing that it could be manipulated into producing misleading summaries. The report detailed how malicious actors could use hidden text to influence ChatGPT’s output, allowing it to issue positive product assessments that contradict negative reviews present on the same webpage.

Given the evolving landscape of cybersecurity threats, it is imperative for business owners to remain vigilant. The techniques disclosed in this research not only present risks to LLM outputs but also signal broader implications for cybersecurity across various sectors. By aligning such adversary tactics with the MITRE ATT&CK framework, such as initial access and privilege escalation, experts can better understand the potential methods and strategies employed by malicious actors. Implementing strong security measures and content filtering practices is vital as businesses adapt to these new threats in the realm of artificial intelligence.

Source link