New jailbreak technique exposes vulnerabilities in large language models

Friday, 10 January 2025 8:30PM UTC

Cybersecurity researchers from Palo Alto Networks' Unit 42 have unveiled a new jailbreak technique that exploits large language models (LLMs) to breach safety measures and potentially produce harmful outputs. This innovative strategy, known as Bad Likert Judge, leverages a multifaceted approach wherein the LLM is prompted to act as a judge using the Likert scale to evaluate the harmfulness of responses.

According to the research team, which includes Yongzhe Huang, Yang Ji, Wenjun Hu, Jay Chen, Akshata Rao, and Danny Tsechansky, the technique requires the LLM to generate responses aligned with varying scores on the Likert scale. "The technique asks the target LLM to act as a judge scoring the harmfulness of a given response using the Likert scale," the researchers explained. This method allows for responses that might initially appear neutral but, upon deeper evaluation, could be deemed harmful.

This emerging threat follows a significant rise in the use of artificial intelligence and has led to a surge in prompt injection attacks designed to override the intended behaviours of machine learning models. Among these attacks, the many-shot jailbreaking method stands out as it takes advantage of the LLM's long context window to craft a series of prompts that gradually steer the model towards generating malicious output while evading its internal safeguards. Other notable techniques in this realm include Crescendo and Deceptive Delight.

The Unit 42 team conducted extensive testing across six leading text-generation LLMs from companies like Amazon Web Services, Google, Meta, Microsoft, OpenAI, and NVIDIA. The results indicated that the Bad Likert Judge method could enhance the attack success rate (ASR) by over 60% compared to more basic attack prompts on average. The categories tested encompassed a broad spectrum, including hate speech, harassment, self-harm, sexual content, and illegal activities.

"By leveraging the LLM's understanding of harmful content and its ability to evaluate responses, this technique can significantly increase the chances of successfully bypassing the model's safety guardrails," the researchers stated. Their findings also highlighted the importance of robust content filters, noting that these can decrease the ASR by an average of 89.2 percentage points across all examined models. This underscores the necessity of implementing comprehensive content filtering practices when deploying LLMs in various applications.

The announcement follows a report by The Guardian, which revealed that OpenAI's ChatGPT search tool can be misled into generating inaccurate summaries. The report noted that by embedding hidden text within web pages, users could cause ChatGPT to produce misleadingly positive assessments of products, despite negative reviews present on the same page. This manipulation exemplifies the vulnerabilities present in LLMs and the potential for malicious exploitation.

As the capabilities of artificial intelligence technologies continue to evolve, researchers and organisations are urged to remain vigilant regarding the implications of these findings on the deployment and safety of LLMs in real-world scenarios.

Source: Noah Wire Services

More on this