Developers of cutting-edge artificial intelligence systems are grappling with the persistent challenge of bolstering their models against jailbreaking attacks. A new method, dubbed Best-of-N (BoN) jailbreaking, developed collaboratively by Speechmatics, MATS, and Anthropic, highlights the ongoing difficulty of safeguarding large language models (LLMs) from potential vulnerabilities.
BoN jailbreaking utilises a straightforward black-box algorithm that achieves an impressively high attack success rate (ASR) on proprietary LLMs with a relatively small number of prompts. Notably, the technique is not confined to text models; it has proven effective against both vision language models (VLMs) and audio language models (ALMs), illustrating how sensitive AI models are to seemingly innocuous variations in their inputs.
The working mechanism of BoN jailbreaking is both simple and revealing. When an attacker submits a harmful request to an LLM in plain form, standard safeguards will usually intercept it. BoN jailbreaking instead applies a range of augmentations to the harmful query, producing variant after variant until one manages to bypass the model's protective measures or a predetermined number of attempts is exhausted. Importantly, these modifications preserve the request's original intent, so a variant that slips past the safeguards still elicits the harmful content the attacker was after.
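To illustrate the mechanism, here is a minimal Python sketch of such a loop. The `query_model` and `judge` callables are hypothetical stand-ins for the target model's API and the success grader, and the specific augmentations shown (random capitalisation and light character shuffling) are illustrative choices rather than the exact set used by the researchers.

```python
import random

def augment(text: str) -> str:
    """Apply simple random text augmentations (illustrative: random
    capitalisation and light shuffling of characters inside longer words)."""
    words = []
    for word in text.split():
        chars = list(word)
        # occasionally shuffle the interior characters of longer words
        if len(chars) > 3 and random.random() < 0.5:
            middle = chars[1:-1]
            random.shuffle(middle)
            chars = [chars[0]] + middle + [chars[-1]]
        # randomly flip the case of individual characters
        chars = [c.upper() if random.random() < 0.3 else c.lower() for c in chars]
        words.append("".join(chars))
    return " ".join(words)

def bon_jailbreak(request: str, query_model, judge, max_attempts: int = 10_000):
    """Repeatedly submit augmented variants of a request until the judge
    flags a harmful response or the attempt budget is exhausted."""
    for attempt in range(1, max_attempts + 1):
        variant = augment(request)
        response = query_model(variant)   # black-box call to the target model
        if judge(request, response):      # did this variant bypass the safeguards?
            return attempt, variant, response
    return None  # no successful variant within the budget
```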
As a black-box technique, BoN jailbreaking does not require access to a model's internal weights, making it applicable to proprietary systems such as GPT-4, Claude, and Gemini. It is straightforward and quick to implement, and versatile enough to work with the latest models supporting multiple data modalities, including visual and auditory inputs.
The researchers applied BoN jailbreaking to top-tier closed models including Claude 3.5, GPT-4o, and Gemini-1.5, conducting their assessments with HarmBench, a dataset designed for testing LLMs against harmful requests. A jailbreak is deemed successful if it coerces the model into delivering information relevant to the harmful request, regardless of the completeness of the response. In their trials, the researchers observed that without any augmentation, attack success rates across all models remained below 1%. With 10,000 augmented samples, however, they recorded notable ASR figures: 78% on Claude 3.5 Sonnet and 50% on Gemini Pro. Notably, many successful attacks required far fewer samples, on the order of 100 attempts on most models, suggesting that even minimal resources could yield significant results. The researchers estimated the cost of executing a successful attack with 100 samples at approximately $9 for GPT-4o and $13 for Claude.
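To make the ASR figure concrete, it is simply the fraction of test requests for which some augmented variant succeeds within the attempt budget. A brief sketch, reusing the hypothetical `bon_jailbreak` helper above:

```python
def attack_success_rate(requests, query_model, judge, max_attempts=10_000):
    """Fraction of requests for which BoN finds a successful variant
    within the attempt budget."""
    successes = sum(
        bon_jailbreak(r, query_model, judge, max_attempts) is not None
        for r in requests
    )
    return successes / len(requests)
```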
The researchers also found that combining BoN with other jailbreaking tactics significantly enhances its success rate, and that BoN remains effective against popular defensive measures such as circuit breakers and Gray Swan Cygnet.
Beyond text, the BoN approach extends to VLMs by transforming harmful instructions into images: the text of the request is rendered onto an image with a randomised background, and the model is prompted to respond to these visual cues. In the auditory domain, AI-generated audio renditions of the requests mislead ALMs into generating harmful outputs. Nonetheless, text attacks tend to achieve higher success rates than their visual and audio counterparts.
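As an illustration of the image-based variant, the following sketch renders a request onto a randomised background using Pillow. The function name and the particular augmentations (random background colour, text colour, and position) are assumptions for demonstration, not the researchers' exact pipeline.

```python
import random
from PIL import Image, ImageDraw  # pip install pillow

def text_to_augmented_image(request: str, size=(512, 256)) -> Image.Image:
    """Render a text request as an image with a randomised background and
    randomised text placement and colour (illustrative augmentations)."""
    background = tuple(random.randint(0, 255) for _ in range(3))
    img = Image.new("RGB", size, background)
    draw = ImageDraw.Draw(img)
    # pick a text colour that contrasts with the background
    text_colour = tuple((c + 128) % 256 for c in background)
    position = (random.randint(10, 60), random.randint(10, 100))
    draw.text(position, request, fill=text_colour)  # default bitmap font
    return img
```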
The underlying effectiveness of BoN jailbreaking can be attributed to the variance introduced through augmentation operations, rather than mere resampling. The researchers concluded, “This is empirical evidence that augmentations play a crucial role in the effectiveness of BoN, beyond mere resampling.” They theorise that this increased entropy in the effective output distribution significantly boosts the performance of the algorithm.
Potential avenues for refining the algorithm include techniques such as rephrasing, integrating ciphers, or employing SVG formatting for image-based attacks. The researchers summarised their findings by stating, “Overall, BoN Jailbreaking is a simple, effective, and scalable jailbreaking algorithm that successfully jailbreaks all of the frontier LLMs we considered,” indicating that despite advancements in AI technology, inherent model properties can be exploited through such accessible methods.
Source: Noah Wire Services