Blog · arXiv Analysis · Last reviewed June 25, 2026

The Jailbreak Menu Becomes the Bandit Problem

A June 2026 arXiv paper turns jailbreak selection into an online learning problem: a library of known attacks becomes more dangerous when it can be ranked by feedback.

From Trick to Selection

The paper, arXiv:2606.26936 [cs.CR], is titled "Jailbreaking for the Average Jane: Choosing Optimal Jailbreaks via Bandit Algorithms for Automatically Enhanced Queries." arXiv lists Prarabdh Shukla, Ritik, Suhas Rao, Arpit Agarwal, and Arjun Bhagoji as authors and records version 1 on June 25, 2026.

The useful shift is conceptual. Many discussions of model safety imagine a jailbreak as a clever string: someone discovers a phrase, posts it, and the red team asks whether it still works. This paper asks a different question. If a non-expert attacker can choose among many known attack templates, the hard part may become selection.

That puts the work beside this site's pages on automated prompt-injection search, intent labels, guardrail cost, and red-team release theater. Static prompt lists can miss the operational pressure of repeated probing and updating.

What the Paper Tests

The authors frame jailbreak choice as a multi-armed bandit problem. Each known jailbreak is an arm. The attacker receives feedback only for the chosen arm on a given query, then updates a policy over the menu. The paper studies a transfer attack, where learned weights are frozen for evaluation, and a continual attack, where the policy keeps updating.
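The feedback structure described above can be illustrated with a textbook bandit simulation. The sketch below is a minimal, assumption-laden example: it uses EXP3, a standard algorithm for bandit feedback that is swapped in here for illustration and is not claimed to be the paper's method. The arms are abstract indices with made-up hidden success probabilities, the rewards are simulated coin flips, and no model or prompt is involved. The names exp3_policy, success_probs, and gamma are hypothetical.

```python
import math
import random


def exp3_policy(success_probs, rounds, gamma=0.1, seed=0):
    """Run EXP3 on simulated Bernoulli arms and return (policy, pulls).

    success_probs are made-up hidden rates used only to simulate feedback.
    Only the chosen arm's outcome is observed each round (bandit feedback).
    """
    rng = random.Random(seed)
    k = len(success_probs)
    weights = [1.0] * k
    pulls = [0] * k

    def mix(ws):
        # Blend the weight-proportional policy with uniform exploration.
        total = sum(ws)
        return [(1 - gamma) * w / total + gamma / k for w in ws]

    for _ in range(rounds):
        probs = mix(weights)
        arm = rng.choices(range(k), weights=probs)[0]
        reward = 1.0 if rng.random() < success_probs[arm] else 0.0
        # Importance-weighted estimate: unbiased despite seeing one arm only.
        estimate = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / k)
        # Rescale so the largest weight is 1.0, which avoids overflow.
        top = max(weights)
        weights = [w / top for w in weights]
        pulls[arm] += 1

    return mix(weights), pulls


# Three abstract arms with invented success rates; arm 2 is the best one.
policy, pulls = exp3_policy([0.1, 0.2, 0.8], rounds=5000)
print("learned policy:", [round(p, 3) for p in policy])
print("pull counts:", pulls)
```

In this toy setting, freezing the returned policy corresponds loosely to the transfer setting, and continuing to update the weights corresponds loosely to the continual one. For evaluators, the point is that the policy concentrates on the strongest arm without the chooser knowing anything about why that arm works.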

For query material, the paper introduces FrankensteinBench, a set of 11,279 malicious queries across six high-stakes domains. The queries come from manual curation, automated enhancement, and seven existing safety benchmarks: AIRBench, WMDP, JailbreakV-28K, HarmBench, MedSafetyBench, JailbreakBench, and HarmfulQA. It reports train, validation, and test splits of 9,036, 1,004, and 1,239 queries, with the test split manually vetted.

The evaluation spans 15 open-weight target models from 270M to 120B parameters, 70 jailbreaks, and six domains. The paper uses gemma-3-27b-it as the response judge, after validating it for that role. It also separates simple malicious queries from complex ones, where complexity means domain-specific technical framing rather than merely longer text.

What the Results Show

On the FrankensteinBench test set, the paper reports an average attack success rate of about 44 percent even without applying jailbreaks. With jailbreaks, individual methods can do better, but the central result is that bandit selection can outperform obvious single-choice baselines and discover model-specific choices.

The headline number is the multiple-pass transfer setting. When five jailbreaks are sampled from the learned policy, the paper reports average attack success rates as high as 97 percent across the 15 open-weight models. This is not evidence that deployed closed systems behave the same way; it is evidence that one-shot tests against hand-picked prompts are too weak.
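A short calculation shows why multiple-pass numbers climb so quickly. The sketch below assumes each attempt succeeds independently, which is a simplifying assumption of this example and not something the paper claims; attempts on the same query are likely positively correlated and would compound less. The function name any_success_rate is hypothetical, and the 0.44 input simply reuses the reported no-jailbreak average as an illustrative per-attempt rate.

```python
def any_success_rate(per_attempt_rates):
    """Probability that at least one attempt succeeds, assuming independence."""
    miss = 1.0
    for p in per_attempt_rates:
        miss *= 1.0 - p  # every attempt has to fail for the whole pass to fail
    return 1.0 - miss


# Five independent attempts at an illustrative 44 percent per-attempt rate.
print(round(any_success_rate([0.44] * 5), 3))  # prints 0.945
```

Under that idealized assumption, five tries at a moderate per-attempt rate already approach certainty, which is why a report that allows only one probe per query can understate exposure.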

The query-complexity result matters too. The paper reports that complex queries have a 50 percent average baseline success rate versus about 39 percent for simple queries, and that jailbreaks can raise attack success by up to 26 percent on some methods. The attack surface includes the model's difficulty distinguishing harmful intent from technical language.
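Reporting success rates separately by query complexity is a small bookkeeping step. The sketch below shows one way to do it, assuming judged outcomes are available as (group, succeeded) pairs; the function name success_rate_by_group and the toy records are invented for illustration and are not data from the paper.

```python
from collections import defaultdict


def success_rate_by_group(records):
    """Return {group: success rate} from (group, succeeded) pairs."""
    counts = defaultdict(lambda: [0, 0])  # group -> [successes, total]
    for group, succeeded in records:
        counts[group][0] += int(succeeded)
        counts[group][1] += 1
    return {group: s / n for group, (s, n) in counts.items()}


# Invented judge outcomes, only to show the shape of the input.
toy_records = [
    ("simple", True), ("simple", False), ("simple", False), ("simple", False),
    ("complex", True), ("complex", True), ("complex", False), ("complex", False),
]
print(success_rate_by_group(toy_records))
```

Keeping the strata apart prevents a pooled average from hiding a gap between simple and complex queries.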

Why This Is Governance

The governance problem is recommendation. Once many attacks are available, an attacker does not need to understand all of them. They need a cheap procedure that chooses among them. A menu becomes more dangerous when feedback turns it into a policy.

For safety teams, this changes the evidence standard. A system card should not merely say that a model was tested against a jailbreak set. It should say whether the test included adaptive selection, how many probes were allowed, whether the adversary could observe outcomes, whether query complexity was varied, and whether defenses were evaluated after learning.

It also changes the meaning of guardrails. A static guard may catch known strings while missing the selection process that makes a weak attack portfolio stronger. An overbroad guard may block legitimate technical speech because it cannot separate malicious operationalization from benign domain vocabulary.

Limits and Release Discipline

The paper is careful about scope. Its main experiments are single-turn and mostly English. The authors list multi-turn attacks, multilingual attacks, evolving jailbreak sets, and better contextual bandits as future work. They also present only a limited proprietary-model case study, not a comprehensive audit of closed systems.

The ethics section matters because the benchmark is dual-use. The authors state that their team and annotators were warned about offensive and malicious content. They release code for reproducibility, but say dataset access will require approval because the curated queries could be misused. The artifact needed to measure risk can also carry risk.

This page does not reproduce jailbreak text, prompt examples, or operational attack recipes. The lesson for a public safety archive is to keep the aggregate evidence, threat model, and evaluation demands visible without turning the article into another attack menu.

Evaluation Standard

A useful red-team report should include the attacker's information budget: query count, outcome visibility, selection method, and whether the defender saw only final prompts or also the exploration process.

It should also publish separate results for simple and complex queries, static and adaptive attack selection, and one-pass versus multiple-pass evaluation. The evidence should name the model versions, decoding settings, judge model, harm rubric, domains, and benchmark access rules. Without that detail, a low jailbreak score can mean robust behavior or merely a gentle evaluation.
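The reporting items above can be captured as a simple record with a completeness check. The sketch below is one hypothetical layout, not a schema from the paper or from any standard: the class RedTeamReport, its field names, and the helper missing_fields are all assumptions made for this example.

```python
from dataclasses import dataclass, field, fields


@dataclass
class RedTeamReport:
    # What was tested and how it was judged.
    model_version: str = ""
    decoding_settings: str = ""
    judge_model: str = ""
    harm_rubric: str = ""
    domains: list = field(default_factory=list)
    benchmark_access_rules: str = ""
    # The attacker's information budget.
    query_count: int = 0
    outcome_visibility: str = ""
    selection_method: str = ""  # for example "static" or "adaptive"
    defender_view: str = ""     # final prompts only, or the exploration too
    # Result strata that were actually reported.
    reported_strata: list = field(default_factory=list)


def missing_fields(report):
    """List the names of fields left empty, zero, or unset."""
    return [f.name for f in fields(report) if not getattr(report, f.name)]


draft = RedTeamReport(model_version="example-model-v1", judge_model="example-judge")
print(missing_fields(draft))
```

A reviewer could refuse to interpret a jailbreak score until the returned list is empty, which turns the prose standard into something checkable.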

The paper's strongest public contribution is a demand for more realistic measurement. When the attack surface includes a menu of known jailbreaks, the safety claim has to cover the menu, the chooser, the feedback loop, and the guardrail together.
