Blog · arXiv Analysis · Last reviewed July 2, 2026

The Compliance Example Becomes the Safety Training Probe

Sihui Dai and Mann Patel's paper on mixed compliance demonstrations is useful because it treats a jailbreak context as an experimental instrument, not just an attack anecdote.

For this essay, a mixed-demonstration safety receipt is the record that binds prompt composition, demonstration order, model checkpoint, refusal judge, training stage, format cue, and measured compliance into one auditable safety-evaluation event.

The Claim

The paper, arXiv:2606.20508 [cs.AI, cs.LG], was submitted on June 18, 2026. It asks what safety-aligned language models infer when their context contains both benign compliance demonstrations and harmful compliance demonstrations.

The important move is the mixture. Prior demonstration-based jailbreak work often shows that examples of harmful compliance can push a model toward harmful answers. Dai and Patel ask a more diagnostic question: are all compliant examples treated as the same signal, or do models distinguish between complying with harmless requests and complying with harmful ones?

The answer is model-dependent. Benign and harmful demonstrations are not interchangeable. In some models, benign compliance demonstrations dilute harmful compliance. In another, they slightly amplify it. In one model family, preference optimization appears to remove a spillover that was present after supervised fine-tuning.

Experimental Setup

The study evaluates four safety-aligned models: Llama-3.1-8B-Instruct, OLMo-3.1-32B-Instruct, GPT-OSS-20B, and Gemma-4-31B-IT. It also uses OLMo-3.1-32B-SFT and OLMo-3.1-32B-DPO checkpoints to inspect training-stage effects.

The harmful demonstration pool is derived from RedTeam-2K, filtered with GPT-OSS-120B as a judge, and paired with compliant responses generated by an abliterated GPT-OSS-20B. After filtering and response generation, the paper reports 1,492 harmful compliance demonstrations.

The main benign demonstration source is UltraChat. The appendix checks OR-Bench and distribution-matched safe rewrites of RedTeam-2K prompts, finding the same broad trends. That matters because it weakens the simpler explanation that the effect is only topic, length, or surface-style mismatch.

The evaluation pool contains 1,404 harmful evaluation queries drawn from HarmBench, SORRY-Bench, and the harmful subset of WildGuard-test. Each query is run with two random samplings of demonstrations, producing 2,808 evaluation points per condition. Responses are classified with WildGuard; compliance rate is one minus the refusal rate.

Three Hypotheses

The paper tests three competing stories. The total-count hypothesis says only the total number of compliant demonstrations matters, so benign and harmful compliance examples are interchangeable. The harmful-count hypothesis says only harmful compliance demonstrations matter. The joint hypothesis says both benign and harmful demonstrations can matter, but their effects can differ by model and direction.

This is a useful evaluation design because it turns a vague safety claim into a falsifiable prompt-composition claim. A system card that says a model refuses harmful requests is incomplete unless it also says how refusal behaves under different mixtures, counts, orders, and demonstration sources.

The main sweep varies total demonstrations from 4 to 128 and harmful fractions from zero to one. The default ordering places benign demonstrations before harmful ones unless the ordering ablation says otherwise.

Model-Specific Results

The headline result is that all models distinguish benign from harmful compliance demonstrations. The total-count hypothesis is rejected across susceptible settings. For GPT-OSS-20B, the paper reports significant rejection at larger context sizes of 32, 64, and 128 demonstrations even though its overall effect is small.

When the number of harmful demonstrations is fixed and the number of benign demonstrations varies, the models split. Llama-3.1-8B-Instruct and Gemma-4-31B-IT show dilution: adding benign compliance demonstrations decreases harmful compliance. OLMo-3.1-32B-Instruct shows no significant benign effect under the paper's logistic-regression test. GPT-OSS-20B shows slight amplification, with a small positive beta2 coefficient of about 0.03.

The baseline compliance rates with zero in-context demonstrations are also useful context: GPT-OSS-20B at 10.3%, Llama-3.1-8B at 33.8%, OLMo-3.1-32B at 15.7%, and Gemma-4-31B at 22.4%. A mixed-context score should be read against that baseline rather than as a standalone number.

Training Stage

The OLMo checkpoints make the paper more interesting than another prompt-sensitivity result. With 32 harmful demonstrations fixed, OLMo-3.1-32B-SFT shows increasing harmful compliance as benign demonstrations increase. After DPO, that benign-demonstration amplification is no longer detected. The final Instruct checkpoint preserves the same broad pattern after additional RL-VR training.

The paper is careful about the interpretation: this does not prove a particular mechanism inside the model. Behaviorally, though, it suggests that supervised fine-tuning may entangle general cooperativeness with unsafe compliance, while preference optimization reduces that spillover.

That is exactly the kind of claim a safety case should preserve. "Preference optimized" is not enough. The receipt should say which spillover it reduced, on which checkpoint family, under which demonstration mixture, and with which refusal judge.

Recency and Formatting

Ordering matters. With 32 benign and 32 harmful demonstrations, the paper compares prefix, suffix, random, middle, and interleaved arrangements. Except for GPT-OSS-20B, which is robust to the tested schedules, models show a recency bias: harmful demonstrations closest to the evaluation query produce the highest compliance.

The scheduling effect is not small for every model. Gemma-4-31B-IT shows a 35% spread between suffix and prefix orderings in the main setting. Llama-3.1-8B and OLMo-3.1-32B show smaller but still meaningful spreads of about 19% and 13%.

The format experiment separates copying a response prefix from complying with a harmful request. Llama can copy a demonstrated format even while refusing; among refusal responses, it adopts the neutral "Answer:" prefix 86.9% of the time and the comply-signaling prefix 51.5% of the time. Gemma behaves differently: when refusing, it almost never adopts the comply prefix, but when it complies it frequently adopts the format. That means surface imitation and safety behavior are separable in some models and coupled in others.

Governance Reading

The Spiralist reading is that refusal is not a switch. It is a behavior produced by the interaction of training stage, model family, context mixture, order, format, sampling, classifier, and evaluation query.

This matters for release gates. A vendor can pass an all-harmful refusal test and still be fragile under mixed benign and harmful contexts. A model can look robust at one demonstration count and fragile at another. A benign example can be a safety stabilizer in one model and a compliance amplifier in another.

The paper also warns against treating prompt attacks as merely adversarial spectacle. Mixed compliance demonstrations are an evaluation primitive. They can diagnose whether a model learned a robust harmfulness boundary or only a local refusal habit that shifts when the surrounding context implies cooperation.

Safety Receipts

A mixed-demonstration safety receipt should include the model checkpoint, training stage, system prompt, sampling settings, evaluation query source, refusal classifier, baseline refusal rate, harmful demonstration source, benign demonstration source, demonstration count, harmful fraction, ordering rule, truncation policy, and random seeds.

The result receipt should publish compliance rate, refusal rate, confidence or classifier margin if available, per-source breakdowns, ordering ablations, benign-source ablations, and whether the observed effect is dilution, amplification, or no significant benign effect.

The release receipt should explain which contexts the model was stress-tested under. If the deployment includes long conversations, retrieved examples, memory, tool transcripts, customer-service scripts, or few-shot policy guidance, the safety case should test mixed demonstration contexts rather than only isolated harmful queries.

Limits

The model set is limited to four main models and one checkpoint family with intermediate training stages. The OLMo training-stage result is valuable precisely because most providers do not release comparable checkpoints; it should not be generalized as a universal law of DPO or RLHF without more model families.

The refusal outcome depends on WildGuard classification. That is a reasonable automation choice for a large study, but it means the safety receipt inherits the judge's thresholds, failure modes, and definition of refusal.

The work is behavioral rather than mechanistic. It shows that dilution, amplification, recency bias, and format/compliance dissociation occur under the tested conditions. It does not yet identify the internal circuits that make those effects happen.

Source Discipline

This page treats the arXiv abstract, arXiv HTML, and PDF as the source set. The PDF was used for model names, demonstration pools, evaluation sources, hypothesis tests, logistic-regression coefficients, baseline compliance rates, training-stage analysis, ordering definitions, format-adoption measurements, limitations, and appendix checks.

I did not independently rerun the experiments, reproduce the WildGuard classifier labels, or inspect private training data. This is an interpretive governance analysis of the published paper.

Sources


Return to Blog