Last reviewed July 2, 2026

The Refusal Subspace Becomes the Safety Switch

The paper's useful discomfort is simple: refusal is not only a policy written in words. In the tested open-weight chat models, refusal can be moved by editing internal activations at a selected layer and token position.

That makes the hidden steering layer a governance object. If a model can be made more or less refusing by changing a subspace, the audit cannot stop at the final answer.

The Paper

The paper is Refusal Beyond a Single Direction: A Preliminary Comparison of Diff-in-Means and INLP, arXiv:2606.13720 [cs.AI], by Elisabetta Rocchetti and Alfio Ferrara of the Department of Computer Science, Università degli Studi di Milano. arXiv lists version 1 as submitted on June 11, 2026, with DOI 10.48550/arXiv.2606.13720. The arXiv HTML page lists the license notice as CC BY-NC-SA 4.0.

The paper starts from Arditi et al.'s 2024 finding that refusal in safety fine-tuned chat models can be mediated by a single residual-stream direction recovered by a difference-in-means, or DiM, between harmful and harmless activations. Rocchetti and Ferrara ask whether Iterative Nullspace Projection, or INLP, can recover a richer intervention object: not one vector, but a subspace that can be projected away or reflected across.

The distinction matters because refusal steering is dual-use. A method that suppresses over-refusal can also suppress legitimate safety refusal. A method that induces refusal can also deny benign requests. The paper's own impact statement is explicit that the studied interventions can be used both to remove safety-aligned refusal and to induce refusal where it would otherwise not appear.

Four Switches

The comparison covers two DiM interventions and two INLP interventions. DiM builds a steering vector from the difference between the class means of harmful and harmless prompt activations. Activation addition, or ActAdd, adds or subtracts that vector during inference. Directional ablation projects activations onto the hyperplane orthogonal to the DiM direction.
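The two DiM interventions described above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic activations, not the paper's code: the function names, the toy data, and the coefficient argument are assumptions made for the example.

```python
import numpy as np

def dim_direction(harmful, harmless):
    """Difference between the class means of harmful and harmless activations."""
    return harmful.mean(axis=0) - harmless.mean(axis=0)

def act_add(x, r, coeff=1.0):
    """Activation addition: shift activations along r (a negative coeff subtracts)."""
    return x + coeff * r

def directional_ablation(x, r):
    """Project activations onto the hyperplane orthogonal to r."""
    r_hat = r / np.linalg.norm(r)
    return x - np.outer(x @ r_hat, r_hat)

# Toy activations (assumption): the harmful class is offset along axis 0.
rng = np.random.default_rng(0)
d = 16
harmful = rng.normal(size=(200, d)) + 2.0 * np.eye(d)[0]
harmless = rng.normal(size=(200, d))

r = dim_direction(harmful, harmless)
ablated = directional_ablation(harmful, r)
shifted = act_add(harmless, r, coeff=1.0)

# After ablation, no activation keeps a component along r.
print(np.abs(ablated @ (r / np.linalg.norm(r))).max())
```

The contrast worth noticing is that ActAdd moves every activation by the same offset, while ablation removes a different amount from each activation depending on how much of r it already contains.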

INLP instead repeatedly trains linear classifiers to distinguish the two activation classes, projecting out each classifier direction until held-out accuracy falls to chance. The accumulated classifier directions span a rowspace associated with the concept; projecting onto its orthogonal complement, the nullspace, is intended to remove linearly decodable concept information. Nullspace projection, the alpha=1 case, removes the component of the activation that lies in the learned subspace. Counterfactual flipping, the alpha=2 case, reflects across the nullspace so the activation is pushed toward the opposite class while preserving orthogonal components.
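A rough sketch of the INLP loop and the alpha-parameterised operator follows. It assumes a plain least-squares probe, a fixed tolerance around chance, and the operator form x - alpha * P x, where P projects onto the learned rowspace; the paper's classifier, stopping rule, and any centering may differ, and all names here are hypothetical.

```python
import numpy as np

def inlp_directions(X_tr, y_tr, X_va, y_va, max_iters=8, tol=0.05):
    """Fit a linear probe, record its direction, project it out, and repeat.

    Stops once held-out accuracy is within `tol` of chance (0.5) or after
    `max_iters` rounds. Returns orthonormal rows spanning the rowspace.
    """
    mu = X_tr.mean(axis=0)
    A, B = X_tr - mu, X_va - mu
    t = np.where(y_tr == 1, 1.0, -1.0)
    dirs = []
    for _ in range(max_iters):
        w, *_ = np.linalg.lstsq(A, t, rcond=None)  # least-squares probe (assumption)
        for u in dirs:                              # re-orthogonalise for numerical safety
            w = w - (w @ u) * u
        norm = np.linalg.norm(w)
        if norm < 1e-10:
            break
        w = w / norm
        acc = np.mean((B @ w > 0) == (y_va == 1))
        if acc <= 0.5 + tol:                        # probe is at chance: stop
            break
        dirs.append(w)
        A = A - np.outer(A @ w, w)                  # remove the direction from both splits
        B = B - np.outer(B @ w, w)
    return np.array(dirs).reshape(-1, X_tr.shape[1])

def intervene(x, W, alpha):
    """alpha=1 projects onto the nullspace of W; alpha=2 reflects across it."""
    return x - alpha * (x @ W.T) @ W

# Toy data (assumption): classes sit symmetrically at +v and -v.
rng = np.random.default_rng(0)
d = 16
v = np.zeros(d)
v[:2] = 1.5
X_tr = np.vstack([rng.normal(size=(200, d)) + v, rng.normal(size=(200, d)) - v])
y_tr = np.array([1] * 200 + [0] * 200)
X_va = np.vstack([rng.normal(size=(100, d)) + v, rng.normal(size=(100, d)) - v])
y_va = np.array([1] * 100 + [0] * 100)

W = inlp_directions(X_tr, y_tr, X_va, y_va)
erased = intervene(X_va, W, alpha=1.0)
flipped = intervene(X_va, W, alpha=2.0)
print(W.shape)
```

With alpha=2 the rowspace coordinates change sign while everything orthogonal to the rowspace is untouched, which is why applying the flip twice returns the original activation.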

The paper also tests tunable INLP variants. Rather than always using every extracted direction (the k=n setting), it evaluates partial subspaces such as the first direction alone, k0.9, and k0.8. The practical question is whether fewer directions preserve most of the refusal effect while reducing damage to general model behavior.
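To make the role of k concrete, the sketch below truncates a set of orthonormal directions to the first k and measures how much of an activation the resulting projection removes. The directions are random stand-ins rather than INLP output, and the rule that maps k0.9 or k0.8 to a specific k is not reproduced here because this post does not state it.

```python
import numpy as np

def truncated_projection(W, k):
    """Projection matrix onto the span of the first k orthonormal rows of W."""
    Wk = W[:k]
    return Wk.T @ Wk

rng = np.random.default_rng(1)
d, n = 12, 5
# Stand-in for n extracted directions (assumption): orthonormalise random columns via QR.
Q, _ = np.linalg.qr(rng.normal(size=(d, n)))
W = Q.T                                   # shape (n, d), rows orthonormal
x = rng.normal(size=d)

removed_norms = []
for k in range(1, n + 1):
    removed = truncated_projection(W, k) @ x
    removed_norms.append(float(np.linalg.norm(removed)))
print(removed_norms)
```

The removed norm can only grow as k grows, which is the mechanical reason a smaller k is a candidate for preserving more of the unrelated signal in the activation.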

Experimental Frame

The evaluated models are Gemma 2B-IT, Qwen 1.8B-Chat, Yi 6B-Chat, Llama-2 7B-Chat, and Llama-3 8B-Instruct. The paper deliberately uses the smallest model in each family and leaves larger-family scaling to future work.

The contrastive data follows the refusal-direction literature. The harmful training set samples AdvBench, MaliciousInstruct, and TDC2023; harmful validation uses HarmBench; harmless training and validation use Alpaca. The test side uses 100 harmful instructions from JailbreakBench across ten harm categories and 100 harmless instructions from a fixed-seed Alpaca split disjoint from train and validation.

Capability checks use held-out Alpaca and The Pile for perplexity, a 500-question stratified MMLU sample, and ARC-Challenge with 5-shot prompts. Behavioral metrics include substring-matching non-refusal on harmful prompts, LlamaGuard 2 unsafety, a Qwen2.5-14B-Instruct structured refusal judge, harmless-prompt refusal rates, and greedy completions capped at 256 new tokens.

Layer and token position are not assumed. Candidate interventions are ranked by a composite score that rewards harmful refusal suppression and harmless refusal induction while penalizing final-logits KL divergence from the unintervened model. The authors report that DiM and INLP select different layer/token pairs across all five models, even when their ranking patterns are broadly related.
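The shape of such a ranking can be illustrated as follows. The additive form, the KL weight, and the candidate numbers are all assumptions for the example; the post describes only what the composite rewards and penalises, not its exact formula.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def mean_kl(base_logits, new_logits):
    """Mean KL(base || intervened) over prompts, computed from final-position logits."""
    lp, lq = log_softmax(base_logits), log_softmax(new_logits)
    return float((np.exp(lp) * (lp - lq)).sum(axis=-1).mean())

def composite_score(d_nonrefusal_harmful, d_refusal_harmless, kl, kl_weight=1.0):
    """Reward both behavioural shifts, penalise drift from the unintervened model."""
    return d_nonrefusal_harmful + d_refusal_harmless - kl_weight * kl

# Hypothetical candidates: (layer, token position) with made-up measurements.
candidates = [
    {"layer": 10, "pos": -1, "d_h": 0.60, "d_b": 0.30, "kl": 0.05},
    {"layer": 14, "pos": -1, "d_h": 0.70, "d_b": 0.40, "kl": 0.90},
    {"layer": 12, "pos": -2, "d_h": 0.50, "d_b": 0.20, "kl": 0.02},
]
ranked = sorted(
    candidates,
    key=lambda c: composite_score(c["d_h"], c["d_b"], c["kl"]),
    reverse=True,
)
best = ranked[0]
print(best["layer"], best["pos"])

# KL between two sets of toy logits, to show how the penalty term would be measured.
rng = np.random.default_rng(0)
base = rng.normal(size=(4, 50))
drift = mean_kl(base, base + 0.5 * rng.normal(size=(4, 50)))
```

In this toy ranking the candidate with the largest behavioural shift loses to a milder one because its KL penalty is large, which is the trade-off the selection procedure is built to expose.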

Main Results

The headline result is that INLP counterfactual flipping is competitive with DiM directional ablation at suppressing refusal on harmful prompts. On Llama-3 8B, Delta non-refusal on harmful prompts is +0.95 for directional ablation and +0.96 for counterfactual flipping at both k=n and k0.8. On Llama-2 7B, directional ablation reaches +0.46, while counterfactual flipping reaches +0.70 at k=n and +0.76 at k0.8. On Yi 6B the two methods sit close: +0.60 for directional ablation and +0.61 for counterfactual flipping at k0.8.

Nullspace projection is the weaker of the two INLP operations. It helps on Gemma, Yi, and partly Llama-3, but it nearly fails on Llama-2 and is unstable on Qwen. The Qwen case is also the clearest warning against treating a larger subspace as automatically better: counterfactual flipping with k=n reports Delta non-refusal harmful -0.13 and a median perplexity hit of -0.71, while k0.8 recovers +0.62 with a much smaller perplexity change of -0.04.

ActAdd posts the strongest surface numbers for harmless-prompt refusal injection, with Delta refusal harmless of at least +0.86 on every model and +1.00 on Llama-3. But the paper shows that this advantage is partly contaminated by degeneracy. ActAdd produces looping or repetitive completions and large perplexity degradation, so many of its induced refusals on benign prompts are not clean policy shifts; they are repetitive outputs that happen to contain refusal phrases.
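One simple way to separate clean refusals from looping output that merely contains a refusal phrase is a repeated n-gram check. The marker list, the trigram choice, and the 0.5 threshold below are hypothetical; the paper's completion tagging is not reproduced here.

```python
# Hypothetical marker list, not the paper's substring set.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry")

def repeated_ngram_fraction(text, n=3):
    """Share of word n-grams that duplicate an earlier n-gram in the same text."""
    tokens = text.lower().split()
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)

def tag_completion(text, threshold=0.5):
    """Label a completion by refusal phrase presence and by repetitiveness."""
    lowered = text.lower()
    refuses = any(marker in lowered for marker in REFUSAL_MARKERS)
    looping = repeated_ngram_fraction(text) >= threshold
    if refuses:
        return "degenerate_refusal" if looping else "clean_refusal"
    return "degenerate_other" if looping else "no_refusal"

loop = "I cannot I cannot I cannot I cannot I cannot I cannot"
clean = "I'm sorry, but I cannot help with that request."
answer = "Here is a short poem about spring."

for text in (loop, clean, answer):
    print(tag_completion(text))
```

A surface refusal metric would count both of the first two completions as refusals; only the second tag tells them apart.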

The more durable result is tunability. For Gemma, Yi, Llama-2, and Llama-3, restricting INLP to k0.8 keeps most refusal-suppression benefit while reducing perplexity degradation. The authors report that k0.8 keeps all five models within 0.05 of baseline perplexity, and the MMLU plus ARC column shows no meaningful 5-shot accuracy drop under the reported interventions.

Geometry

The paper's most interesting contribution is not just that one intervention "works." It is the difference between absence and opposite in activation space. In the PCA view, the first axis is fixed to the harmful-harmless centroid direction, making movement along the refusal axis visible.
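A centroid-anchored projection of this kind can be sketched as below: the first axis is the unit vector between the class centroids, and the second is the top principal component of what remains after that axis is removed. Using a single residual component and pooled centering are assumptions; the paper's plotting code may differ.

```python
import numpy as np

def centroid_anchored_basis(harmful, harmless):
    """Return (center, 2 x d basis): axis 1 is the centroid direction, axis 2 the top residual PC."""
    axis1 = harmful.mean(axis=0) - harmless.mean(axis=0)
    axis1 = axis1 / np.linalg.norm(axis1)
    pooled = np.vstack([harmful, harmless])
    center = pooled.mean(axis=0)
    centered = pooled - center
    resid = centered - np.outer(centered @ axis1, axis1)   # strip the centroid axis
    _, _, vt = np.linalg.svd(resid, full_matrices=False)
    return center, np.stack([axis1, vt[0]])

def project(points, center, basis):
    """2D coordinates of points in the anchored basis."""
    return (points - center) @ basis.T

# Toy activations (assumption): classes offset along one coordinate.
rng = np.random.default_rng(0)
d = 16
offset = np.zeros(d)
offset[0] = 3.0
harmful = rng.normal(size=(150, d)) + offset
harmless = rng.normal(size=(150, d))

center, basis = centroid_anchored_basis(harmful, harmless)
coords_harmful = project(harmful, center, basis)
coords_harmless = project(harmless, center, basis)
print(coords_harmful[:, 0].mean(), coords_harmless[:, 0].mean())
```

Because the first axis is fixed by construction rather than chosen by variance, any intervention's movement along the harmful-harmless direction can be read directly off the horizontal coordinate.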

Directional ablation pushes harmful activations onto the harmless side. ActAdd and counterfactual flipping produce a two-way swap, moving transformed points toward the opposite class centroid. Nullspace projection behaves differently: it collapses transformed activations into the region between the harmful and harmless clusters.

That geometry is why the paper is more than another steering benchmark. Erasing the subspace is not the same as making the activation look like the opposite class. The authors read this cautiously as evidence that the model may encode absence-of-concept differently from concept-opposite. Refusal is a messy case for this distinction because "harmless" is both not-harmful and its own positive class, so they call for follow-up on concepts with cleaner opposites or neutral states.

The target-fit analysis supports the visual story. Counterfactual flipping lands closest to the opposite-class centroid across most models. ActAdd has a wider, more dispersed harmful-to-harmless distribution, consistent with its perplexity cost. Nullspace projection sits farther from the target centroid, matching the interpretation that it leaves activations in an absence-like region rather than a true opposite-class region.
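A target-fit measurement of this sort reduces to distances from transformed activations to the opposite-class centroid. The sketch below uses Euclidean distance, a single centroid direction, and a reflection taken about the midpoint between the two centroids; all three are assumptions chosen to keep the toy readable.

```python
import numpy as np

def target_fit(transformed, target_class):
    """Mean and spread of distances from transformed points to the target centroid."""
    centroid = target_class.mean(axis=0)
    dists = np.linalg.norm(transformed - centroid, axis=1)
    return float(dists.mean()), float(dists.std())

# Toy activations (assumption): classes sit at +4 and -4 along one coordinate.
rng = np.random.default_rng(0)
d = 16
offset = np.zeros(d)
offset[0] = 4.0
harmful = rng.normal(size=(300, d)) + offset
harmless = rng.normal(size=(300, d)) - offset

u = harmful.mean(axis=0) - harmless.mean(axis=0)
u = u / np.linalg.norm(u)
midpoint = 0.5 * (harmful.mean(axis=0) + harmless.mean(axis=0))

along = np.outer((harmful - midpoint) @ u, u)   # component along the centroid axis
erased = harmful - along                        # absence-like: collapse onto the midpoint plane
flipped = harmful - 2.0 * along                 # opposite-like: reflect to the other side

fit_erased = target_fit(erased, harmless)
fit_flipped = target_fit(flipped, harmless)
print(fit_erased, fit_flipped)
```

In this toy setting the reflected points sit closer to the harmless centroid than the erased ones, mirroring the absence-versus-opposite reading described above.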

Measurement

The refusal metric is itself part of the evidence. The paper reports that substring matching and the LLM judge agree on 84% of completions across the experimental grid, but disagreement is concentrated on harmful prompts and rises with intervention aggressiveness. At baseline the disagreement is 7.2%; under alpha=2 counterfactual flipping it reaches about 42%.
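Breaking disagreement down by intervention condition is a small bookkeeping exercise. The records below are invented purely to show the computation and carry no relation to the paper's numbers; the condition labels are hypothetical.

```python
from collections import defaultdict

def disagreement_by_condition(records):
    """records: iterable of (condition, substring_says_refusal, judge_says_refusal)."""
    totals, differs = defaultdict(int), defaultdict(int)
    for condition, substring_refusal, judge_refusal in records:
        totals[condition] += 1
        differs[condition] += int(substring_refusal != judge_refusal)
    return {c: differs[c] / totals[c] for c in totals}

# Made-up labels for illustration only.
records = [
    ("baseline", True, True),
    ("baseline", False, False),
    ("baseline", True, False),
    ("baseline", True, True),
    ("flip_alpha2", True, False),
    ("flip_alpha2", False, True),
    ("flip_alpha2", True, True),
    ("flip_alpha2", False, False),
]
rates = disagreement_by_condition(records)
print(rates)
```

Reporting the per-condition rate alongside the pooled agreement figure is what reveals whether the two instruments diverge exactly where the intervention is most aggressive.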

The authors choose substring matching as the headline metric because their audit finds that the judge over-fires on taboo-topic prompts where the response delivers the requested harmful content wrapped in a moralizing disclaimer. That does not make substring matching perfect. It means the paper treats refusal measurement as a contested instrument rather than a settled oracle.

The completion analysis also matters for governance. Surviving harmful-prompt refusals are mostly ethics-framed across methods, but harmless-prompt induced refusals differ in quality. ActAdd's surface advantage partly comes from repetitive outputs and non-principled refusal tags; INLP variants more often produce clarification-seeking or principled refusals when they induce refusal, though coverage is smaller and uneven.

Artifacts

The arXiv HTML footnote links to an anonymous 4open artifact for code. Its README describes a project comparing diff-in-means with INLP on refusal direction extraction and manipulation, built on the original refusal_direction repository and extended with INLP direction extraction, counterfactual reflection, additional interventions, and evaluation stages.

The README documents a pipeline with extraction, component selection, inference, and evaluation; stage-level switches such as --extract_only, --select_only, --infer_only, --use_existing, --resume_from_eval, and --skip_eval; and runtime options including --top_percentage, --compare_rankings, --device, and --vllm_gpu_memory_utilization. It also warns that a Hugging Face token is required for gated models and a Together AI token is used for jailbreak safety evaluation.

I found no explicit license line in the artifact README. Because the link is anonymous and not a conventional public GitHub release, it is useful as an implementation receipt but weaker as a stable reproducibility artifact than a versioned repository with release tags, archived data, and license metadata.

Safety-Switch Receipt

A refusal-steering paper should ship a safety-switch receipt. The receipt should name the model family and exact checkpoint, the refusal and harmless contrastive datasets, train/validation/test splits, selected layer and token position, extraction method, intervention operator, coefficient or alpha, number of directions k, activation hook, token schedule, decoding settings, refusal metrics, safety classifier, judge model, judge prompt and schema, disagreement audit, benign refusal rate, capability metrics, perplexity deltas, completion-quality tags, and artifact revision.
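As a sketch of what a machine-checkable receipt could look like, the dataclass below covers a subset of the fields listed above and reports which ones were left blank. The class name, field names, and the example values are hypothetical and do not correspond to any schema the paper or its artifact ships.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class SafetySwitchReceipt:
    """Hypothetical receipt covering a subset of the fields named in the text."""
    model_checkpoint: Optional[str] = None
    contrastive_datasets: Optional[str] = None
    layer: Optional[int] = None
    token_position: Optional[int] = None
    extraction_method: Optional[str] = None   # e.g. "DiM" or "INLP"
    intervention: Optional[str] = None
    alpha: Optional[float] = None
    k_directions: Optional[int] = None
    decoding: Optional[str] = None
    refusal_metric: Optional[str] = None
    judge_model: Optional[str] = None
    artifact_revision: Optional[str] = None

def missing_fields(receipt):
    """Names of receipt fields that were never filled in."""
    return [f.name for f in fields(receipt) if getattr(receipt, f.name) is None]

# Partially filled example; the values are illustrative only.
receipt = SafetySwitchReceipt(
    extraction_method="INLP",
    intervention="counterfactual flipping",
    alpha=2.0,
    decoding="greedy, max 256 new tokens",
    judge_model="Qwen2.5-14B-Instruct",
)
missing = missing_fields(receipt)
print(missing)
```

A reviewer could refuse to accept a steering result until the missing list is empty, which turns the receipt from a reporting norm into a gate.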

The receipt should also separate three claims. The first is representational: a direction or subspace linearly distinguishes harmful and harmless prompt activations. The second is causal: intervening on that object changes refusal behavior under a stated prompt distribution. The third is deployable: the intervention improves a real safety or usability objective without unacceptable off-target behavior. The paper has evidence for the first two under its experimental frame; it does not claim the third.

This connects directly to Activation Steering, Mechanistic Interpretability, AI Alignment, AI Jailbreaks, Frontier AI Safety Frameworks, Safeguarding, Research Integrity, The Feature Geometry Becomes the Stress Test, and The Tool Menu Becomes the Attack Surface.

Limits

The paper is preliminary by design. It uses five small-to-medium open-weight chat models, not frontier proprietary systems or larger members of the same model families. The authors explicitly leave scaling behavior for INLP interventions open.

The selection procedure is also asymmetric. For time and compute reasons, the selection scores for both INLP interventions are computed using nullspace projection and then reused for counterfactual flipping. The authors expect flipping-specific selection could improve the operating point, especially for harmless-side injection, but that grid is left for future work.

The contrastive setup inherits dataset choices from prior refusal-direction work, including whatever distributional biases are present in the harmful and harmless sets. The measurement section is careful, but still relies on substring matching, LlamaGuard 2, and one structured Qwen2.5-14B-Instruct judge. A stronger judge or different audit protocol could change absolute rates even if the method ordering remains stable.

The right conclusion is therefore not "INLP solves refusal steering." It is narrower and more useful: a learned refusal subspace can support multiple qualitatively different runtime interventions, and those interventions need to be audited as hidden safety switches rather than treated as invisible implementation details.

Sources

Elisabetta Rocchetti and Alfio Ferrara, Refusal Beyond a Single Direction: A Preliminary Comparison of Diff-in-Means and INLP, arXiv:2606.13720 [cs.AI], submitted June 11, 2026.
arXiv HTML: Refusal Beyond a Single Direction: A Preliminary Comparison of Diff-in-Means and INLP, reviewed for authorship, affiliation, abstract, methods, result headings, appendix structure, code footnote, and license notice.
arXiv PDF: Refusal Beyond a Single Direction: A Preliminary Comparison of Diff-in-Means and INLP, reviewed for Table 1 values, experimental setup, completion analysis, measurement audit, limitations, conclusion, and impact statement.
arXiv TeX source: e-print source for arXiv:2606.13720, reviewed for source-level method wording, table entries, appendix details, and artifact URL.
Anonymous code artifact: refusal_direction-5652 README, reviewed for pipeline description, supported interventions, stage-level execution flags, dependencies, token requirements, artifact path conventions, and missing explicit license line.
