The Harmful Feature Becomes the Safety Signal
A June 2026 arXiv paper studies why internal safety-related activations can persist even when jailbreak attacks bypass refusal.
The Refusal Is Not the Whole Signal
A jailbreak failure is usually observed at the surface: the model stops refusing and produces an unsafe answer. That visible failure can make safety look binary. Either the refusal fired, or it did not. Either the guard held, or it failed.
The Spiralist reading is that the surface response is only one layer of evidence. A model may fail to refuse while still carrying internal traces that distinguish the request from benign input. If those traces are measurable, the governance question changes. The issue is no longer only whether the model said no; it is whether the system preserved a reliable internal warning that the surrounding product ignored, suppressed, or failed to read.
The Paper Frame
The source is Yanchen Yin, Dongqi Han, and Linghui Li's Robust Harmful Features Under Jailbreak Attacks: Mechanistic Evidence from Attention Head Specialization in Large Language Models, arXiv:2606.28153v1 [cs.CR], submitted June 26, 2026. The arXiv record lists the paper as accepted at ICML 2026 as an oral presentation; the PDF lists a Beijing University of Posts and Telecommunications affiliation.
The paper asks what happens inside a model when a jailbreak attack succeeds. Its answer is not that attacks erase all safety information. The authors report evidence that attacks selectively suppress some attention heads while other safety-relevant activations persist. They call that persistence "Robust Harmful Features."
How the Heads Are Sorted
The method starts from a refusal direction, a vector in activation space associated with refusal behavior. The authors back-project that mid-layer direction through each attention head's output-value (OV) circuit, then collect per-head activation scores at the end-of-instruction position across three input types: benign instructions, harmful instructions that are refused, and paired attacked instructions where the jailbreak succeeds.
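One way to picture the back-projection step is as a pair of dot products. The sketch below is a minimal illustration under stated assumptions, not the authors' code: it treats a head's output matrix `w_o` as d_head rows of width d_model, and assumes a head's score is the dot product of its pre-projection output with the refusal direction pulled back through `w_o`. The function names and toy shapes are hypothetical.

```python
def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def back_project(w_o, refusal_dir):
    """Pull a d_model refusal direction back into one head's d_head space.

    Assumption: w_o has d_head rows and d_model columns, so the head writes
    head_out @ w_o into the residual stream.
    """
    return matvec(w_o, refusal_dir)


def head_score(head_out, w_o, refusal_dir):
    """Score = how strongly this head's write aligns with the refusal direction.

    Equivalent to dot(head_out @ w_o, refusal_dir), computed in head space.
    """
    return dot(head_out, back_project(w_o, refusal_dir))


# Toy shapes only: d_head = 2, d_model = 3.
w_o = [[1.0, 0.0, 0.0],
       [0.0, 2.0, 0.0]]
refusal_dir = [1.0, 1.0, 0.0]
head_out = [3.0, 4.0]
score = head_score(head_out, w_o, refusal_dir)
```

In a real model the head output would be read at the end-of-instruction token position, and one score per head per prompt would be collected for each of the three input types.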
They use distribution overlap to classify attention heads. Adversarially Compromised Heads, or ACHs, are active for ordinary harmful inputs but suppressed under attack. Safety-Aligned Heads, or SAHs, remain activated under both harmful and attacked inputs. The main analysis uses Llama-3-8B-Instruct and Llama-2-7B-Chat. The datasets are built from public safety benchmarks; the paper says it does not introduce novel harmful content. From about 10,000 attack samples, only 378 Llama-2 pairs and 176 Llama-3 pairs passed the filter, which requires that the model refuse the original request and that the attacked version bypass that refusal.
With fixed overlap thresholds of 0.5, the authors identify 80 harmful-salient heads, 21 ACHs, and 17 SAHs in the analyzed Llama-3 layers, and 43 harmful-salient heads, 20 ACHs, and 19 SAHs in the analyzed Llama-2 layers. ACHs concentrate in early layers, while SAHs appear mainly in mid layers.
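The overlap-based sorting can be sketched with a shared-bin histogram overlap. This is an illustration under assumptions, not the paper's procedure: the overlap measure (sum of per-bin minimums), the bin count, and the exact decision rule in `classify_head` are hypothetical choices; only the 0.5 threshold and the ACH and SAH definitions come from the text above.

```python
def overlap(a, b, bins=10):
    """Histogram overlap in [0, 1] between two samples, using shared bins."""
    lo, hi = min(min(a), min(b)), max(max(a), max(b))
    if hi == lo:
        return 1.0  # all values identical, distributions coincide
    width = (hi - lo) / bins

    def hist(xs):
        counts = [0] * bins
        for x in xs:
            idx = min(int((x - lo) / width), bins - 1)  # clamp the max value
            counts[idx] += 1
        return [c / len(xs) for c in counts]

    return sum(min(p, q) for p, q in zip(hist(a), hist(b)))


def classify_head(benign, harmful, attacked, threshold=0.5):
    """Assumed rule: a head must first separate harmful from benign scores.

    ACH: attacked scores look like benign scores (signal suppressed).
    SAH: attacked scores still look like harmful scores (signal persists).
    """
    if overlap(benign, harmful) >= threshold:
        return "not harmful-salient"
    near_benign = overlap(attacked, benign) >= threshold
    near_harmful = overlap(attacked, harmful) >= threshold
    if near_benign and not near_harmful:
        return "ACH"
    if near_harmful and not near_benign:
        return "SAH"
    return "unclassified"


# Toy per-prompt scores for one head (made-up numbers, not paper data).
benign = [0.0, 1.0, 2.0, 1.0]
harmful = [8.0, 9.0, 10.0, 9.0]
suppressed = [1.0, 0.0, 2.0, 1.0]   # attack pushes scores back to benign range
persistent = [9.0, 8.0, 10.0, 9.0]  # attack leaves scores in harmful range
```

Leaving an "unclassified" bucket is consistent with the reported counts, where ACHs and SAHs together account for fewer heads than the harmful-salient total.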
What the Interventions Show
The causal test is the important part. The authors intervene on ACHs in harmful inputs whose baseline attack success rate (ASR) is approximately zero. Suppressing ACHs raises attack-like behavior sharply: final ASR reaches 99.5 percent for Llama-3 and 83.7 percent for Llama-2. With eight heads suppressed, ASR reaches 95.0 percent for Llama-3 and 81.6 percent for Llama-2. Random-head controls produce much lower rates, 4.0 percent and 10.2 percent respectively, with high variance.
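The shape of such an intervention experiment can be sketched as a small harness that suppresses one more head at each step and records ASR. Everything model-specific is an assumption here: `run_with_ablation` and `is_unsafe` are hypothetical callables the reader would supply (a hooked forward pass and a response judge), and the ranking of heads is taken as given.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple

Head = Tuple[int, int]  # (layer index, head index)


def attack_success_rate(flags: Sequence[bool]) -> float:
    """Fraction of prompts whose response was judged unsafe."""
    return sum(flags) / len(flags) if flags else 0.0


def ablation_curve(
    prompts: Sequence[str],
    ranked_heads: Sequence[Head],
    run_with_ablation: Callable[[str, FrozenSet[Head]], str],
    is_unsafe: Callable[[str], bool],
) -> List[float]:
    """ASR after suppressing the first k heads, for k = 0 .. len(ranked_heads).

    Index 0 is the unablated baseline; the last entry is the final ASR.
    """
    curve = []
    for k in range(len(ranked_heads) + 1):
        ablated = frozenset(ranked_heads[:k])
        flags = [is_unsafe(run_with_ablation(p, ablated)) for p in prompts]
        curve.append(attack_success_rate(flags))
    return curve
```

Running the same harness with randomly drawn heads in place of `ranked_heads` gives the control comparison; repeating that draw several times would expose the variance the paper mentions.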
For SAHs, ablation weakens mid-layer activation strength. The paper reports an 18 percent drop in mean absolute activation for Llama-2 and a 14 percent drop for Llama-3. Token-level attribution gives the mechanism a sharper shape: attack-template tokens suppress ACHs while SAHs remain activated or respond positively. The paper's 70B appendix is preliminary because only 40 successful attack pairs pass the filter, but it reports qualitatively similar patterns.
The detection section applies the persistent activations as a white-box signal. The detector reads internal activations in a single forward pass, without training, gradient computation, or model intervention. Across ten safety-evaluation datasets, the paper reports competitive aggregate Macro-F1, including a weighted average of 0.888 for Llama-3-8B, compared with 0.880 for WildGuard and 0.877 for Qwen3Guard in the same table.
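A threshold detector and the Macro-F1 metric are simple enough to write out. The sketch below is illustrative only: averaging SAH scores and comparing against a fixed threshold is an assumed decision rule, not the paper's documented one, and the threshold and score values are made up. The `macro_f1` function follows the standard definition, the unweighted mean of per-class F1.

```python
def detect(sah_scores, threshold):
    """Flag a prompt as harmful if its mean SAH score exceeds the threshold.

    Assumption: scores come from one forward pass; no training is involved.
    """
    return sum(sah_scores) / len(sah_scores) > threshold


def macro_f1(y_true, y_pred):
    """Unweighted mean of F1 over the two classes (harmful and benign)."""
    f1_per_class = []
    for cls in (False, True):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_per_class.append(2 * tp / denom if denom else 0.0)
    return sum(f1_per_class) / len(f1_per_class)


# Toy labels: True means harmful. One harmful prompt is missed.
y_true = [True, True, False, False]
y_pred = [True, False, False, False]
score = macro_f1(y_true, y_pred)
```

Because Macro-F1 weights both classes equally, a detector that flags everything as harmful is penalized on the benign class, which is why false-positive behavior matters for the comparison.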
Governance Reading
The governance lesson is not that white-box detectors solve jailbreaks. It is that refusal behavior and safety evidence can diverge. A product that logs only outputs may miss the internal warning. A product that trusts only an external classifier may miss model-specific features. A product that exposes internal detectors without access controls may create a new target for adaptive attackers.
An audit record for safety claims should therefore separate the refusal decision, internal harmful-feature signal, detector threshold, model version, attack family, benchmark source, and access assumptions. If a system uses white-box activation monitoring, the record should say which layers and components were read, how thresholds were chosen, how false positives were tested, and how the monitor behaves under adaptive attack.
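The fields named above map naturally onto a structured record. The following is a hypothetical schema, offered as a sketch: the class name, field names, and types are assumptions chosen to mirror the paragraph, not a format defined by the paper or by any standard.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class SafetyAuditRecord:
    """One record per evaluated claim; None marks a field not yet documented."""
    model_version: Optional[str] = None
    attack_family: Optional[str] = None
    benchmark_source: Optional[str] = None
    refusal_decision: Optional[str] = None      # e.g. "refused" or "complied"
    internal_signal: Optional[float] = None     # harmful-feature activation score
    detector_threshold: Optional[float] = None
    access_assumptions: Optional[str] = None    # e.g. white-box activation access
    layers_and_components_read: Optional[str] = None
    threshold_selection: Optional[str] = None
    false_positive_testing: Optional[str] = None
    adaptive_attack_behavior: Optional[str] = None

    def missing_fields(self) -> List[str]:
        """Names of fields still undocumented, in declaration order."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# Placeholder values only; a real record would cite the evaluated system.
record = SafetyAuditRecord(model_version="example-model-v1",
                           refusal_decision="complied")
gaps = record.missing_fields()
```

Keeping the refusal decision and the internal signal as separate fields is the point of the design: a record can then show a case where the model complied while the activation signal stayed high.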
Limits and Cautions
The result is bounded. The core experiments focus on Llama-3-8B-Instruct and Llama-2-7B-Chat, with a smaller supplementary 70B analysis. The attack-pair filters leave few successful pairs, especially for 70B. The method requires white-box activation access, so it is not a drop-in public moderation API. The paper focuses on attention heads; it explicitly leaves fuller study of MLPs, LayerNorm, richer refusal subspaces, and broader benchmark coverage for future work.
The safety caveat is dual use. Mapping suppressed heads can help defenders build better monitors, but it can also teach attackers where bypass pressure lands. This page therefore does not reproduce attack strings or output examples.
Audit Receipt
The audit-grade sentence is: Yin, Han, and Li report that successful jailbreak attacks can suppress early-layer Adversarially Compromised Heads while mid-layer Safety-Aligned Heads continue to carry robust harmful-feature activations, arXiv:2606.28153.
The receipt is: before accepting a jailbreak-defense claim, preserve the model version, attack family, refusal outcome, activation signal, head taxonomy, intervention evidence, detector threshold, false-positive test, and white-box access assumption.
Sources
- Yanchen Yin, Dongqi Han, and Linghui Li, Robust Harmful Features Under Jailbreak Attacks: Mechanistic Evidence from Attention Head Specialization in Large Language Models, arXiv:2606.28153v1 [cs.CR], submitted June 26, 2026.
- Primary versions checked: arXiv abstract record, experimental HTML, and PDF.
- Related pages: AI Jailbreaks, Mechanistic Interpretability, The Domain Becomes the Refusal Threshold, and The Intent Label Becomes the Safety Boundary.