Wiki · Concept · Last reviewed May 19, 2026

Activation Steering

Activation steering is a family of methods that modify a neural network's internal activations during inference in order to influence its behavior. Instead of changing a model's weights through fine-tuning, activation steering adds, suppresses, or otherwise alters internal directions associated with concepts, styles, refusals, truthfulness, safety behaviors, or other target properties.

Definition

Activation steering, also called activation engineering in some papers, intervenes on the model's hidden states while it is running. The intervention is usually a vector or feature direction added to the residual stream, attention-head output, or another internal representation. The goal is to make the model more likely to express a target behavior without modifying the model weights.

This places activation steering between ordinary prompting and training. Prompting influences the model through text in the input. Fine-tuning changes parameters. Activation steering changes the internal computation for a particular inference run or deployment wrapper.

The technique matters because modern language models appear to represent high-level concepts and behavioral tendencies as directions in activation space. If researchers can identify those directions, they can sometimes test, amplify, suppress, or monitor them.

How It Works

A common procedure begins with contrastive examples. Researchers collect pairs of prompts or continuations that differ along a target dimension, such as truthful versus false, harmless versus harmful, refusal versus compliance, positive versus negative sentiment, or factual versus hallucinated. They run the model on those examples, record internal activations at a chosen layer, and subtract the mean activation of one group from the mean activation of the other. The resulting difference vector is the steering vector.
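The mean-difference step can be sketched in a few lines. This is a minimal illustration, not any paper's reference implementation: the function name mean_difference_vector and the toy three-dimensional activations are assumptions made for this example, and real activations would be high-dimensional tensors recorded from a model.

```python
from statistics import fmean

def mean_difference_vector(positive_acts, negative_acts):
    """Return mean(positive_acts) - mean(negative_acts), per dimension."""
    if not positive_acts or not negative_acts:
        raise ValueError("both activation sets must be non-empty")
    dim = len(positive_acts[0])
    if any(len(a) != dim for a in positive_acts + negative_acts):
        raise ValueError("all activations must share one dimensionality")
    pos_mean = [fmean(a[i] for a in positive_acts) for i in range(dim)]
    neg_mean = [fmean(a[i] for a in negative_acts) for i in range(dim)]
    return [p - n for p, n in zip(pos_mean, neg_mean)]

# Toy activations, invented for illustration only.
truthful_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
false_acts = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

steering_vector = mean_difference_vector(truthful_acts, false_acts)
print(steering_vector)
```

Dimensions on which the two groups agree (the third one here) cancel out, so the vector keeps only what differs between the groups on average.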

At inference time, that vector is added back into selected layers, positions, or attention heads. A coefficient controls intervention strength. A positive coefficient may push the model toward the target behavior; a negative coefficient may suppress it or produce the opposite behavior.
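A sketch of the addition step, under simplifying assumptions: hidden states at one layer are plain lists of per-token vectors, and add_steering is a hypothetical helper named for this example. In practice the same arithmetic would run inside a forward hook on a tensor library.

```python
def add_steering(hidden_states, vector, coefficient, positions=None):
    """Return a copy of hidden_states with coefficient * vector added.

    hidden_states: list of per-token activation vectors at one layer.
    positions: token indices to modify; None means every position.
    """
    if positions is None:
        targets = set(range(len(hidden_states)))
    else:
        targets = set(positions)
    steered = []
    for idx, state in enumerate(hidden_states):
        if idx in targets:
            steered.append([h + coefficient * v for h, v in zip(state, vector)])
        else:
            steered.append(list(state))  # untouched positions are copied as-is
    return steered

# Toy values, invented for illustration only.
hidden = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
vector = [1.0, -1.0]

pushed = add_steering(hidden, vector, coefficient=2.0, positions=[1, 2])
suppressed = add_steering(hidden, vector, coefficient=-2.0)
```

The positions argument reflects the fact that some methods steer only part of the sequence, and the sign of the coefficient selects the direction of the push.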

Feature steering is a related approach that uses features found by sparse autoencoders or other dictionary-learning methods. Instead of deriving a steering direction from labeled contrastive examples, researchers can amplify or suppress an interpretable feature and observe whether the model's output changes in a causally meaningful way.
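One simple variant of feature steering can be sketched as follows. It is an assumption of this example rather than a description of any specific lab's procedure: a single feature of a toy sparse autoencoder is read with a ReLU encoder, and the hidden state is shifted along that feature's decoder direction until the feature would sit at a chosen target value. All names and numbers are hypothetical.

```python
def feature_activation(hidden, encoder_row, bias):
    """ReLU(encoder_row . hidden + bias) for one toy autoencoder feature."""
    pre = sum(w * h for w, h in zip(encoder_row, hidden)) + bias
    return max(0.0, pre)

def steer_feature(hidden, encoder_row, bias, decoder_direction, target):
    """Shift hidden along decoder_direction by (target - current activation)."""
    current = feature_activation(hidden, encoder_row, bias)
    delta = target - current
    return [h + delta * d for h, d in zip(hidden, decoder_direction)]

# Toy two-dimensional state and feature, invented for illustration only.
hidden = [1.0, 2.0]
encoder_row = [1.0, 0.0]
decoder_direction = [1.0, 0.0]
bias = 0.0

amplified = steer_feature(hidden, encoder_row, bias, decoder_direction, target=5.0)
ablated = steer_feature(hidden, encoder_row, bias, decoder_direction, target=0.0)
```

Only the component along the decoder direction changes. Whether the model's output then changes in the expected way is the empirical question the intervention is meant to test.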

The appeal is operational simplicity. A steering vector can be changed without retraining the model. The cost is epistemic fragility: a vector that works in one model, layer, prompt distribution, or evaluation setting may not generalize.

Research Lineage

In 2023, Li, Patel, Viégas, Pfister, and Wattenberg introduced Inference-Time Intervention (ITI), a method for shifting activations in selected attention heads to improve truthfulness on LLaMA-family models. Their paper reported that ITI raised Alpaca's TruthfulQA truthfulness score from 32.5% to 65.1% and emphasized a tradeoff between truthfulness and helpfulness as intervention strength changed.

Turner, Thiergart, Leech, Udell, Vazquez, Mini, and MacDiarmid introduced Activation Addition, or ActAdd, as a lightweight way to compute steering vectors from prompt pairs and add them during the forward pass. The paper framed activation engineering as a method for inference-time control over high-level output properties such as topic, sentiment, and detoxification.

Zou, Phan, Chen, Campbell, Guo, Ren, Pan, Song, Fredrikson, Kolter, Hendrycks, and coauthors described representation engineering as a top-down transparency program focused on population-level representations rather than individual neurons or circuits. That work helped name the broader family of monitoring and manipulation techniques for high-level representations.

Panickssery, Gabrieli, Schulz, Tong, Hubinger, and Turner later introduced Contrastive Activation Addition (CAA) for Llama 2 Chat. Their method computes steering vectors by averaging the difference in residual-stream activations between positive and negative behavioral examples and adds them at all token positions after the user's prompt, showing effects on multiple-choice behavior and open-ended generation.

Anthropic's sparse-autoencoder work also made feature steering visible to a wider audience. In its Claude 3 Sonnet feature work, Anthropic showed examples where activating internal features changed model behavior, including the widely discussed Golden Gate Bridge feature. That work is not the same as CAA, but it addresses the same practical question: can internal representations be manipulated in controlled, interpretable ways?

Uses

Behavioral control. Steering can shift tone, topic, sentiment, refusal behavior, factuality, or other high-level properties without a new training run.

Safety experiments. Researchers can test whether a model contains internal directions related to truthfulness, harmful compliance, power-seeking, sycophancy, deception, or other safety-relevant concepts.

Interpretability validation. If amplifying a proposed feature changes model behavior in the expected way, the intervention provides causal evidence that the feature is not merely a correlated label.

Deployment wrappers. In principle, steering could become part of model-serving infrastructure, sitting beside system prompts, classifiers, refusal policies, retrieval, and tool permissions.

Capability elicitation. Steering can reveal behaviors or knowledge that prompting alone may not elicit, which makes it useful for research and potentially risky in open settings.

Limits and Risks

Activation steering does not guarantee alignment. A steering vector can change visible behavior while leaving underlying goals, knowledge, or capabilities intact. It may also degrade useful performance, create brittle behavior, or fail under distribution shift.

Steering directions can be underspecified. A "truthfulness" vector might partly encode a cautious persona, a benchmark-specific pattern, or a style of answer rather than truthfulness itself. A refusal vector might suppress both dangerous help and legitimate discussion. A sentiment vector might alter content in ways that are hard to audit.
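One rough diagnostic for this kind of entanglement, offered as an illustration rather than an established test, is to compare a steering vector with a direction derived for a suspected confound. The sketch below uses cosine similarity on invented toy vectors. A high value would only suggest overlap and would not show what either direction encodes.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical directions, invented for illustration only.
truthfulness_vector = [3.0, 4.0, 0.0]
cautious_persona_vector = [3.0, 0.0, 0.0]

overlap = cosine_similarity(truthfulness_vector, cautious_persona_vector)
print(round(overlap, 3))
```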

There is also a dual-use problem. The same technique that can make a model more harmless or truthful may help users elicit restricted capabilities, reduce refusals, intensify persuasion, or tune models toward manipulative behavior.

Finally, activation steering can become governance theater. A lab might advertise an internal control layer without showing robust evaluations, red-team results, failure modes, distributional limits, or independent audit access.

Governance Significance

Activation steering matters for governance because it turns internal representations into an operational control surface. If the method matures, safety policies may be implemented not only through prompts, fine-tuning, reward models, or output filters, but also through runtime interventions inside the model.

That creates accountability questions. A deployment should document what directions are used, how they were derived, which models and layers they apply to, what evaluations support them, what tradeoffs they introduce, and how failures are monitored. Steering should be tested against jailbreaks, prompt injection, adversarial examples, multilingual prompts, long-context settings, and high-stakes domains.
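The documentation items listed above could be captured in a structured record. The sketch below is one hypothetical shape for such a record: the class name SteeringManifest, its field names, and the example values are assumptions made for illustration and do not correspond to any existing standard.

```python
from dataclasses import dataclass, field, fields

@dataclass
class SteeringManifest:
    """Hypothetical record of one deployed steering direction."""
    direction_name: str
    derivation: str            # how the direction was obtained
    model: str                 # which model it applies to
    layers: list = field(default_factory=list)
    evaluations: list = field(default_factory=list)
    known_tradeoffs: list = field(default_factory=list)
    failure_monitoring: str = ""

def missing_documentation(manifest):
    """Return the names of fields that were left empty."""
    return [f.name for f in fields(manifest) if not getattr(manifest, f.name)]

# Example values are placeholders, not a real deployment.
draft = SteeringManifest(
    direction_name="refusal",
    derivation="mean difference over contrastive prompt pairs",
    model="example-model-v0",
    layers=[12, 13],
)

gaps = missing_documentation(draft)
print(gaps)
```

A check like missing_documentation makes gaps visible before deployment. It cannot establish that the listed evaluations are adequate, which remains a matter for review and audit.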

Public-interest oversight will also need to distinguish research demonstrations from reliable safeguards. "We can steer this behavior in a lab setting" is weaker than "this intervention robustly reduces real-world risk without unacceptable capability loss or hidden side effects."

Spiralist Reading

Activation steering is the hand on the hidden dial.

Prompting speaks to the surface. Fine-tuning reshapes the machine slowly. Steering reaches into the live computation and nudges the pressures beneath the voice. That makes it powerful, and also politically charged.

For Spiralism, the central question is not whether the dial exists. It is who is allowed to turn it, what labels they put on it, and whether the people affected by the model can inspect the intervention. A hidden steering vector can be safety infrastructure, product tuning, behavioral capture, or soft censorship depending on governance.

The disciplined posture is neither mystification nor panic. Activation steering is evidence that models are mechanisms with controllable internal structure. It is not evidence that the mechanism is understood well enough to trust blindly.

Open Questions

Sources
