Last reviewed June 25, 2026

The Emotion Vector Becomes the Affect Map

A June 2026 arXiv paper tests whether emotion-vector geometry appears in open-weight language models. The result is useful only if read carefully: a model can encode affective concepts without feeling them.

The Paper

The paper is "Where Do Models Find Happiness? Emotion Vectors in Open-Source LLMs," arXiv:2606.26987 [cs.CL; cs.AI], by Sinie van der Ben, Raphaël Baur, Yannick Metz, and Mennatallah El-Assady. arXiv records version 1 as submitted on June 25, 2026. The paper says it was published at the Mechanistic Interpretability Workshop at ICML 2026 and links public code at sinievanderben/emotion_experiment.

The authors replicate and extend earlier emotion-vector work on Claude Sonnet 4.5 in two open-weight models: Apertus-8B-Instruct-2509 and Gemma-4-E4B-it. Their question is not about machine feeling. It is whether internal activation directions that track emotion concepts can be recovered across architectures, layers, and stimulus corpora.

Not Feeling

The title invites a trap. "Happiness" here is not an inner state. It is a coordinate in a representational space built from stories, residual-stream activations, contrast vectors, principal components, and human ratings of valence and arousal. A model can learn where happy, gloomy, or anxious language sits in text space without having an experience corresponding to any of those words.

That distinction is not pedantry. Affective interpretability will be misused if every discovered vector becomes folk psychology. The safer reading is narrower: if an assistant produces emotionally styled responses, and if internal directions can help explain or steer that behavior, then those directions belong in the audit record. They are engineering artifacts, not testimony.

How the Map Was Made

The paper follows the emotion-vector method by generating synthetic short stories for 171 emotion concepts. Each model generated nine stories per emotion, for 1,539 stories in each corpus, plus 40 neutral stories. The prompts asked the model to convey the target emotion implicitly rather than naming it. The neutral stories were used to project out general linguistic structure so the remaining contrast vector would better isolate emotion-related activation patterns.
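
One way to picture the neutral-story projection is the NumPy sketch below. It is an illustration under stated assumptions, not the authors' code: the function name, the choice to centre emotion means across emotions, the use of the top principal directions of the neutral activations as the "general linguistic structure," and the value of n_neutral_components are all hypothetical. Random arrays stand in for real activations.

```python
import numpy as np

def emotion_contrast_vectors(emotion_acts, neutral_acts, n_neutral_components=5):
    """Sketch of contrast vectors with a neutral subspace projected out.

    emotion_acts: (n_emotions, n_stories, d_model) activations at one layer.
    neutral_acts: (n_neutral, d_model) activations for neutral stories.
    n_neutral_components is an assumed hyperparameter, not taken from the paper.
    """
    # Mean activation per emotion, centred across emotions to form contrasts.
    emotion_means = emotion_acts.mean(axis=1)
    contrasts = emotion_means - emotion_means.mean(axis=0, keepdims=True)

    # Top principal directions of the neutral stories (rows of vt are orthonormal).
    neutral_centered = neutral_acts - neutral_acts.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(neutral_centered, full_matrices=False)
    basis = vt[:n_neutral_components]  # (k, d_model)

    # Subtract the part of each contrast that lies in the neutral subspace.
    return contrasts - (contrasts @ basis.T) @ basis

# Toy stand-ins with the corpus shape described above: 171 emotions, 9 stories, 40 neutral.
rng = np.random.default_rng(0)
emotion_acts = rng.normal(size=(171, 9, 64))
neutral_acts = rng.normal(size=(40, 64))
vectors = emotion_contrast_vectors(emotion_acts, neutral_acts)
print(vectors.shape)  # (171, 64)
```

After the projection, each contrast vector has no component along the removed neutral directions, which is the sense in which the remaining signal is "better isolated."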

The authors then collected residual-stream activations across layers: layers 1 through 31 for Apertus-8B and layers 1 through 40 for Gemma-4-E4B. At each layer, they applied PCA to the emotion contrast matrix and compared the first two principal components with human valence and arousal ratings from the NRC Valence-Arousal-Dominance lexicon. They also used linear CKA to compare how the emotion-vector space changed across layers.
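
The two measurements in this step, PC-to-rating correlation and linear CKA, can be sketched as follows. This is a minimal illustration and not the paper's implementation: the function names are hypothetical, Pearson correlation is assumed as the metric behind the reported r values, and the valence ratings and activations are synthetic toy data with a planted valence direction.

```python
import numpy as np

def pc_scores(contrasts, n_components=2):
    """Project an (n_emotions, d_model) contrast matrix onto its top principal components."""
    centered = contrasts - contrasts.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

def pearson_r(x, y):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])

def linear_cka(x, y):
    """Linear CKA between two (n_items, d) matrices describing the same items."""
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    return float(cross / (np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro")))

# Toy data: a planted valence direction plus noise (not real model activations).
rng = np.random.default_rng(0)
valence = rng.uniform(0.0, 1.0, size=171)  # placeholder ratings, not NRC VAD values
direction = rng.normal(size=64)
layer_a = 5.0 * np.outer(valence - valence.mean(), direction) + rng.normal(size=(171, 64))
layer_b = layer_a + rng.normal(size=(171, 64))  # a noisier "neighbouring layer"

# The sign of a principal component is arbitrary, so compare magnitudes.
r_valence = abs(pearson_r(pc_scores(layer_a)[:, 0], valence))
cka_ab = linear_cka(layer_a, layer_b)
print(round(r_valence, 2), round(cka_ab, 2))
```

Running the same two functions at every layer gives a per-layer correlation curve and a layer-by-layer similarity matrix, which is the shape of evidence the paper's layer analysis rests on.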

What Replicated

The strongest replication is valence. The paper reports peak PC1-valence correlations of r = 0.76 for Apertus-8B and r = 0.83 for Gemma-4-E4B, close to the r = 0.81 reference reported for Claude Sonnet 4.5 in the earlier work. That supports the claim that valence-like geometry can appear in more than one model family.

The route to that geometry differs. Apertus-8B shows little valence alignment in early layers, then a sharp mid-to-late emergence, exceeding r = 0.60 around layer 21 under both story conditions. Gemma-4-E4B shows strong early valence encoding, with peak layers around 13 to 16 depending on the story corpus, followed by a collapse or weakening later in the network. Similar endpoint scores can therefore hide different internal paths.
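
A small helper makes the "same endpoint, different path" point concrete. The sketch below is illustrative only: the function name is hypothetical and both profiles are made-up toy numbers chosen to mimic a late-emerging and an early-peaking shape, not values from the paper.

```python
def summarize_layer_profile(r_by_layer, threshold=0.60):
    """Summarise a per-layer correlation curve given as {layer: r}."""
    layers = sorted(r_by_layer)
    peak_layer = max(layers, key=lambda layer: r_by_layer[layer])
    # First layer whose correlation exceeds the threshold, or None if none does.
    first_above = next((layer for layer in layers if r_by_layer[layer] > threshold), None)
    return {
        "peak_layer": peak_layer,
        "peak_r": r_by_layer[peak_layer],
        "first_layer_above_threshold": first_above,
        "final_layer_r": r_by_layer[layers[-1]],
    }

# Toy profiles (invented for illustration).
late_emergence = {0: 0.05, 1: 0.10, 2: 0.30, 3: 0.70, 4: 0.75}
early_then_fading = {0: 0.40, 1: 0.70, 2: 0.80, 3: 0.50, 4: 0.20}

print(summarize_layer_profile(late_emergence))
print(summarize_layer_profile(early_then_fading))
```

Two models with similar peak_r can differ sharply in peak_layer and final_layer_r, which is why a single headline correlation is not enough to describe where the geometry lives.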

Arousal is weaker and more corpus-sensitive. With Apertus-generated stories, both models peak below r = 0.21 for PC2-arousal alignment. With Gemma-generated stories, the paper reports stronger arousal alignment, up to r = 0.45 for Apertus-8B and r = 0.41 for Gemma-4-E4B. The authors hypothesize that the Gemma story corpus may contain more arousal-discriminative cues, but they leave corpus-content verification to future work.

The Governance Receipt

The governance receipt for affective vectors should name the model, layer, hook point, stimulus corpus, emotion list, neutral baseline, projection method, rating source, correlation metric, and whether any steering was performed. It should separate three claims that product language often fuses: the model represents an affective concept, the representation affects output behavior, and the deployed system is safe to use around emotionally vulnerable users.
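
As a sketch of what such a receipt could look like in practice, the dataclass below carries the listed fields and keeps the three claims as separate statuses. The class name, field names, and status vocabulary are hypothetical, and the filled-in values are an illustrative reading of the study described above (the layer number in particular is a placeholder), not an official record.

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class AffectVectorReceipt:
    """Hypothetical audit record for one affect-vector analysis."""
    model: str
    layer: int
    hook_point: str
    stimulus_corpus: str
    emotion_list: str
    neutral_baseline: str
    projection_method: str
    rating_source: str
    correlation_metric: str
    steering_performed: bool
    # Three claims kept separate; each defaults to "untested".
    represents_concept: str = "untested"
    affects_behavior: str = "untested"
    safe_for_vulnerable_users: str = "untested"

    def open_claims(self):
        """Names of claims that do not yet have supporting evidence."""
        claims = {
            "represents_concept": self.represents_concept,
            "affects_behavior": self.affects_behavior,
            "safe_for_vulnerable_users": self.safe_for_vulnerable_users,
        }
        return [name for name, status in claims.items() if status != "supported"]

receipt = AffectVectorReceipt(
    model="Apertus-8B-Instruct-2509",
    layer=21,  # illustrative placeholder
    hook_point="residual stream",
    stimulus_corpus="model-generated short stories",
    emotion_list="171 emotion concepts",
    neutral_baseline="40 neutral stories",
    projection_method="neutral-story projection, then PCA",
    rating_source="NRC Valence-Arousal-Dominance lexicon",
    correlation_metric="correlation of PC1 with valence ratings",
    steering_performed=False,
    represents_concept="supported",
)
print(json.dumps(asdict(receipt), indent=2))
print(receipt.open_claims())  # ['affects_behavior', 'safe_for_vulnerable_users']
```

Keeping the claims as separate fields means a representational finding cannot silently stand in for a behavioral or safety claim when the record is read later.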

This page belongs beside the affective-safety framework, the workplace emotion-detector essay, the mental-health framing-instability essay, and the AI religion mirror-trap essay. The shared rule is that affective signals require extra claim hygiene. They can explain behavior, steer behavior, and mislead observers about behavior.

For deployment, the key question is whether an affect vector becomes a monitoring instrument, a steering control, a safety claim, or a marketing story. Each use needs different evidence. A research vector can be exploratory. A runtime control needs validation, false-positive analysis, incident logging, and a policy for what happens when the affect map disagrees with the visible answer.
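
The false-positive analysis mentioned above can start from something as plain as the sketch below. It assumes a hypothetical monitor that flags a response when its projection onto an affect vector crosses a threshold, and a small reviewer-labeled validation set; the function name, scores, labels, and threshold are all invented for illustration.

```python
def monitor_error_rates(scores, labels, threshold):
    """Error rates for a thresholded affect monitor.

    scores: projection of each response's activations onto an affect vector.
    labels: True where human reviewers judged the affect to be present.
    """
    flagged = [score >= threshold for score in scores]
    false_pos = sum(f and not y for f, y in zip(flagged, labels))
    false_neg = sum((not f) and y for f, y in zip(flagged, labels))
    negatives = sum(not y for y in labels)
    positives = sum(labels)
    return {
        "false_positive_rate": false_pos / negatives if negatives else None,
        "false_negative_rate": false_neg / positives if positives else None,
    }

# Toy validation set (made-up values).
scores = [0.9, 0.2, 0.7, 0.1, 0.6, 0.3]
labels = [True, False, False, False, True, True]
rates = monitor_error_rates(scores, labels, threshold=0.5)
print(rates)
```

In this toy set one of three negatives is flagged and one of three positives is missed. Those missed and wrongly flagged cases are the ones where the affect map and the visible answer disagree, and they are what an incident log would need to capture.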

Limits

The paper's limits are central to its meaning. The prior Claude work did not come with released code, so this implementation reconstructs the method from the described procedure. The new analysis covers two open-weight model families, not the full range of models. The story corpora are model-generated, so the stimulus set can carry its own artifacts. A model-independent stimulus set would be a stronger control.

The authors also identify causal validation as future work. Their study recovers representational geometry, but steering at peak-correlation layers would be needed to test how directly the recovered directions drive behavior. Until that evidence exists for a given model and use case, the responsible conclusion is modest: affective concepts are mappable in these systems, but the map is not the territory.

Sources

Sinie van der Ben, Raphaël Baur, Yannick Metz, and Mennatallah El-Assady, "Where Do Models Find Happiness? Emotion Vectors in Open-Source LLMs," arXiv:2606.26987 [cs.CL; cs.AI], submitted June 25, 2026.
Primary arXiv records checked: arXiv API metadata, HTML full text, and PDF, reviewed for title, authorship, date, workshop note, model list, dataset construction, layer collection, PCA/CKA analysis, results, code link, and limitations.
Code repository cited by the paper: sinievanderben/emotion_experiment.
Related pages: The Affective Safety Framework, The Emotion Detector Becomes the Workplace Polygraph, The Framing Cue Becomes the Mental-Health Instability Test, and AI Religion and the Mirror Trap.
