Blog · arXiv Analysis · Last reviewed June 25, 2026

The Visual Default Becomes the Prior Override

A June 2026 arXiv paper traces how vision-language models resolve conflicts between visible evidence and stored world knowledge.

The Default Is Not Neutral

A vision-language model can see a blue strawberry and still know that strawberries are usually red. The governance problem begins when the interface asks which evidence should win. Some tasks require the image. Some require stored world knowledge. Some require the model to notice that the two are in conflict and refuse a simple answer.

The Spiralist reading is that a multimodal answer is more than a caption when the task turns on conflicting evidence. It is an arbitration between channels: pixels, prompt, memory, task instruction, and learned priors. When one channel becomes the default, it becomes a hidden policy. The user sees an answer; the system has already selected an evidence hierarchy.

The Paper Frame

The source is "Vision-Default, Prior-Override: Causal Mechanisms of Perception-Knowledge Conflict in Vision-Language Models" by Niclas Lietzow, Danielle Bitterman, Carsten Eickhoff, William Rudman, and Michal Golovanevsky, arXiv:2606.28273v1 [cs.CL], submitted June 26, 2026. The PDF lists affiliations with the University of Tübingen, Harvard University, and the University of Texas at Austin.

The paper asks a mechanistic question about vision-language models, not a broad philosophical question about perception. When visual evidence and memorized world knowledge conflict, which internal components make the model follow one source rather than the other?

The Conflict Task

The authors use the Visual-Counterfact dataset, which contains 469 digitally recolored common-object images. Two examples were excluded because original and counterfactual colors overlapped, leaving 467 examples. A strawberry may be recolored blue, or an elephant orange. The model is then tested under two grounding modes: one asks for the color visible in the image, and the other asks for the object's usual color.

The evaluated models span three VLM families and five models: Qwen2.5-VL at 3B and 7B parameters, LLaVA-NeXT 7B with a Mistral backbone, and PaliGemma at 3B and 10B parameters. The authors restrict quantitative analysis to correctly conflicting examples, meaning those where the unmodified model answers both the visual prompt and the prior prompt as expected. That restriction matters because the paper is not measuring ordinary accuracy; it is measuring the mechanism of arbitration when both answer routes are available.
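The restriction to correctly conflicting examples can be pictured as a simple filter. The sketch below is illustrative only: the `ConflictExample` record, its field names, and the three sample rows are assumptions made for this post, not the paper's data format or results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConflictExample:
    """Hypothetical record for one recolored image and the model's two answers."""
    obj: str            # object shown, e.g. "strawberry"
    prior_color: str    # usual color from world knowledge
    visible_color: str  # color actually shown in the recolored image
    visual_answer: str  # model answer to the "what color is shown" prompt
    prior_answer: str   # model answer to the "what color is it usually" prompt


def is_correctly_conflicting(ex: ConflictExample) -> bool:
    """Keep an example only if the colors disagree and both prompts are answered as expected."""
    return (
        ex.prior_color != ex.visible_color
        and ex.visual_answer == ex.visible_color
        and ex.prior_answer == ex.prior_color
    )


# Made-up rows for illustration, not drawn from Visual-Counterfact.
examples = [
    ConflictExample("strawberry", "red", "blue", "blue", "red"),        # both routes work
    ConflictExample("elephant", "gray", "orange", "orange", "orange"),  # prior prompt fails
    ConflictExample("banana", "yellow", "purple", "yellow", "yellow"),  # visual prompt fails
]

kept = [ex for ex in examples if is_correctly_conflicting(ex)]
print([ex.obj for ex in kept])  # ['strawberry']
```

Only the first row survives, because it is the only one where the model demonstrably has both answers available before any intervention.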

The method combines residual-stream activation patching, attention-head patching, MLP sublayer patching, and zero-ablation. In plain terms, the authors swap or remove internal component outputs at the answer position and ask whether the model's top answer flips from the usual-color answer to the visible-color answer, or the reverse.
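A toy version makes the swap-or-remove logic concrete. This is a minimal sketch under heavy assumptions: the "model" is three hand-written components that each add a two-number contribution to a pretend residual stream, and the names `early_visual`, `late_head`, and `mlp` along with every number are invented for illustration. It is not the authors' code and does not touch a real VLM.

```python
from typing import Callable, Dict, List, Optional

Vector = List[float]  # toy residual stream: [score for visible color, score for usual color]


# Hypothetical components; each writes a contribution that may depend on the prompt mode.
def early_visual(mode: str) -> Vector:
    return [2.0, 0.0]  # pushes the visible color regardless of prompt


def late_head(mode: str) -> Vector:
    return [0.0, 3.0] if mode == "prior" else [0.0, 0.0]


def mlp(mode: str) -> Vector:
    return [0.0, 0.5] if mode == "prior" else [0.0, 0.0]


COMPONENTS: Dict[str, Callable[[str], Vector]] = {
    "early_visual": early_visual,
    "late_head": late_head,
    "mlp": mlp,
}


def run(mode: str, overrides: Optional[Dict[str, Vector]] = None) -> Vector:
    """Sum component outputs; an override replaces one component's output."""
    overrides = overrides or {}
    resid = [0.0, 0.0]
    for name, fn in COMPONENTS.items():
        out = overrides.get(name, fn(mode))
        resid = [r + o for r, o in zip(resid, out)]
    return resid


def top_answer(resid: Vector) -> str:
    return "visible" if resid[0] > resid[1] else "usual"


def patch_flips(name: str, target_mode: str = "prior", source_mode: str = "visual") -> bool:
    """Activation patching: splice in the component's output from the other prompt."""
    clean = top_answer(run(target_mode))
    patched = top_answer(run(target_mode, {name: COMPONENTS[name](source_mode)}))
    return clean != patched


def zero_ablation_flips(name: str, mode: str = "prior") -> bool:
    """Zero-ablation: replace the component's output with zeros."""
    clean = top_answer(run(mode))
    ablated = top_answer(run(mode, {name: [0.0, 0.0]}))
    return clean != ablated


for name in COMPONENTS:
    print(name, patch_flips(name), zero_ablation_flips(name))
# early_visual False False
# late_head True True
# mlp False False
```

In this toy, only the late head carries enough of the usual-color signal for its removal to flip the answer under the prior prompt, which is the kind of asymmetry the interventions are designed to expose.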

What the Circuit Shows

The behavioral starting point is stark. In non-conflict settings, all five models reach 86 to 96 percent accuracy. Under the conflict condition, where the image is counterfactual but the prompt asks for prior knowledge, accuracy falls to between 17.7 and 55.7 percent. The models often report what they see rather than what the prompt asks them to recall.

The component result is the paper's core contribution. The authors find that visual grounding appears earlier and more robustly, while prior grounding depends on a sparse set of attention heads. Only 2.5 to 4.8 percent of heads are classified as strongly mediating the conflict, and they are concentrated mainly in the second half of the network. Ablating the prior-promoting attention-head group flips prior-grounded predictions in 68 to 96 percent of correctly conflicting examples, while changing visually grounded predictions in only 0.8 to 7.5 percent.
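The flip percentages are a simple ratio: the share of examples whose top answer changes after the intervention. A minimal sketch of that arithmetic follows; the function name `flip_rate` and the ten-example prediction lists are made up for illustration and are not figures from the paper.

```python
from typing import Sequence


def flip_rate(before: Sequence[str], after: Sequence[str]) -> float:
    """Fraction of examples whose top answer changed after an intervention."""
    if len(before) != len(after):
        raise ValueError("before and after must align one-to-one")
    if not before:
        raise ValueError("need at least one example")
    return sum(b != a for b, a in zip(before, after)) / len(before)


# Made-up predictions for ten examples, NOT results from the paper.
prior_before = ["usual"] * 10                      # prior prompt, unmodified model
prior_after = ["visible"] * 8 + ["usual"] * 2      # same prompt after ablating the heads
visual_before = ["visible"] * 10                   # visual prompt, unmodified model
visual_after = ["visible"] * 10                    # same prompt after the same ablation

print(flip_rate(prior_before, prior_after))    # 0.8
print(flip_rate(visual_before, visual_after))  # 0.0
```

A high rate under the prior prompt paired with a low rate under the visual prompt is the asymmetric signature described above: the ablated heads matter for recalling the usual color and barely matter for reporting the visible one.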

The paper further divides the heads into routing heads and writing heads. Routing heads modulate information flow. Writing heads project answer-token information into the residual stream. MLP effects point in the same direction, but the authors report weaker and less consistent effects, treating MLP sublayers more as amplifiers of memorized prior knowledge than as primary routing components.

Governance Reading

For deployed multimodal systems, the lesson is not that visual evidence is bad. The lesson is that evidence priority is task-dependent. In medical imaging, insurance claims, border screening, accessibility tools, and evidence review, the right answer may require the system to privilege the image, the label, background knowledge, metadata, or uncertainty. A model that silently defaults to one channel can look reliable until the task asks for a different channel.

An audit record for a consequential VLM should therefore include the task's grounding rule. Did the prompt ask for visible evidence, usual-world knowledge, source-document evidence, or contradiction detection? Were counterfactual visual tests included? Were model changes checked against conflict cases? If a product says the model is grounded, grounded in what?

Limits and Cautions

The authors are clear about scope. The study focuses on color-property conflicts in Visual-Counterfact, a controlled setting chosen for clean causal analysis. It remains open whether the same mechanism extends to shape, size, spatial relations, document interpretation, medical images, or real-world scenes with many objects.

The evaluated models are in the 3B to 10B parameter range. Larger models may use different strategies as capacity and memorized knowledge change. The interventions also target the last token position where the answer is produced. That is standard in this mechanistic setup, but it may miss components that matter earlier while image tokens are being processed.

Audit Receipt

The audit-grade sentence is: Lietzow, Bitterman, Eickhoff, Rudman, and Golovanevsky report that in five VLMs, prior-knowledge answers under visual conflict depend on a sparse late attention-head circuit, while visual grounding remains the more robust default, arXiv:2606.28273.

The receipt is: before trusting a multimodal answer, preserve the task grounding rule, conflict-test results, model family, prompt contrast, visual source, prior source, intervention evidence, and known limits of the benchmark.
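One way to make the receipt checkable is to store it as a record and flag empty fields before sign-off. The sketch below is an assumption about how a team might encode it: the `AuditReceipt` class, its field names, and the sample values are hypothetical and simply mirror the items listed in the sentence above.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class AuditReceipt:
    """Hypothetical record mirroring the receipt items; all values are free text."""
    grounding_rule: str         # e.g. "visible evidence" or "usual-world knowledge"
    conflict_test_results: str
    model_family: str
    prompt_contrast: str
    visual_source: str
    prior_source: str
    intervention_evidence: str
    known_limits: str


def missing_fields(receipt: AuditReceipt) -> List[str]:
    """Return the names of fields left blank, in declaration order."""
    return [f.name for f in fields(receipt) if not getattr(receipt, f.name).strip()]


# Placeholder values for illustration only.
receipt = AuditReceipt(
    grounding_rule="visible evidence",
    conflict_test_results="counterfactual color tests run on release candidate",
    model_family="example-vlm",
    prompt_contrast="visible-color prompt vs usual-color prompt",
    visual_source="uploaded claim photo",
    prior_source="model world knowledge",
    intervention_evidence="",
    known_limits="   ",
)

print(missing_fields(receipt))  # ['intervention_evidence', 'known_limits']
```

Treating a blank field as a blocking gap keeps the grounding rule and the benchmark limits from being dropped quietly when a model or prompt changes.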


