Blog · arXiv Analysis · Last reviewed June 24, 2026

The Multimodal Evidence Order Becomes the Answer

The June 2026 arXiv paper "Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models," by Akshay Paruchuri, Sanmi Koyejo, and Ehsan Adeli, asks whether multimodal models give the same answer when the same evidence arrives in a different order.

Evidence Is Not a Queue

The paper, arXiv:2606.26079v1 [cs.CL], was submitted on June 24, 2026. Its premise is operational: if two prompts contain the same evidence, and order should not matter, the model should not change its answer merely because the evidence was permuted.

That sounds like a benchmark detail until it is placed inside a real system. In retrieval-augmented generation, evidence order can be set by a retriever. In agentic systems, it can be set by tool scheduling. In multimodal interfaces, it can be set by upload order, camera ordering, document rank, or an adapter. The user may never see that ordering decision, but the model does.

The Spiralist point is that "same evidence" is not automatically the same input. A benchmark that scores one canonical presentation can report capability while missing presentation fragility. The paper treats that fragility as measurable: run semantically equivalent orderings, normalize the answer, and ask whether the output changes.
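The first step of that procedure, producing a handful of equivalent orderings for one item, can be sketched in Python. This is a minimal illustration, not the paper's harness: `sample_orderings` and `apply_ordering` are hypothetical helpers, and keeping the canonical ordering as the first sample is an assumption made here.

```python
import itertools
import math
import random


def sample_orderings(n, k=6, seed=0):
    """Return up to k distinct orderings of n items as index tuples.

    Assumption for this sketch: the canonical ordering (0, 1, ..., n-1)
    is always the first sample.
    """
    identity = tuple(range(n))
    if math.factorial(n) <= k:
        # Small items: enumerate every ordering (identity comes first).
        return list(itertools.permutations(range(n)))
    rng = random.Random(seed)
    chosen = [identity]
    seen = {identity}
    while len(chosen) < k:
        perm = list(identity)
        rng.shuffle(perm)
        perm = tuple(perm)
        if perm not in seen:  # keep the sampled orderings distinct
            seen.add(perm)
            chosen.append(perm)
    return chosen


def apply_ordering(items, order):
    """Rearrange items so that position i shows items[order[i]]."""
    return [items[i] for i in order]
```

Each sampled ordering would then be rendered into a prompt and sent to the model. Only the presentation differs between calls, which is the whole point of the test.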

What Facet-Probe Tests

Paruchuri, Koyejo, and Adeli introduce Facet-Probe, a five-facet audit of order sensitivity in multimodal large language models. The facets are option order, evidence-chunk order, document-rank order, image-set order, and mixed-modality order. The audit covers 18 models: six frontier closed-source systems and 12 open-weight systems from the Qwen, InternVL, Kimi, and MedGemma families, as reported by the authors.

The paper says closed-source models were served between May 4 and May 25, 2026, and treats that access window as a reproducibility anchor. The panel spans 12 datasets across multiple-choice reasoning, multihop question answering, RAG, multi-image visual question answering, and free-form mixed-modality RAG, for more than 400,000 reported trials.

The operating point is K=6 orderings per item. For option-order and image-set tests, scoring maps the chosen label or image-position answer back to source content, so a flip means the selected content changed. For mixed-modality free-form outputs, the paper uses LLM-judge scoring and flags that measurement layer as a caveat.
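The content-mapping step for option-order tests can be illustrated with a short Python sketch. The function names and the letter-label convention are assumptions for illustration. The paper's actual scoring code is not reproduced here.

```python
import random


def permute_options(options, seed):
    """Shuffle answer options.

    Returns (shuffled_options, order), where order[i] is the index in the
    source item of the option shown at position i.
    """
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    return [options[i] for i in order], order


def label_to_source_index(label, order):
    """Map a chosen letter label (assumed 'A', 'B', ...) to the source index."""
    position = ord(label.upper()) - ord("A")
    return order[position]


# Hypothetical item: whatever position "warfarin" lands in, the mapped
# answer points back to the same source content.
options = ["aspirin", "ibuprofen", "warfarin", "heparin"]
shuffled, order = permute_options(options, seed=1)
label = "ABCD"[shuffled.index("warfarin")]
assert options[label_to_source_index(label, order)] == "warfarin"
```

Comparing mapped content rather than raw labels is what makes a flip meaningful: a model that answers "B" under one ordering and "D" under another has not flipped if both labels point to the same option.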

Flip Rate as a Record

The central metric is cross-ordering flip rate. At item level, a flip occurs when normalized answers across the sampled orderings are not all the same. This is intentionally blunt. It asks the deployment question: can the answer change when order changes?
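As a rough sketch of that definition, the Python below computes an item-level flip and a flip rate over a set of items. The `normalize` function is a placeholder assumption (lowercasing and whitespace cleanup). The paper's normalization is task-specific and is not reproduced here.

```python
def normalize(answer):
    """Placeholder normalization: trim, lowercase, collapse whitespace."""
    return " ".join(answer.strip().lower().split())


def item_flips(answers):
    """True if the normalized answers across orderings are not all the same."""
    return len({normalize(a) for a in answers}) > 1


def flip_rate(items):
    """Fraction of items whose answers change across the sampled orderings.

    `items` is a list of answer lists, one answer per sampled ordering.
    """
    if not items:
        raise ValueError("flip_rate needs at least one item")
    return sum(item_flips(answers) for answers in items) / len(items)


# Made-up answers for four items, three orderings each.
items = [
    ["B", "b ", "B"],       # stable after normalization
    ["A", "C", "A"],        # flips
    ["yes", "Yes", "YES"],  # stable after normalization
    ["4", "4", "5"],        # flips
]
print(flip_rate(items))  # 0.5
```

The metric deliberately ignores which answer is correct. An item counts as flipped even if most orderings agree, which is why it reads as a deployment question rather than an accuracy score.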

The paper reports that none of the 18 audited models is order-invariant. Screened per-facet panel-mean flip rates at K=6 range from 0.24 to 0.50: 0.50 for mixed-modality order, 0.41 for evidence-chunk order, 0.36 for option order, 0.26 for document-rank order, and 0.24 for screened image-set order. A same-ordering control repeats the canonical ordering to estimate a decoder-noise floor, but that control is scoped to a Gemini subset at temperature 0, not the full panel.

This connects directly to existing site concerns about evaluation artifacts, source-aware factuality, agent reliability scoring, and AI audit trails. A score is not only a number. It is a record of what was tested, what was assumed invariant, and which perturbations were allowed to disturb the result.

Capability Is Not Invariance

Capability helps, but it does not close the problem. The paper reports a strong descriptive relationship between model capability and lower mean K=6 flip rate across the 18-model panel, while warning that the models are not independent samples and that capability is recovered from the same trial outcomes used for flip analysis. The key operational fact is simpler: the best model in the audit still flips on 13.4% of trials.

The mechanism analysis is deliberately cautious. The authors sample flipped items and ask a Gemini-Pro judge to classify failure modes. Content rationalization is the dominant primary-judge label in non-dialog categorical cells, but the paper treats mechanism labels as diagnostic rather than causal. It also marks an early image-set mechanism row as exploratory because that row was sampled before a position-reference screen.

Mitigation as Escalation

The mitigation section is mostly a warning against cheap comfort. A K=2 disagreement screen can query two orderings and abstain or escalate when they disagree. The paper reports that this is the only non-baseline deployable Pareto improvement among the tested policies, with +0.019 selective accuracy at cost 2 across 26 screened cells. The gain depends on treating abstained items as routed elsewhere.
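A disagreement screen of this kind is simple to sketch. The Python below is an illustration under stated assumptions, not the paper's policy code: the second ordering is taken to be the reverse of the first, `ask` stands in for any model call, and selective accuracy is computed as accuracy over the items the screen did not abstain on.

```python
from typing import Callable, List, Optional, Sequence


def normalize(answer: str) -> str:
    """Placeholder normalization: trim, lowercase, collapse whitespace."""
    return " ".join(answer.strip().lower().split())


def k2_screen(evidence: Sequence[str],
              ask: Callable[[List[str]], str]) -> Optional[str]:
    """Query two orderings; return the answer if they agree, else None.

    None means abstain or escalate. Cost is two model calls per item.
    Assumption: the second ordering is the reverse of the first.
    """
    first = ask(list(evidence))
    second = ask(list(reversed(evidence)))
    if normalize(first) == normalize(second):
        return first
    return None


def selective_accuracy(predictions: Sequence[Optional[str]],
                       gold: Sequence[str]) -> float:
    """Accuracy over answered items only; abstentions (None) are excluded."""
    answered = [(p, g) for p, g in zip(predictions, gold) if p is not None]
    if not answered:
        return 0.0
    return sum(normalize(p) == normalize(g) for p, g in answered) / len(answered)


# Toy stand-ins for a model: one echoes the first evidence item
# (order-sensitive), the other picks the alphabetically smallest.
print(k2_screen(["a", "b"], lambda ev: ev[0]))    # None: the orderings disagree
print(k2_screen(["a", "b"], lambda ev: min(ev)))  # a
```

Because abstained items drop out of the denominator, the reported gain only holds if something else actually handles them. A screen with no escalation path behind it just hides the disagreement.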

Prompt transformations are more fragile. Canonicalize-then-Answer reduces flips for Gemini Pro in one hard text evidence-chunk setting, from 0.30 to 0.18 on MedXpertQA, but it has no effect or is slightly worse for Gemini Flash in the same setting and does not reduce flips in the tested visual MathVision cells. The paper's conclusion is narrow: prompt-level mitigation is modality-conditional and does not compose reliably.

That makes mitigation a workflow problem. The useful control is not a magic prompt. It is a record that says which orderings were tried, whether they agreed, which disagreements escalated, and whether production used the same ordering policy that evaluation used.

Limits That Matter

The paper's own limitations are important. The same-ordering control is not panel-wide. K=6 samples only part of the possible permutation space when there are more than six unique orderings. Temperature 0 API calls are not guaranteed deterministic. Dataset coverage excludes many open-ended, agentic, non-English, and live user distributions. Mixed-modality measurements use an LLM judge, and the paper warns that judge-corrected statistics are not identical to judge-free correctness flips.
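The permutation-space limit is easy to quantify. The sketch below assumes every evidence item is distinct, so n items have n! unique orderings. `ordering_coverage` is a hypothetical helper for illustration.

```python
import math


def ordering_coverage(n_items, k=6):
    """Return (sampled, total, fraction) for k orderings of n distinct items."""
    total = math.factorial(n_items)
    sampled = min(k, total)  # cannot sample more orderings than exist
    return sampled, total, sampled / total


for n in (3, 4, 5, 8):
    sampled, total, fraction = ordering_coverage(n)
    print(f"{n} items: {sampled} of {total} orderings ({fraction:.2%})")
```

Three items are covered exhaustively at K=6, but four items already have 24 orderings and five have 120. A measured flip rate on longer evidence lists is therefore a lower bound on how often some ordering would change the answer.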

Those limits do not weaken the governance lesson. They define the boundary of the claim. This is not proof that every future multimodal model will be order-sensitive in the same way. It is evidence that order robustness needs its own measurement axis before a benchmark score is used as a deployment warrant.

Governance Standard

A consequential multimodal deployment should publish order assumptions as part of its evaluation file. That file should name the model and serving window, task suite, ordering facets, K, decoding policy, prompt family, scoring method, answer normalization, judge use if any, same-ordering control if any, and escalation rule for cross-order disagreement.
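One way to make that file concrete is a typed record with a completeness check. The Python sketch below is hypothetical: the class name, field names, and example values are illustrative choices that mirror the list above, not a published schema.

```python
from dataclasses import dataclass, fields
from typing import Optional, Tuple


@dataclass(frozen=True)
class OrderEvalRecord:
    """Hypothetical evaluation-file entry for order-sensitivity testing."""
    model: str
    serving_window: str
    task_suite: str
    ordering_facets: Tuple[str, ...]
    k: int
    decoding_policy: str
    prompt_family: str
    scoring_method: str
    answer_normalization: str
    escalation_rule: str
    judge: Optional[str] = None                  # None means no LLM judge used
    same_ordering_control: Optional[str] = None  # None means no control run


def missing_fields(record):
    """List required fields left empty, in declaration order."""
    optional = ("judge", "same_ordering_control")
    missing = []
    for f in fields(record):
        if f.name in optional:
            continue
        value = getattr(record, f.name)
        if value in ("", ()) or (f.name == "k" and value < 1):
            missing.append(f.name)
    return missing


# Placeholder values only; none of these describe a real deployment.
record = OrderEvalRecord(
    model="example-model",
    serving_window="2026-05-04 to 2026-05-25",
    task_suite="example-suite",
    ordering_facets=("option", "evidence-chunk"),
    k=6,
    decoding_policy="temperature 0",
    prompt_family="example-prompts-v1",
    scoring_method="content-mapped exact match",
    answer_normalization="lowercase, strip whitespace",
    escalation_rule="abstain on cross-order disagreement",
)
print(missing_fields(record))  # []
```

The optional fields are allowed to be empty because the standard says "if any". What the check refuses is silence on the fields every order audit has to state.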

For RAG and agent systems, the deployment record should also preserve retriever rank, chunk order, document order, image order, upload order, tool-result order, and any canonicalization layer that rewrites those orders before the model sees them. Otherwise the institution cannot tell whether a failure came from the model, retriever, adapter, or hidden ordering convention.
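A lightweight way to preserve those orders is to log, at each pipeline stage, both an order-sensitive and an order-insensitive fingerprint of the evidence identifiers. The Python sketch below is one possible design, with hypothetical stage names and IDs. It uses only the standard library.

```python
import hashlib
import json


def fingerprint(ids, ordered=True):
    """SHA-256 over a list of evidence IDs.

    ordered=True hashes the sequence as given. ordered=False sorts first,
    so any permutation of the same IDs yields the same hash.
    """
    sequence = list(ids) if ordered else sorted(ids)
    payload = json.dumps(sequence, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def order_trace(stage_orders):
    """Build a log entry per stage from {stage_name: [evidence ids]}."""
    return {
        stage: {
            "ids": list(ids),
            "ordered_hash": fingerprint(ids),
            "content_hash": fingerprint(ids, ordered=False),
        }
        for stage, ids in stage_orders.items()
    }


# Hypothetical pipeline: an adapter reorders what the retriever returned.
trace = order_trace({
    "retriever": ["doc-3", "doc-1", "doc-2"],
    "adapter": ["doc-1", "doc-2", "doc-3"],
})
same_content = trace["retriever"]["content_hash"] == trace["adapter"]["content_hash"]
same_order = trace["retriever"]["ordered_hash"] == trace["adapter"]["ordered_hash"]
print(same_content, same_order)  # True False
```

Matching content hashes with differing ordered hashes is the signature of a hidden reordering step, which is exactly the case where a failure cannot otherwise be attributed.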

The practical rule is simple: if order is supposed to be irrelevant, test that assumption directly. Do not let one canonical ordering stand in for reliability. In multimodal systems, the evidence order can become the answer.
