Blog · arXiv Analysis · Last reviewed June 25, 2026

The Model's Own Answer Becomes the Confidence Bias

The confidence number is not only about the answer. It also depends on whether the chat transcript tells the model that the answer came from itself.

Confidence Is Not Portable

The problem is that confidence is often treated as portable. It is not. The same answer can travel through different prompts, roles, templates, and elicitation methods. If a confidence estimate changes because the transcript says "assistant" instead of "user," then the number is partly a product of interface framing. That makes confidence a governance artifact, not a neutral property of the answer.

The Paper

arXiv lists Large Language Models Are Overconfident in Their Own Responses as arXiv:2606.03437v1 [cs.CL], submitted June 2, 2026. The authors are Mario Sanz-Guerrero, Manuel Mager, and Katharina von der Wense. The paper lists Johannes Gutenberg University Mainz and the University of Colorado Boulder as affiliations.

The paper asks why instruction-tuned conversational LLMs are often less calibrated than base models. Calibration is the relationship between stated confidence and observed correctness: if a system says "80 percent confident" across many cases, roughly 80 percent of those cases should be right. The authors measure that relationship with Expected Calibration Error, or ECE, and Brier score.

What Was Tested

The main experiment separates three settings: a base pre-trained model without instruction tuning, an instruction-tuned model without the chat template, and an instruction-tuned model with the chat template. That split matters because many deployments do not expose a raw model; they expose a conversation protocol with roles, wrappers, and message formatting.

The tested open-weight families were Llama 3.1 at 8B and 70B, Qwen3 at 4B and 30B, and Gemma 3 at 4B and 27B. The first calibration comparison used MMLU. The paper also uses three confidence elicitation methods: P(True), verbalized percentage confidence, and verbalized linguistic confidence.

The headline finding is not that instruction tuning is useless. On MMLU, instruction tuning increased average accuracy by 3.7 percent. The cost was calibration: ECE increased by 13.1 percent and Brier score by 6.5 percent. Adding the chat template brought another 1.1 percent average accuracy gain, but also added 2.74 percent ECE and 1.5 percent Brier score. Taken together, instruction tuning plus chat formatting increased ECE by 15.8 percent compared with the base setting.

Ownership Bias

The paper's sharpest move is to ask whether the model is more confident because it recognizes the content of a good answer, or because the transcript frames that answer as its own. The authors present a possible answer either as an assistant message or as a user message, then ask for confidence. The answer content can be identical. The owner field changes.

Across the MMLU ownership test, the reported deltas are positive: assistant-framed answers produce worse calibration than user-framed answers. For P(True), the average difference is 9.8 percent ECE and 8.8 percent Brier score. Linguistic confidence produces the largest gap, with average differences above 25 percent in both metrics. Raw confidence also rises, with an average P(True) increase of 15.8 percent and an average linguistic increase of 26.8 percent.

This is not ordinary sycophancy. If the dominant effect were deference to the user, the user-framed answer would receive the higher confidence. The observed pattern points the other way: the model treats its own produced answer as more trustworthy than the same answer placed in the user's mouth.

The Repair

The proposed mitigation is small enough to be uncomfortable. During confidence elicitation, present the model's candidate answer as if it came from the user rather than as the assistant's own prior response. The authors report that this inference-time strategy reduces overconfidence and can improve calibration by up to 26 percent without retraining.

That does not make the confidence number pure. It makes the dependency visible. A confidence score should carry its prompt role, chat template, and elicitation method with it. A score produced by P(True) in a user-framed check is not the same artifact as a verbalized percentage emitted after the model has just answered in the assistant role.

Governance Reading

This page belongs beside confidence calibration, sequence probability confidence traps, router confidence shifts, and uncertainty handoff budgets. The fresh lesson is that a model's confidence can be shaped by apparent answer ownership.

For safety and governance, that means confidence displays need provenance. A medical triage assistant, code repair agent, cyber defense workflow, or customer-support escalation system should not pass around a naked confidence value. The audit question is not "what confidence did the model report?" The audit question is "under which role framing, chat template, model state, benchmark, answer source, and elicitation method was this confidence produced?"

Limits

The result should not be overgeneralized. The main evidence is built around controlled question-answering evaluations, especially MMLU for the core multiple-choice ownership analysis. The authors also note that most experiments focus on open-weight LLMs because broader closed-model testing is costly. A production system with retrieval, tools, policy layers, or multi-agent routing can add new failure modes that this paper does not settle.

The paper also does not say that confidence elicitation is hopeless. It says the elicitation protocol is part of the measurement. That is a narrower and more useful claim: calibration work has to include the transcript frame, not only the model weights and the answer text.

Confidence Receipt

A confidence receipt should record: model version, tuning status, chat template, system prompt, answer source, assistant-or-user framing, elicitation method, benchmark, accuracy, ECE, Brier score, confidence distribution, and mitigation status. The audit-grade sentence is: this confidence number was produced under these conditions and should not be reused after the prompt role changes.

Sources

Mario Sanz-Guerrero, Manuel Mager, and Katharina von der Wense, Large Language Models Are Overconfident in Their Own Responses, arXiv:2606.03437v1 [cs.CL], submitted June 2, 2026.
Primary arXiv versions checked: PDF and experimental HTML, reviewed for title, authorship, date, affiliations, model list, MMLU setup, confidence elicitation methods, ECE and Brier score definitions, ownership-bias deltas, mitigation description, and limitations.
Related pages: Confidence Calibration, The Sequence Probability Becomes the Confidence Trap, The MoE Router Becomes the Confidence Shift, and The GUI Becomes the Uncertainty Handoff Budget.

Return to Blog