Last reviewed June 25, 2026

The Sequence Probability Becomes the Confidence Trap

Johannes Zenn and Jonas Geiping's June 2026 arXiv paper shows why model likelihood is a useful audit signal in some settings and a bad confidence proxy in others.

Not Confidence

The paper, arXiv:2606.27359 [stat.ML], was submitted on June 25, 2026. arXiv lists the title as "When are likely answers right? On Sequence Probability and Correctness in LLMs," by Johannes Zenn and Jonas Geiping.

This page is not a decoding recommendation and not a claim that log probability is useless. It is a governance reading of one empirical paper about when sequence probability can support evaluation, and when treating it as confidence turns into a trap.

The Paper Frame

Large language models assign probabilities to token sequences. Many inference methods can be read as attempts to push generation toward more likely continuations: lower-temperature sampling suppresses unlikely next tokens, top-k and top-p truncate the local distribution, beam search and best-of-N chase high-probability sequences, and power-sampling methods bias full-sequence distributions toward higher-probability regions.
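To make the local mechanics concrete, here is a minimal sketch of temperature scaling, top-k, and top-p on a toy next-token distribution. The logits and function names are illustrative assumptions, not taken from the paper, and real samplers operate on full vocabularies with batched tensors.

```python
import math

def apply_temperature(logits, temperature):
    """Softmax over logits / temperature; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero the rest, and renormalize."""
    keep = set(sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k])
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose mass reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = set(), 0.0
    for i in order:
        keep.add(i)
        mass += probs[i]
        if mass >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

# Toy logits for a four-token vocabulary (hypothetical values).
logits = [2.0, 1.0, 0.5, -1.0]
base = apply_temperature(logits, 1.0)
sharp = apply_temperature(logits, 0.5)
top2 = top_k_filter(base, 2)
nucleus = top_p_filter(base, 0.9)
```

All three transformations move probability mass toward tokens the model already favors. None of them consults any notion of correctness, which is why the link between likelihood and correctness has to be tested rather than assumed.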

The obvious hope is that more likely continuations are more correct. The paper tests that hope at four levels: across decoding methods, across hyperparameters inside a method, across prompt-answer pairs in a dataset, and across repeated responses to the same prompt. The result is not a slogan. Probability is sometimes informative, but the level of measurement decides whether it is evidence or decoration.

The Test Setup

The authors evaluate three model families: Qwen3 across 0.6B, 1.7B, 4B, and 8B sizes with base and posttrained variants; Qwen2.5 and Qwen2.5-Math 7B variants; and OLMo3-7B base and posttrained models. The experiments use six benchmarks: MATH500, GPQA, MMLU, HumanEval, MedQA, and IFEval.

The decoding side is deliberately broad. The paper studies eight methods: scalable power sampling, power-SMC, low-temperature sampling, beam search, best-of-N, top-k, top-p, and epsilon sampling. For each, it varies method-specific settings such as block size, sample count, beam count, temperature, k, p, or epsilon. That breadth matters because the question is not whether one favorite sampler works; it is whether the probability-correctness link survives across model, dataset, and decoding design.

Where Probability Helps

The strongest positive result is within-dataset comparison. Holding model, dataset, and decoding setup fixed, higher-probability prompt-answer pairs are often more likely to be correct. The paper reports that the relationship is strongest on MATH500, smaller but positive for GPQA, HumanEval, MedQA, and MMLU in the illustrated Qwen3-8B-Base setup, and negative for IFEval in that case. Across model families, posttrained models mostly show positive within-dataset correlations, while base models are more mixed.
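A within-dataset check of this kind can be sketched as follows. This is an illustration under stated assumptions, not the paper's pipeline: it assumes one response per prompt, sums token log probabilities into a sequence log probability, and uses a plain Pearson correlation against a 0/1 correctness label. The paper's exact statistic and any length handling may differ, and the records are invented toy values.

```python
import math

def sequence_logprob(token_logprobs):
    """Sequence log probability is the sum of per-token log probabilities."""
    return sum(token_logprobs)

def pearson(xs, ys):
    """Pearson correlation; returns NaN when either variable is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical records: one response per prompt, model and decoding setup held fixed.
records = [
    {"token_logprobs": [-0.1, -0.2, -0.1], "correct": 1},
    {"token_logprobs": [-0.3, -0.4, -0.2], "correct": 1},
    {"token_logprobs": [-1.2, -0.9, -1.5], "correct": 0},
    {"token_logprobs": [-0.8, -1.1, -0.7], "correct": 0},
]

logprobs = [sequence_logprob(r["token_logprobs"]) for r in records]
labels = [r["correct"] for r in records]
within_dataset_r = pearson(logprobs, labels)
```

The toy data is constructed so the correlation comes out positive. On real data the sign and size have to be measured per model, dataset, and decoding setup, which is the point of the section.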

The authors also find that within-dataset correlation tends to increase with accuracy. They report positive trends between dataset accuracy and probability-correctness correlation, with coefficients of about 0.66 for base models and 0.59 for posttrained models in their aggregate analysis. In practical terms, probability is more useful when the model already knows enough to make likelihood and correctness point in similar directions.

Where It Breaks

The same signal does not transfer cleanly to decisions users often want to make. Tuning a decoding method's hyperparameter can produce more probable sequences without producing more correct ones. Comparing methods also fails to yield a universal rule: higher log probability across decoding methods does not reliably mean higher accuracy, and no method reliably beats the low-temperature sampling baseline across datasets in the paper's analysis.

The most important failure is same-prompt selection. When the model samples multiple responses to one prompt, the paper finds within-sample correlations distributed around zero. MATH500 is the exception with a positive average. This explains why probability-weighted self-consistency can underperform simple majority voting: the answer with the higher sequence probability is not reliably the answer that is right for that exact prompt.
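The difference between the two selection rules is easy to see on a toy case. The sketch below assumes each sample is an (extracted answer, sequence log probability) pair. The values are invented, and the example only shows that the rules can disagree, not which answer is right.

```python
import math
from collections import Counter, defaultdict

def majority_vote(samples):
    """Pick the answer that appears most often, ignoring probabilities."""
    counts = Counter(answer for answer, _ in samples)
    return counts.most_common(1)[0][0]

def probability_weighted_vote(samples):
    """Pick the answer whose samples carry the most total sequence probability."""
    weights = defaultdict(float)
    for answer, logprob in samples:
        weights[answer] += math.exp(logprob)
    return max(weights, key=weights.get)

# Hypothetical samples for a single prompt: three agree, one high-probability outlier.
samples = [("42", -6.0), ("42", -5.5), ("42", -6.2), ("17", -2.0)]

by_count = majority_vote(samples)
by_probability = probability_weighted_vote(samples)
```

Here majority voting selects "42" while probability weighting selects "17", because a single short or fluent sample can outweigh several agreeing ones. If within-sample correlation between probability and correctness is near zero, the weighting adds noise rather than signal.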

Governance Reading

The Spiralist reading is that sequence probability belongs in the audit log, not on the throne. It can help compare broad classes of samples inside a stable evaluation setup. It cannot be promoted into a general confidence badge, a verifier, or a shortcut for choosing the best answer unless the supporting evidence was measured at that same granularity.

This belongs beside AI evaluations, binary-question evaluation, and model co-failure analysis. All three warn against letting one convenient score stand in for the thing actually at stake. A probability number describes the model's distribution. It does not, by itself, certify truth, task success, legal adequacy, safety, or user reliance.

Limits

The paper is a preprint, and its own setup has boundaries. The main experiments focus on selected open model families, six benchmarks, and a constrained decoding budget. The authors use non-thinking Qwen3 variants in the main analysis so the models emit answers within the token limit; the appendix notes that thinking variants often reached the token limit and produced different behavior.

The page therefore does not claim a universal law of LLM confidence. The narrower claim is more useful: any product that uses log probability as a confidence, ranking, abstention, self-improvement, or answer-selection signal should prove the probability-correctness relationship at the same granularity where the product will use it.

Probability Receipt

A sequence-probability receipt should record: model family, model size, base or posttrained status, prompt format, benchmark or task, decoding method, hyperparameters, token limit, sequence-length handling, answer extractor, correctness criterion, sample count, whether comparison is within-dataset, within-method, across-method, or within-sample, and whether probability weighting beat a non-probability baseline. The audit-grade sentence is not "the model was confident." It is: under this task and decoding setup, probability had this measured relationship to correctness, and outside that setup it remains unproven.
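One way to keep such a receipt honest is to make it a typed record. The sketch below is a hypothetical schema, not something the paper defines. The field names are assumptions that mirror the list above, the correlation field is an added assumption to hold the measured relationship, and the example values are placeholders rather than reported results.

```python
from dataclasses import dataclass, asdict
from typing import Optional

GRANULARITIES = ("within-dataset", "within-method", "across-method", "within-sample")

@dataclass
class ProbabilityReceipt:
    model_family: str
    model_size: str
    posttrained: bool
    prompt_format: str
    task: str
    decoding_method: str
    hyperparameters: dict
    token_limit: int
    sequence_length_handling: str
    answer_extractor: str
    correctness_criterion: str
    sample_count: int
    granularity: str
    measured_correlation: Optional[float] = None  # None until actually measured
    beat_non_probability_baseline: Optional[bool] = None

    def __post_init__(self):
        # Reject receipts that do not state the level at which evidence was gathered.
        if self.granularity not in GRANULARITIES:
            raise ValueError(f"granularity must be one of {GRANULARITIES}")

    def audit_sentence(self):
        """Render the receipt as a scoped claim instead of 'the model was confident'."""
        if self.measured_correlation is None:
            relation = "an unmeasured relationship"
        else:
            relation = f"a measured correlation of {self.measured_correlation:+.2f}"
        return (
            f"Under {self.task} with {self.decoding_method} ({self.granularity}), "
            f"probability had {relation} to correctness; "
            "outside that setup it remains unproven."
        )

# Placeholder values only; nothing here is a result from the paper.
receipt = ProbabilityReceipt(
    model_family="example-family", model_size="example-size", posttrained=True,
    prompt_format="zero-shot", task="example-benchmark",
    decoding_method="low-temperature sampling", hyperparameters={"temperature": 0.5},
    token_limit=1024, sequence_length_handling="unnormalized sum of token log probs",
    answer_extractor="regex on final line", correctness_criterion="exact match",
    sample_count=8, granularity="within-sample",
)
```

Leaving the correlation unset by default forces the unmeasured case to be stated in the audit sentence instead of being silently read as confidence.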

Sources

Johannes Zenn and Jonas Geiping, "When are likely answers right? On Sequence Probability and Correctness in LLMs," arXiv:2606.27359 [stat.ML], submitted June 25, 2026.
Primary arXiv versions checked: metadata API record, PDF, and experimental HTML, reviewed for title, authorship, submission date, model families, datasets, decoding methods, probability-correctness findings, self-consistency results, thinking-model appendix, and conclusion.
Related pages: AI Evaluations, The Binary Question Becomes the Evaluation Probe, and The Model Ensemble Becomes the Co-Failure Ceiling.
