Last reviewed June 25, 2026

The Binary Question Becomes the Evaluation Probe

Sangwoo Cho, Kushal Chawla, Pengshan Cai, Zefang Liu, Chenyang Zhu, Shi-Xiong Zhang, and Sambit Sahu's June 2026 arXiv paper argues that an LLM judge should not only emit a score. It should expose the yes-or-no questions that made the score possible.

From Judge to Questions

The paper, arXiv:2606.27226 [cs.AI], is titled "Ask, Don't Judge: Binary Questions for Interpretable LLM Evaluation and Self-Improvement." arXiv lists the seven authors named above and records version 1 on June 25, 2026. The arXiv comment says the paper was accepted to the Second Workshop on Compositional Learning at ICML 2026 in Seoul.

This is a fresh companion to the LLM-judge annotation-budget essay, the evaluation-schema essay, and the LLM-as-a-judge reference. Those pages ask when automated evaluation is reliable or properly documented. This paper asks how to make the act of judging decomposable enough to inspect.

The problem is familiar: human evaluation is slow, lexical metrics miss open-ended quality, and holistic LLM judges can give confident scores that are hard to debug. BINEVAL's answer is to turn a scoring rubric into many atomic binary questions. The model does not simply say "4 out of 5." It answers smaller questions whose failures can be read.

What BINEVAL Does

BINEVAL has three main components. A meta-prompt turns a task prompt into fine-grained binary evaluation questions. An LLM answers those questions independently for a candidate output. The resulting yes-or-no verdicts are aggregated into interpretable, multidimensional scores.
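
The three stages can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the class and function names, the hard-coded questions, the dimension labels, and the fraction-of-yes aggregation rule are all assumptions made for the example, and the LLM calls are replaced by stand-in functions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class BinaryQuestion:
    dimension: str  # illustrative dimension label, e.g. "consistency"
    text: str       # yes/no question where "yes" means the criterion is met


def generate_questions(task_prompt: str) -> List[BinaryQuestion]:
    """Stand-in for the meta-prompt step; a real system would call an LLM here."""
    return [
        BinaryQuestion("consistency", "Does the summary avoid contradicting the source?"),
        BinaryQuestion("consistency", "Is every named entity attributed correctly?"),
        BinaryQuestion("relevance", "Does the summary cover the source's central claim?"),
    ]


def aggregate(questions: List[BinaryQuestion], verdicts: List[bool]) -> Dict[str, float]:
    """Assumed rule: per-dimension score is the fraction of 'yes' verdicts."""
    totals: Dict[str, int] = {}
    passed: Dict[str, int] = {}
    for question, verdict in zip(questions, verdicts):
        totals[question.dimension] = totals.get(question.dimension, 0) + 1
        passed[question.dimension] = passed.get(question.dimension, 0) + int(verdict)
    return {dim: passed[dim] / totals[dim] for dim in totals}


def evaluate(task_prompt: str, candidate: str,
             answer_fn: Callable[[str, str], bool]) -> Dict[str, float]:
    """Generate questions, answer each one independently, then aggregate."""
    questions = generate_questions(task_prompt)
    verdicts = [answer_fn(q.text, candidate) for q in questions]
    return aggregate(questions, verdicts)


def toy_answerer(question: str, candidate: str) -> bool:
    """Toy answer model: says 'no' only to the attribution question."""
    return "attributed" not in question


scores = evaluate("Summarize the article.", "A candidate summary.", toy_answerer)
```

With the toy answerer, one of the two consistency questions fails, so the per-dimension scores separate a factual problem from an otherwise relevant summary.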

That structure changes what an evaluation artifact is. A summary score becomes a trace of probes: did the answer preserve the central claim, omit a required fact, introduce a contradiction, stay on topic, or satisfy a task constraint? The paper emphasizes that the same task-agnostic meta-prompt can generate questions for summarization, dialogue, factual consistency, and instruction-following settings.

The governance value is not that binary questions are automatically right. It is that they create a place to inspect the evaluation. A bad holistic score can only be disputed at the level of vibes or calibration. A bad question can be challenged directly: it may be irrelevant, redundant, too strict, missing an important criterion, or biased toward a particular style.

Benchmark Evidence

The paper evaluates BINEVAL on SummEval, Topical-Chat, and QAGS. It compares against strong baselines including UniEval and G-Eval, with runs using Claude and gpt-oss backbones. The authors report that BINEVAL matches or outperforms those baselines overall, with especially strong results on factual consistency benchmarks such as QAGS.

The score-distribution result is as important as the correlation result. The paper says BINEVAL better matches human score distributions and avoids ceiling effects common in prior LLM judges. In practical terms, that means the evaluator is less likely to compress many outputs near the top and better able to discriminate between borderline and clearly flawed work.
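
One way to make "ceiling effect" and "distribution match" checkable is sketched below. The two measures (share of scores at the maximum, and total variation distance between score histograms) are assumptions chosen for illustration, not metrics the paper is known to report, and the score lists are toy numbers rather than benchmark data.

```python
from collections import Counter
from typing import Sequence


def ceiling_share(scores: Sequence[int], max_score: int) -> float:
    """Fraction of outputs that received the maximum possible score."""
    return sum(1 for s in scores if s == max_score) / len(scores)


def tv_distance(a: Sequence[int], b: Sequence[int]) -> float:
    """Total variation distance between two empirical score distributions."""
    count_a, count_b = Counter(a), Counter(b)
    keys = set(count_a) | set(count_b)
    return 0.5 * sum(abs(count_a[k] / len(a) - count_b[k] / len(b)) for k in keys)


# Toy numbers only: a judge that piles scores at the top versus human ratings.
human = [1, 2, 2, 3, 3, 4, 4, 5]
holistic_judge = [4, 4, 4, 5, 5, 5, 5, 5]

judge_ceiling = ceiling_share(holistic_judge, max_score=5)
judge_gap = tv_distance(human, holistic_judge)
```

A judge with a high ceiling share and a large distance from the human histogram is compressing outputs near the top, whatever its rank correlation looks like.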

The SummEval examples make the mechanism concrete. A plausible summary can receive a high holistic score while still containing a subtle factual error. A decomposed evaluator can ask separate factual questions and catch the misattribution, fabricated detail, or missing scope condition that a single overall score hides.

Prompt Improvement

BINEVAL also turns evaluation feedback into prompt-improvement evidence. The paper tests evaluator prompt optimization on SummEval and generation prompt optimization on IFBench. It reports improvements under both self-update settings, where the same model helps revise the prompt, and cross-model update settings, where disagreements between models identify useful lessons.

This is where the method becomes more than a grading tool. The question-level trace tells developers which part of the rubric or generation prompt failed. A prompt may be too strict about short summaries, too lenient about relevance, or confused about whether omission counts as contradiction. Those are repairable defects when the evidence is a list of failed probes rather than a single number.
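
A question-level trace can be turned into repair evidence with very little machinery. The sketch below assumes each trace is a mapping from question text to a yes/no verdict and that a question failing at or above some threshold rate deserves attention; the function name, the trace format, and the threshold are hypothetical, not taken from the paper.

```python
from collections import Counter
from typing import Dict, List, Tuple


def failure_report(traces: List[Dict[str, bool]],
                   min_rate: float = 0.5) -> List[Tuple[str, float]]:
    """Return questions whose failure rate is at least min_rate, worst first."""
    asked: Counter = Counter()
    failed: Counter = Counter()
    for trace in traces:
        for question, verdict in trace.items():
            asked[question] += 1
            if not verdict:
                failed[question] += 1
    rates = {q: failed[q] / asked[q] for q in asked}
    flagged = [(q, r) for q, r in rates.items() if r >= min_rate]
    return sorted(flagged, key=lambda item: -item[1])


# Toy traces for three candidate summaries.
traces = [
    {"Is the summary under the length limit?": True, "Is every fact supported?": False},
    {"Is the summary under the length limit?": True, "Is every fact supported?": False},
    {"Is the summary under the length limit?": False, "Is every fact supported?": True},
]
report = failure_report(traces)
```

Here only the factual-support question crosses the threshold, which points a prompt revision at factual grounding rather than length.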

For Spiralism, the deeper point is that self-improvement should mean traceable local repair, not mystical capability growth. A system improves because a visible diagnostic object names a failure mode and changes a prompt. The improvement record should say which questions failed, which lessons were retained, which prompt changed, and which held-out test survived.

Limits That Matter

The method depends on question quality. If the generated questions omit a critical criterion, reward the wrong behavior, or over-decompose a subjective dimension, the final score can look transparent while still being wrong. The paper's appendix includes a relevance failure case where over-specific decomposition hurts alignment with human judgment.

BINEVAL also increases evaluation work. It must generate questions, answer each one, aggregate verdicts, and in prompt-update settings extract lessons and rewrite prompts. That can add token cost, latency, and another layer of prompt dependence. The paper frames the method as training-free, not compute-free.
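
The added work is easy to estimate under simple assumptions. The sketch below assumes one LLM call per question per candidate and a single question-generation call per task that is reused across candidates; neither assumption is confirmed by the paper, and batching several questions into one call would lower the count.

```python
def holistic_calls(n_candidates: int) -> int:
    """A holistic judge makes one call per candidate output."""
    return n_candidates


def binary_question_calls(n_candidates: int, n_questions: int,
                          generation_calls: int = 1) -> int:
    """Question generation once per task, then one call per question per candidate."""
    return generation_calls + n_candidates * n_questions


# Example: 100 candidate outputs scored with 20 generated questions.
baseline = holistic_calls(100)
decomposed = binary_question_calls(100, 20)
```

Under these assumptions the decomposed evaluator makes 2,001 calls where the holistic judge makes 100, before any prompt-update steps are counted.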

The largest governance risk is false objectivity. A list of yes-or-no questions can look more rigorous than a holistic score, but the questions are still produced by a model and a meta-prompt. They encode a rubric. They choose what counts. They can miss context, audience, consent, source quality, or downstream harm.

Governance Standard

Any binary-question evaluator used in consequential work should publish an evaluation record: task prompt, meta-prompt, generated questions, answer model, aggregation rule, calibration method, benchmark set, human reference data, prompt-update procedure, held-out tests, cost and latency, and examples of known failure modes.
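
The record described above could be carried as a small typed structure with a completeness check. The field names below simply mirror the list in the paragraph; the types, the class name, and the rule that an empty value counts as missing are assumptions for the sketch.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class EvaluationRecord:
    task_prompt: str = ""
    meta_prompt: str = ""
    generated_questions: List[str] = field(default_factory=list)
    answer_model: str = ""
    aggregation_rule: str = ""
    calibration_method: str = ""
    benchmark_set: str = ""
    human_reference_data: str = ""
    prompt_update_procedure: str = ""
    held_out_tests: List[str] = field(default_factory=list)
    cost_and_latency: str = ""
    known_failure_modes: List[str] = field(default_factory=list)


def missing_fields(record: EvaluationRecord) -> List[str]:
    """Names of fields left empty, i.e. parts of the record not yet published."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


draft = EvaluationRecord(task_prompt="Summarize the article.",
                         meta_prompt="Turn the task into yes/no questions.")
gaps = missing_fields(draft)
```

A publishing step could refuse to release scores while the list of gaps is non-empty.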

The key discipline is to treat the question list as the audit object. A score should not travel alone. If an evaluator rates a candidate answer, the receiving system should be able to see which probes passed, which failed, which were unavailable, and which criteria were never asked.
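
A score that does not travel alone might look like the following sketch. The status names, the tuple layout, and the idea of comparing asked criteria against a declared list of required criteria are illustrative assumptions, not a format defined by the paper.

```python
from enum import Enum
from typing import Dict, List, Sequence, Tuple


class ProbeStatus(Enum):
    PASSED = "passed"
    FAILED = "failed"
    UNAVAILABLE = "unavailable"  # e.g. the answer model returned no usable verdict


def audit_view(probes: Sequence[Tuple[str, str, ProbeStatus]],
               required_criteria: Sequence[str]) -> Dict[str, List[str]]:
    """Group questions by status and list required criteria no probe covered."""
    view: Dict[str, List[str]] = {status.value: [] for status in ProbeStatus}
    covered = set()
    for criterion, question, status in probes:
        view[status.value].append(question)
        covered.add(criterion)
    view["never_asked"] = sorted(set(required_criteria) - covered)
    return view


probes = [
    ("consistency", "Is every fact supported by the source?", ProbeStatus.FAILED),
    ("relevance", "Does the summary cover the central claim?", ProbeStatus.PASSED),
    ("fluency", "Is the summary grammatical?", ProbeStatus.UNAVAILABLE),
]
view = audit_view(probes, ["consistency", "relevance", "fluency", "coherence"])
```

In this toy case the receiving system can see that coherence was required but never probed, which a bare score would not reveal.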

When the binary question becomes the evaluation probe, the probe must remain contestable. Otherwise decomposed evaluation becomes merely a more elaborate way to hide judgment inside a machine-readable ritual.
