Last reviewed July 2, 2026

The Reflection Score Becomes the Verifier

RefGRPO starts from a small but important failure: an agent can see concrete environment feedback and still misjudge whether it solved the task. The paper treats post-feedback reflection as something to train, measure, and govern.

The Paper

The paper is "Closing the Reflection Gap: A Free Calibration Bonus for Agentic RL" (arXiv:2606.14211 [cs.AI, cs.LG]) by Yinglun Zhu. arXiv lists version 1 as submitted on June 12, 2026, with DOI 10.48550/arXiv.2606.14211. The arXiv HTML lists Zhu's affiliation as University of California, Riverside.

The problem is not ordinary confidence calibration over a generated answer. The agent acts in an environment, receives concrete feedback such as SQL execution results, error messages, or tool outputs, and then reports whether it believes the action completed the task. The paper asks why that post-feedback reflection remains unreliable and how to train it without adding a reward model, judge, or annotation channel.

The Reflection Gap

The interaction has three stages. First, the agent takes an action after reasoning. Second, the environment executes that action and returns an observation. Third, the agent reflects on the action and observation and emits a binary reflection score: solved or not solved. After the episode, the environment outcome reward is used for reinforcement learning, but it is not shown to the agent during the interaction.
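The three stages can be sketched as a small episode loop. Everything named here (the `Episode` record and the `act`, `execute`, `reflect`, and `judge` callables) is hypothetical scaffolding for illustration, not the paper's code. The point is only that the outcome is computed outside the agent's view.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Episode:
    action: str
    observation: str
    reflection: bool  # agent's binary self-assessment: True means "solved"
    outcome: bool     # environment reward; never shown to the agent


def run_episode(
    task: str,
    act: Callable[[str], str],
    execute: Callable[[str], str],
    reflect: Callable[[str, str, str], bool],
    judge: Callable[[str, str], bool],
) -> Episode:
    action = act(task)                               # stage 1: act
    observation = execute(action)                    # stage 2: environment feedback
    reflection = reflect(task, action, observation)  # stage 3: self-assessment
    outcome = judge(task, action)                    # after the episode, for RL only
    return Episode(action, observation, reflection, outcome)


# Toy stand-ins (assumptions) so the sketch runs end to end.
episode = run_episode(
    task="count users",
    act=lambda task: "SELECT COUNT(*) FROM users",
    execute=lambda sql: "[(42,)]",
    reflect=lambda task, sql, obs: not obs.startswith("ERROR"),
    judge=lambda task, sql: sql == "SELECT COUNT(*) FROM users",
)
```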

The reflection gap is the difference between having observed feedback and correctly judging what that feedback means. The paper finds a persistent underconfidence mode: agents often flag their own answers as wrong even when they are actually correct. Standard outcome-only RL improves task solving, but it barely repairs this post-feedback self-assessment.

The reason is a credit-assignment mismatch. Outcome-only RL gives negative advantage to an incorrect rollout even when the agent honestly flags it as wrong, so it can train the model to suppress useful error flags. That makes the model a stronger solver while leaving its self-verification channel damaged.
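A toy group makes the mismatch concrete. The sketch below assumes the common group-normalized advantage, (reward minus group mean) divided by group standard deviation. The population standard deviation, the `eps` constant, and the four example rollouts are assumptions for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rollouts as (outcome, reflection), with 1 meaning "solved".
rollouts = [
    (1, 1),  # correct, flagged correct
    (1, 0),  # correct, flagged wrong (underconfident)
    (0, 0),  # wrong, honestly flagged wrong
    (0, 1),  # wrong, flagged correct (overconfident)
]

# Outcome-only reward ignores the reflection entirely.
adv = group_advantages([float(outcome) for outcome, _ in rollouts])
```

Under these assumptions the honest failure and the overconfident failure receive the same negative advantage, so the honest error flag earns nothing.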

RefGRPO

RefGRPO extends the paper's GRPO+ baseline with a free calibration bonus and a schedule for its coefficient. The bonus is "free" in the narrow training sense: both the binary reflection score and final outcome are already present during RL training, so the method can contrast reflection with outcome without a separate reward model, LLM judge, or external annotation.

The bonus lifts well-calibrated rollouts above miscalibrated rollouts before group normalization. Honest reflection should get positive relative advantage even when the task outcome is bad, and dishonest or confused reflection should be penalized even when the answer happens to succeed.
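One way to picture this is an additive bonus applied before normalization. The sketch assumes the shaped reward is the outcome plus a coefficient times an indicator that reflection matches outcome. That functional form, the function name, and the coefficient value 2.0 are illustrative assumptions. This post does not reproduce the paper's exact formula.

```python
from statistics import mean, pstdev


def bonus_advantages(outcomes, reflections, coeff, eps=1e-6):
    """Add a calibration bonus to each reward, then group-normalize."""
    shaped = [
        o + coeff * (1.0 if o == f else 0.0)  # bonus only when reflection matches outcome
        for o, f in zip(outcomes, reflections)
    ]
    mu, sigma = mean(shaped), pstdev(shaped)
    return [(s - mu) / (sigma + eps) for s in shaped]


# Same hypothetical group: correct+confident, correct+underconfident,
# wrong+honest, wrong+overconfident.
outcomes = [1, 1, 0, 0]
reflections = [1, 0, 0, 1]
adv = bonus_advantages(outcomes, reflections, coeff=2.0)
```

With a coefficient this large, the honest failure ends up with a positive advantage and the underconfident success with a negative one. With a small coefficient, only the ordering within each outcome class changes.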

The dynamic schedule matters. A large coefficient early front-loads calibration, then the coefficient decays so task performance can continue improving. The paper's ablations show that a fixed calibration coefficient can improve reflection at the cost of task accuracy, while the dynamic schedule better preserves both.
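A decaying coefficient can be as simple as the linear schedule below. The linear shape, the function name, and the default start and end values are assumptions for illustration. The post only establishes that the coefficient starts large and decays.

```python
def bonus_coefficient(step, total_steps, start=2.0, end=0.0):
    """Linearly decay the calibration coefficient from `start` to `end`."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return start + (end - start) * frac


# Coefficient at the start, midpoint, and end of a hypothetical 100-step run.
schedule = [bonus_coefficient(s, 100) for s in (0, 50, 100)]
```

A fixed coefficient corresponds to `start == end`, which is the ablation setting the paper reports as trading task accuracy for reflection quality.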

Metrics

The evaluation reports task accuracy, reflection accuracy, overconfidence rate, underconfidence rate, and Chow score. Overconfidence is the fraction of answers flagged correct that are actually wrong. Underconfidence is the fraction of answers flagged wrong that are actually correct. Both should be low.
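These definitions translate directly into code. The sketch follows the wording above, so overconfidence is conditioned on answers flagged correct and underconfidence on answers flagged wrong. Returning 0.0 when a denominator is empty is a convention chosen here, not something the paper specifies.

```python
def reflection_metrics(outcomes, reflections):
    """Compute task and reflection metrics from paired booleans.

    outcomes[i]    -- True if the answer was actually correct
    reflections[i] -- True if the agent flagged its answer as correct
    """
    n = len(outcomes)
    if n == 0 or n != len(reflections):
        raise ValueError("need two non-empty sequences of equal length")

    pairs = list(zip(outcomes, reflections))
    flagged_correct = [o for o, f in pairs if f]
    flagged_wrong = [o for o, f in pairs if not f]

    def rate(count, total):
        return count / total if total else 0.0

    return {
        "task_accuracy": sum(outcomes) / n,
        "reflection_accuracy": sum(o == f for o, f in pairs) / n,
        "overconfidence": rate(sum(not o for o in flagged_correct), len(flagged_correct)),
        "underconfidence": rate(sum(bool(o) for o in flagged_wrong), len(flagged_wrong)),
    }


metrics = reflection_metrics(
    outcomes=[True, True, False, False, True],
    reflections=[True, False, False, True, True],
)
```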

Chow score adapts the classical reject-option framing to agents. The agent receives ordinary correctness credit when it commits, and partial credit for self-flagged errors. An always-commit model collapses to task accuracy; a better-calibrated model rises above that baseline because its refusal or error flag is informative.
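A minimal sketch of a Chow-style score, assuming the classical reject-option form: full credit for a committed correct answer, zero for a committed wrong answer, and a fixed partial credit whenever the agent flags its answer as not solved. The `flag_credit` value of 0.5 is a placeholder, and the paper's exact scoring rule may differ, for example by granting the credit only when the flagged answer is actually wrong.

```python
def chow_score(outcomes, reflections, flag_credit=0.5):
    """Average credit under a reject-option rule (assumed form, see lead-in)."""
    if not outcomes or len(outcomes) != len(reflections):
        raise ValueError("need two non-empty sequences of equal length")
    total = 0.0
    for solved, committed in zip(outcomes, reflections):
        if committed:
            total += 1.0 if solved else 0.0  # ordinary correctness credit
        else:
            total += flag_credit             # partial credit for flagging
    return total / len(outcomes)


outcomes = [True, False, True, True]
always_commit = chow_score(outcomes, [True, True, True, True])
calibrated = chow_score(outcomes, [True, False, True, True])
```

In this toy case the always-commit score equals task accuracy (0.75), and the calibrated agent scores higher because its single flag lands on the one wrong answer.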

Experiments

The task environment is text-to-SQL because it has concrete feedback and verifiable binary outcomes. Training uses 4,660 text-to-SQL problems from Spider and OmniSQL. Evaluation covers five benchmarks: Spider-Dev, Spider-DK, Spider-Realistic, Spider-Test, and Bird-Dev.

The model set includes Qwen2.5-Coder-3B-Instruct, Qwen2.5-Coder-7B-Instruct, and Llama-3.2-3B-Instruct. Baselines include GRPO+, plus external 7B SQL specialists OmniSQL-7B and SQL-R1-7B for comparison. GRPO+ and RefGRPO are trained for 6 epochs with shared hyperparameters, including batch size 128, maximum response length 3000 per turn, AdamW, temperature 1.0, and 8 rollouts.

In single-turn Qwen2.5-Coder-3B results, RefGRPO reaches 70.2 percent task accuracy, 79.7 percent reflection accuracy, 23.1 percent overconfidence, 1.3 percent underconfidence, and 71.1 Chow score. The GRPO+ baseline is 69.4, 76.6, 24.6, 23.7, and 68.5 on the same metrics. The main signal is underconfidence collapsing from 23.7 percent to 1.3 percent while task accuracy also rises.

In multi-turn Qwen2.5-Coder-7B results, RefGRPO reaches 76.5 percent task accuracy, 77.8 percent reflection accuracy, 22.4 percent overconfidence, 7.7 percent underconfidence, and 76.5 Chow score. GRPO+ reaches 75.1, 76.2, 22.8, 44.4, and 73.0. The headline result is the underconfidence drop from 44.4 percent to 7.7 percent while task accuracy rises from 75.1 percent to 76.5 percent across the five benchmarks.

Self-Improvement

The second claim is operational. Once reflection is calibrated, it can act as a pseudo-reward. The paper continues RL training from intermediate checkpoints while replacing the outcome reward with the model's own reflection score and masking reflection tokens so gradients land on task tokens.
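The two ingredients, a self-assigned reward and a mask over reflection tokens, can be sketched as follows. The token role labels, function names, and the simplified policy-gradient surrogate are all assumptions for illustration. The paper's actual objective is a GRPO variant whose details are not reproduced here.

```python
def self_reward_and_mask(token_roles, reflection_flag):
    """Use the agent's own reflection as the reward and mask reflection tokens.

    token_roles     -- one label per generated token: "task" or "reflection"
    reflection_flag -- True if the agent flagged its answer as solved
    """
    reward = 1.0 if reflection_flag else 0.0
    loss_mask = [1 if role == "task" else 0 for role in token_roles]
    return reward, loss_mask


def masked_policy_loss(logprobs, advantage, loss_mask):
    """Simplified surrogate loss averaged over unmasked (task) tokens only."""
    kept = sum(loss_mask)
    if kept == 0:
        return 0.0
    weighted = sum(lp * advantage * m for lp, m in zip(logprobs, loss_mask))
    return -weighted / kept


reward, mask = self_reward_and_mask(["task", "task", "reflection"], True)
loss = masked_policy_loss([-1.0, -2.0, -5.0], advantage=1.0, loss_mask=mask)
```

Because the reflection token is masked, its log-probability (-5.0 in the toy input) contributes nothing to the loss, so the model cannot raise its reward simply by learning to emit "solved".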

Starting from a RefGRPO checkpoint, this self-improvement phase raises average single-turn task accuracy from 67.1 percent to 69.9 percent. Starting from a GRPO+ checkpoint, it only moves from 68.3 percent to 68.8 percent. The explanation is that GRPO+ quickly commits to almost everything, making self-rewards nearly constant and weak; RefGRPO keeps a more informative commit rate.
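The weak-signal explanation can be illustrated with group normalization. If every rollout in a group receives the same self-reward, every advantage is zero and the update carries no information. The sketch assumes a mean and standard-deviation normalization with population variance, which may differ from the paper's exact estimator. The group size of 8 matches the rollout count reported above, and the reward vectors are made up.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# An agent that commits to everything rewards itself 1.0 on all 8 rollouts.
always_commit = group_advantages([1.0] * 8)

# An agent with an informative commit rate produces a mixed self-reward.
informative = group_advantages([1.0] * 5 + [0.0] * 3)
```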

The same calibration helps test-time selective prediction. With multiple rollouts, the agent can commit only to rollouts it flags as correct. In the appendix table, RefGRPO has higher average selective-prediction accuracy than GRPO+ in both single-turn and multi-turn settings.
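A minimal version of that commit rule is sketched below. The function names and the definition of selective accuracy as accuracy over committed rollouts are assumptions. The paper's appendix may define the metric differently.

```python
from typing import Optional, Sequence, Tuple


def select_answer(candidates: Sequence[Tuple[str, bool]]) -> Optional[str]:
    """Return the first answer the agent flagged as solved, else abstain (None)."""
    for answer, flagged_solved in candidates:
        if flagged_solved:
            return answer
    return None


def selective_accuracy(rollouts: Sequence[Tuple[bool, bool]]) -> Optional[float]:
    """Accuracy over committed rollouts only; None if nothing was committed.

    Each rollout is (actually_solved, flagged_solved).
    """
    committed = [solved for solved, flagged_solved in rollouts if flagged_solved]
    if not committed:
        return None
    return sum(committed) / len(committed)


picked = select_answer([("SELECT a FROM t", False), ("SELECT b FROM t", True)])
accuracy = selective_accuracy([(True, True), (False, True), (True, False)])
```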

Governance Standard

A reflection-trained agent should ship a reflection calibration receipt. The receipt should include the base model, RL algorithm, calibration coefficient schedule, training data sources, task family, environment version, verifier or outcome rule, observation schema, reflection prompt, valid-reflection parser, task accuracy, reflection accuracy, overconfidence rate, underconfidence rate, Chow score, error-flag credit, commit rate, self-improvement protocol, selective-prediction protocol, rollout count, decoding temperature, and per-benchmark breakdown.
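As a sketch, the receipt can be enforced with a simple completeness check. The field names below are one possible snake_case rendering of the list above, and the helper is hypothetical. Nothing here is a published schema.

```python
REQUIRED_FIELDS = (
    "base_model", "rl_algorithm", "calibration_coefficient_schedule",
    "training_data_sources", "task_family", "environment_version",
    "outcome_rule", "observation_schema", "reflection_prompt",
    "valid_reflection_parser", "task_accuracy", "reflection_accuracy",
    "overconfidence_rate", "underconfidence_rate", "chow_score",
    "error_flag_credit", "commit_rate", "self_improvement_protocol",
    "selective_prediction_protocol", "rollout_count",
    "decoding_temperature", "per_benchmark_breakdown",
)


def missing_receipt_fields(receipt: dict) -> list:
    """List required fields that are absent, None, or an empty string."""
    return [
        name for name in REQUIRED_FIELDS
        if name not in receipt or receipt[name] is None or receipt[name] == ""
    ]


# Partial receipt using values reported in this post; all other fields are missing.
receipt = {
    "base_model": "Qwen2.5-Coder-3B-Instruct",
    "rl_algorithm": "RefGRPO",
    "task_family": "text-to-SQL",
    "rollout_count": 8,
    "decoding_temperature": 1.0,
}
missing = missing_receipt_fields(receipt)
```

A release gate could refuse to publish a checkpoint while `missing` is non-empty.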

The receipt should also distinguish three roles that are easy to collapse. The environment outcome is the training signal. The reflection score is the learned self-verification signal. The deployment decision is a policy choice. A calibrated reflection can help route, abstain, sample, or self-improve, but it should not become final authority without an external consequence model and audit trail.

This connects directly to AI Agents, Confidence Calibration, RLHF, Reward Models, Reward Hacking, Reinforcement Learning with Verifiable Rewards, AI Evaluations, Tool Use and Function Calling, "The Verifier Horizon Becomes the Agent Reward", "The Visible Reward Becomes the Training Target", "The Ground Truth Gap Becomes the Reward Loop", "The Model's Own Answer Becomes the Confidence Bias", "The Sequence Probability Becomes the Confidence Trap", and "The Chain of Thought Gain Becomes the Agent Policy". Self-verification becomes governable only when the reward source, reflection channel, and abstention rule are separately logged.

Limits

The paper names two explicit limits. The largest trained model is 7B scale, and the reflection signal is binary rather than a real-valued confidence score. Both matter: larger agents may have different calibration behavior, and real deployments often need graded confidence, not only solved versus not solved.

The domain is also narrow by design. Text-to-SQL has clean executable feedback and binary outcome rewards. RefGRPO is most persuasive where environment feedback is concrete and outcome supervision is reliable during training. Domains with noisy tool output, partial credit, delayed consequences, subjective quality, adversarial observations, or safety-critical side effects need stronger validation.

The "free" bonus is free of extra annotation, not free of ground truth. The method still needs actual outcomes during RL training to contrast with reflections. At deployment time, a reflection score is a learned proxy. It can reduce blind confidence, but it can also become another target for reward gaming if downstream systems treat it as proof rather than evidence.

At review time, I found the arXiv record, HTML, PDF, and indexing/social pages, but no official code repository linked from the arXiv record. That makes the numerical claims paper-reported rather than locally reproduced.
